Int'l Conf. Grid Computing and Applications | GCA'08 |
SESSION: GRID SERVICES, SCHEDULING, AND RESOURCE MANAGEMENT + RELATED ISSUES
Chair(s): TBA
Network-aware Peer-to-Peer Based Grid Inter-Domain Scheduling
Agustín Caminero 1,∗, Omer Rana 2, Blanca Caminero 1, Carmen Carrión 1
1 Albacete Research Institute of Informatics, University of Castilla La Mancha, Albacete (SPAIN)
2 Cardiff School of Computer Science, Cardiff University, Cardiff (UK)
∗ Corresponding author ([email protected])
Abstract. Grid technologies have enabled the aggregation of geographically distributed resources, in the context of a particular application. The network remains an important requirement for any Grid application, as entities involved in a Grid system (such as users, services, and data) need to communicate with each other over a network. The performance of the network must therefore be considered when carrying out tasks such as scheduling, migration or monitoring of jobs. Moreover, the interactions between different domains are a key issue in Grid computing, so their effects should be considered when performing the scheduling task. In this paper, we enhance an existing framework that provides scheduling of jobs to computing resources to allow multi-domain scheduling based on peer-to-peer techniques. Keywords: Grid computing, inter-domain scheduling, peer-to-peer, network
I. Introduction

Grid computing enables the aggregation of dispersed heterogeneous resources for supporting large-scale parallel applications in science, engineering and commerce [12]. Current Grid systems are highly variable environments, made of a series of independent organizations that share their resources, creating what is known as Virtual Organizations (VOs) [13]. This variability makes Quality of Service (QoS) highly desirable, though often very difficult to achieve in practice [21]. One of the reasons for this limitation is the lack of control over the network that connects various components of a Grid system. Achieving end-to-end QoS is often difficult, as without resource reservation any guarantees on QoS are hard to satisfy. However, for applications that need a timely response (such as collaborative visualization [15]), the Grid must provide users with some kind of assurance about the use of resources – a non-trivial subject when viewed in the context of network QoS. In a VO, entities communicate with each other using an interconnection network – resulting in the network playing an essential role in Grid systems [21]. As a VO is made of different organizations (or domains), the interactions between different domains become important when executing jobs. Hence, a user
Fig. 1. Several administrative domains.
wishing to execute a job with particular QoS constraints (such as response time) may contact a resource broker to discover suitable resources – which would need to look across multiple domains if local resources cannot be found. Metrics related to network QoS (such as latency, bandwidth, packet loss and packet jitter) are important when performing scheduling of jobs to computing resources – in addition to the capabilities of the computing resources themselves. As mentioned above, the lack of suitable local (in the user's administrative domain) resources requires access to those from a different domain to run a job. However, the connectivity between the two domains now becomes important, and is the main emphasis of this work. Figure 1 depicts a number of administrative domains connected with each other by means of network connections. Each connection between two peers has an effective bandwidth, whose calculation will be explained in this paper. Each pair of neighbor peers may have different network paths linking them, so we rely on networking protocols such as the Border Gateway Protocol (BGP) [20] to decide the optimal path between two destination networks. The main contribution of this paper is a proposal for inter-domain scheduling which makes use of techniques from Peer-to-Peer (P2P) systems. Also, an analytical evaluation has been performed showing the behavior of our proposal under normal network and computing resource workloads. This paper is structured as follows: Section II explains current proposals on network QoS in Grid computing and the lack of attention paid to
inter-domain scheduling. Also, existing proposals for inter-domain scheduling are reviewed. Section III explains our proposal of inter-domain scheduling. Section IV provides an evaluation, demonstrating the usefulness of our work, and Section V outlines our future work.

II. Related work

The proposed architecture supports the effective management of network QoS in a Grid system, and focuses on the interactions between administrative domains when performing the scheduling of jobs to computing resources. P2P techniques are used to decide which neighboring domain a query should be forwarded to, in the absence of suitable local resources. We will first provide a brief overview of existing proposals for managing network QoS in Grids. General-purpose Architecture for Reservation and Allocation (GARA) [21] provides programmers and users with convenient access to end-to-end QoS for computer applications. It provides mechanisms for making QoS reservations for different types of resources, including computers, networks, and disks. These uniform mechanisms are integrated into a modular structure that permits the development of a range of high-level services. Regarding multi-domain reservations, GARA must exist in all the traversed domains, and the user (or a broker acting on their behalf) has to be authenticated with all the domains. This makes GARA difficult to scale. The Network Resource Scheduling Entity (NRSE) [5] suggests that signalling and per-flow state overhead can cause end-to-end QoS reservation schemes to scale poorly to a large number of users and multi-domain operations – observed when using IntServ and RSVP, as well as with GARA [5]. This has been addressed in NRSE by storing the per-flow/per-application state only at the end-sites involved in the communication.
Although NRSE has demonstrated its effectiveness in providing DiffServ QoS, it is not clear how a Grid application developer would make use of this capability – especially as the application programming interface is not clearly defined [3]. Grid Quality of Service Management (G-QoSM) [3] is a framework to support QoS management in computational Grids in the context of the Open Grid Service Architecture (OGSA). G-QoSM is a generic modular system that, conceptually, supports various types of resource QoS, such as computation, network and disk storage. This framework aims to provide three main functions: 1) support for resource and service discovery based on QoS properties; 2) provision for QoS guarantees at application, middleware and network levels,
and the establishment of Service Level Agreements (SLAs) to enforce QoS parameters; and 3) support for QoS management of allocated resources, on three QoS levels: ‘guaranteed’, ‘controlled load’ and ‘best effort’. G-QoSM also supports adaptation strategies to share resource capacity between these three user categories. The Grid Network-aware Resource Broker (GNRB) [2] is an entity that enhances the features of a Grid Resource Broker with the capabilities provided by a Network Resource Manager. This leads to the design and implementation of new mapping/scheduling mechanisms that take into account both network and computational resources. The GNRB, using network status information, can reserve network resources to satisfy the QoS requirements of applications. The architecture is centralized, with one GNRB per administrative domain – potentially leading to the GNRB becoming a bottleneck within the domain. Also, GNRB is a framework, and does not enforce any particular algorithms to perform scheduling of jobs to resources. Many of the above efforts do not take network capability into account when scheduling tasks. GARA schedules jobs by using DSRT and PBS, whilst G-QoSM uses DSRT. These schedulers (DSRT and PBS) only pay attention to the workload of the computing resource, so a powerful unloaded computing resource with an overloaded network could be chosen to run jobs, which decreases the performance received by users, especially when the job requires high network I/O. Finally, VIOLA [24] provides a meta-scheduling framework that provides co-allocation support for both computational and network resources. It is able to negotiate with the local scheduling systems to find, and to reserve, a common time slot to execute various components of an application. The meta-scheduling service in VIOLA has been implemented via the UNICORE Grid middleware for job submission, monitoring, and control.
This allows a user to describe the distribution of the parallel MetaTrace application and the requested resources using the UNICORE client, while the allocation and reservation of resources are undertaken automatically. A key feature in VIOLA is the network reservation capability; this allows the network to be treated as a resource within a meta-scheduling application. In this context, VIOLA is somewhat similar to our approach – in that it also considers the network as a key part in the job allocation process. However, the key difference is the focus in VIOLA on co-allocation and reservation – which is not always possible if the network is under ownership of a different administrator. Choosing the most useful domain is a key issue when propagating a query to another administrative domain.
DIANA [4] performs global meta-scheduling in a local environment, typically a LAN, and utilizes meta-schedulers that work in a P2P manner. Each site has a meta-scheduler that communicates with the meta-schedulers of all other sites. DIANA has been developed to make decisions based on global information. This makes DIANA unsuitable for realistic Grid testbeds – such as the LHC Computing Grid [1]. The Grid Distribution Manager (GridDM) is part of the e-Protein Project [18], a P2P system that performs inter-domain scheduling and load balancing within a cluster – utilizing schedulers such as SGE, Condor, etc. Similarly, Xu et al. [25] present a framework for the QoS-aware discovery of services, where the QoS is based on feedback from users. Gu et al. [14] propose a scalable aggregation model for P2P systems – to automatically aggregate services into a distributed application, enabling the resulting application to meet user-defined QoS criteria. Our proposal is based on the architecture presented in [6] and extended in [7]. This architecture provides scheduling of jobs to computing resources within one or more administrative domains. A key component is the Grid Network Broker (GNB), which schedules jobs to computing resources, taking network characteristics into account.

III. Inter-domain scheduling

The proposed architecture is shown in Figure 2 and has the following entities: users, each with a number of jobs; computing resources, e.g. clusters of computers; routers; the GNB (Grid Network Broker), a job scheduler; the GIS (Grid Information Service), such as [11], which keeps a list of available resources; a resource monitor (for example, Ganglia [16]), which provides detailed information on the status of the resources; and the BB (Bandwidth Broker), such as [22], which is in charge of the administrative domain and has direct access to routers.
The BB can be used to support reservation of network links, and can keep track of the interconnection topology between two end points within a network. A more in-depth description of the functionality of the architecture can be found in [7]. We make the following assumptions in the architecture: (1) each domain must provide the resources it announces – i.e. when a domain publishes X machines with Y speed, those machines are physically located within the domain (the opposite case would be a domain containing just a pointer to where the machines are); this is used to calculate the number of hops between the user and the domain providing the resource(s); (2) the resource monitor should provide
Fig. 2. One single administrative domain.
exactly the same measurements in all the domains. Otherwise, no comparison can be made between domains. We use Routing Indices (RI) [10] to enable nodes to forward queries to neighbors that are more likely to have suitable resources. A node forwards the query to a subset of its neighbors, based on its local RI, rather than by selecting neighbors at random or by flooding the network (i.e. by forwarding the query to all neighbors). This minimizes the amount of traffic generated within a P2P system.

A. Routing Indices

Routing Indices (RI) [10] were initially developed to support document discovery in P2P systems, and they have also been used to implement a Grid information service in [19]. The goal of RIs is to help users efficiently find documents with content of interest across potential P2P nodes. The RI represents the availability of data of a specific type in the neighbor's information base. We use a version of RI called Hop-Count Routing Index (HRI) [10], which considers the number of hops needed to reach a datum. Our implementation of HRI calculates the aggregate capability of a neighbor domain, based on the resources it contains and the effective bandwidth of the link between the two domains. More precisely, Equation (1) is applied:

I_l^p = \sum_{i=0}^{num\_machines_p} \frac{max\_num\_processes_i}{current\_num\_processes_i} \times eff\_bw(l, p)    (1)

where I_l^p is the information that the local domain l keeps about the neighbor domain p; num_machines_p is the number of machines domain p has; current_num_processes_i is the current number of processes running in machine i; max_num_processes_i is the maximum number of processes that can be run in that machine; and eff_bw(l, p) is the effective bandwidth of the network connection between the local domain l and the peer domain p, which is calculated as follows. At every interval, GNBs forward a query along the path to their neighbor GNBs, asking for the number of transmitted bytes for each interface the query goes
through (the OutOctets parameter of SNMP [17]). By using two consecutive measurements (m_1 and m_2, where m_1 shows X bytes and m_2 shows Y bytes), considering the moments when they were collected (m_1 at time t_1 seconds and m_2 at t_2 seconds), and the capacity of the link C, we can calculate the effective bandwidth of each link as follows:

eff\_bw(l, p) = C - \frac{Y - X}{t_2 - t_1}    (2)
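Equations (1) and (2), together with the bottleneck rule for paths described next, can be sketched in a few lines. This is an illustrative sketch, not the paper's implementation; all function and parameter names are assumptions, and the capacity C is assumed to be expressed in the same per-second units as the counter difference.

```python
def eff_bw(capacity, octets_x, octets_y, t1, t2):
    """Equation (2): link capacity minus the throughput observed between
    two consecutive SNMP OutOctets samples (X bytes at t1, Y bytes at t2)."""
    return capacity - (octets_y - octets_x) / (t2 - t1)

def path_eff_bw(link_samples):
    """The effective bandwidth of a path is that of its bottleneck link;
    link_samples is a list of (capacity, X, Y, t1, t2) tuples."""
    return min(eff_bw(*s) for s in link_samples)

def domain_quality(machines, eff_bw_lp):
    """Equation (1): I_l^p, the aggregate capability of neighbor domain p.
    `machines` holds (max_num_processes, current_num_processes) pairs."""
    return sum(mx / cur for mx, cur in machines) * eff_bw_lp
```

For example, two machines at 50% and 25% load over a link with effective bandwidth 10 yield a quality of (2 + 4) × 10 = 60.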
The effective bandwidth of the path is the smallest effective bandwidth of the links in that path. Also, predictions of the resource power and the effective bandwidth can be used, for example calculated as pointed out in [7]. As we can see, the network plays an important role when calculating the quality of a domain. Because of space limitations, we cannot provide an in-depth explanation of the formulas; see [8] for details on the terms in Equation (1). We use HRI as described in [10]: in each peer, the HRI is represented as an M × N table, where M is the number of neighbors and N is the horizon (maximum number of hops) of our index. The nth position in the mth row is the quality of the domain(s) that can be reached going through neighbor m, within n hops. As an example, the HRI of peer P1 is shown in Table I (for the topology depicted in Figure 1), where S_x.y is the information for peers that can be reached through peer x and are y hops away from the local peer (P1).
TABLE I
HRI for peer P1.

Peer | 1 hop | 2 hops | 3 hops
P2   | S_2.1 | S_2.2  | S_2.3
P3   | S_3.1 | S_3.2  | S_3.3
So, S_2.2 is the quality of the domain(s) which can be reached through peer P2, whose distance from the local peer is 2 hops. Each S_x.y is calculated by means of Equation (3). In this equation, d(P_x, P_i) is the distance (in number of hops) between peers P_x and P_i. S_x.y is calculated differently based on the distance from the local peer. When the distance is 1, then S_x.y = I_{P_x}^{P_l}, because the only peer that can be reached from the local peer P_l through P_x within 1 hop is P_x itself. Otherwise, for those peers P_i whose distance from the local peer is y, we add the information that each peer P_t (a neighbor of P_i at distance y − 1 from the local peer) keeps about them. The HRI of peer P1 is thus calculated as shown in Table II.
S_{x.y} = \begin{cases} I_{P_x}^{P_l} & \text{when } y = 1 \\ \sum_i I_{P_i}^{P_t}, \; \forall P_i : d(P_l, P_i) = y \wedge d(P_l, P_t) = y - 1 \wedge d(P_t, P_i) = 1 & \text{otherwise} \end{cases}    (3)

TABLE II
HRI for peer P1.

Peer | 1 hop       | 2 hops                    | 3 hops
P2   | I_{P2}^{P1} | I_{P4}^{P2} + I_{P5}^{P2} | I_{P8}^{P4} + I_{P9}^{P4} + I_{P10}^{P5} + I_{P11}^{P5}
P3   | I_{P3}^{P1} | I_{P6}^{P3} + I_{P7}^{P3} | I_{P12}^{P6} + I_{P13}^{P6} + I_{P14}^{P7} + I_{P15}^{P7}
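The recursive aggregation in Equation (3) can be sketched for a tree topology like the one in Figure 1. This is an illustrative sketch under assumed data structures (`info` and `children` maps), not the paper's code.

```python
def hri_entry(parent, peer, hops, info, children):
    """S_{peer.hops}: aggregate information for domains exactly `hops`
    hops from the local peer, reached through direct neighbor `peer`.

    info[(a, b)] = I_b^a, what peer a keeps about its neighbor b.
    children[p]  = peers one hop further from the local peer, through p
                   (tree topology assumed, as in Figure 1)."""
    if hops == 1:
        return info[(parent, peer)]          # base case of Equation (3)
    # otherwise: sum the entries each peer at distance hops-1 keeps
    # about its own neighbors at distance hops
    return sum(hri_entry(peer, c, hops - 1, info, children)
               for c in children[peer])
```

A row of Table II for neighbor P2 up to horizon 3 would then be `[hri_entry('P1', 'P2', y, info, children) for y in (1, 2, 3)]`, which for y = 2 expands to I_{P4}^{P2} + I_{P5}^{P2} exactly as in the table.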
In order to use RIs, a key component is the goodness function [10]. The goodness function decides how good each neighbor is by considering the HRI and the distance between neighbors. More concretely, our goodness function is shown in Equation (4):

goodness(p) = \sum_{j=1..H} \frac{S_{p.j}}{F^{j-1}}    (4)
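Equation (4) (with p the neighbor, H the horizon and F the fanout, as defined below) reduces to a short discounted sum. A minimal sketch, assuming the HRI row for neighbor p is available as a list [S_{p.1}, ..., S_{p.H}]:

```python
def goodness(hri_row, fanout):
    """Equation (4): sum of S_{p.j} / F^(j-1) for j = 1..H.
    Entries further away weigh less, reflecting that roughly F^(j-1)
    domains at hop j share a single aggregated entry."""
    return sum(s / fanout ** j for j, s in enumerate(hri_row))
```

For example, with a row [8, 4] and fanout 2, the goodness is 8 + 4/2 = 10.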
In Equation (4), p is the peer domain to be considered; H is the horizon for the HRIs; and F is the fanout of the topology. As [10] explains, the horizon is the limit distance: peers whose distance from the local peer exceeds the horizon are not considered. The fanout of the topology is the maximum number of neighbors of a peer.

B. Search technique

In the literature, several techniques are used for searching in P2P networks, including flooding (e.g. Gnutella) or centralized servers (e.g. Napster). More effective searches are performed by systems based on distributed indices. In these configurations, each node holds a part of the index. The index optimizes the probability of quickly finding the requested information by keeping track of the availability of data at each neighbor. Algorithm 1 shows the way our architecture performs the scheduling of jobs to computing resources. In our system, when a user wants to run a job, he/she submits a query to the GNB of the local domain. This query is stored (line 7) when it arrives for the first time at a GNB. Subsequently, the GNB looks for a computing resource in the local domain matching the requirements of the query (line 9). If the GNB finds a computing resource in the local domain that matches the requirements, then it tells the user to use that resource to run the job (line 22). Otherwise, the GNB will forward the query to the GNB of one of the neighbor domains.
This neighbor domain will be chosen based on the Hop-Count Routing Index, HRI, explained before (line 13). The parameter ToTry is used to decide which neighbor should be contacted next (in Figure 3, p3 will contact p6); if the query is bounced back, then the 2nd best neighbor will be contacted (p3 will contact peer p7), and so on. Hence, a neighbor domain will only be contacted if there are no local computing resources available to fulfill the query (e.g. finish the job before the deadline expires).

Algorithm 1 Searching algorithm.
 1: Let q = new incoming query
 2: Let LocalResource = a resource in the local domain
 3: Let NextBestNeighbor = a neighbor domain selected by the goodness function
 4: Let ToTry = the next neighbor domain to forward the query to
 5: for all q do
 6:   LocalResource := null
 7:   if (QueryStatus(q) = not present) then
 8:     QueryStatus(q) := 1
 9:     LocalResource := MatchQueryLocalResource(q)
10:   end if
11:   if (LocalResource == null) then
12:     ToTry := QueryStatus(q)
13:     NextBestNeighbor := HRI(q, ToTry)
14:     if (NextBestNeighbor == null) then
15:       Recipient := Sender(q)
16:     else
17:       Recipient := NextBestNeighbor
18:       QueryStatus(q) += 1
19:     end if
20:     ForwardQueryToRecipient(q, Recipient)
21:   else
22:     SendResponseToRequester(q)
23:   end if
24: end for
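Algorithm 1 can be rendered as a single processing step per query arrival. This is an illustrative sketch only: the GNB internals (`match_local`, `hri_best`, `forward`, `respond`, the `query_status` map) are assumed names, not the paper's API.

```python
def handle_query(gnb, q):
    """One GNB processing step for query q, mirroring Algorithm 1."""
    local = None
    if q.id not in gnb.query_status:        # first arrival (lines 7-10)
        gnb.query_status[q.id] = 1
        local = gnb.match_local(q)          # line 9
    if local is None:                       # no local match (line 11)
        to_try = gnb.query_status[q.id]     # line 12
        nxt = gnb.hri_best(q, to_try)       # line 13: ToTry-th best neighbor
        if nxt is None:
            recipient = q.sender            # bounce the query back (line 15)
        else:
            recipient = nxt
            gnb.query_status[q.id] += 1     # next bounce tries the next best
        gnb.forward(q, recipient)           # line 20
    else:
        gnb.respond(q, local)               # line 22
```

Note how a bounced query (already present in `query_status`) skips the local match and is forwarded straight to the next-best neighbor, which is what lets p3 try p6 first and p7 on the bounce.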
Fig. 3. A query (Q) is forwarded from p1 to the best neighbors (p3, p6, and p7).
IV. Evaluation

In order to illustrate the behavior of our design, we present an evaluation showing how our HRIs evolve as the measurements vary. We use the topology presented in Figure 3; only data from peer p1 are shown. For simplicity and ease of explanation, we assume that the bandwidth of all links is 1 Gbps, and that all the peers manage a single computational resource with 4 Gb of memory and a CPU speed of 1 GHz. For Equation (1), we have approximated the values of the current number of processes as a uniform distribution between 10 and 100, and the maximum number of processes as 100. Regarding eff_bw(l, p), we have considered a Poisson distribution for those links that are heavily loaded, and a Weibull distribution for those links which are not so loaded, as [9] suggests. In Figure 3, links with even-numbered labels will be heavily used, and are depicted with a thicker line.

Fig. 4. Use of a not heavily loaded link (Weibull distribution).

Fig. 5. Use of a heavily loaded link (Poisson distribution).
To calculate the parameters for these distributions (the mean μ for the Poisson distribution, and the scale β and shape α for the Weibull distribution), we have considered that the level of use of heavily used links is 80%, whilst lightly used links exhibit a 10% usage. This way, if a heavily used link transmits 800 Mb in 1 second, and the maximum transfer unit of the links is 1500 bytes, the inter-arrival time for packets is 0.000015 seconds. This is the value for the μ of the Poisson distribution. In the same way, we calculate the value for the β parameter of the Weibull distribution, obtaining 0.00012 seconds. We can now calculate the inter-arrival time for packets and the effective bandwidth. We have simulated a measurement period of 7 days, with measurements collected every 30 minutes. Figures 4, 5 and 6 present the variation in the use of links and in the number of processes, following the
Fig. 6. Variation of the number of processes (Uniform distribution).
mathematical distributions explained before. Figures 4 and 5 represent the level of use of links compared to the actual bandwidth (1 Gbps), per measurement. Heavily used links reach a higher used bandwidth than lightly used links. The data shown in these figures are used by our HRIs in order to decide where to forward a query.
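The parameter arithmetic above (μ for the Poisson model of heavily used links, β for the Weibull model of lightly used ones) can be reproduced directly; a sketch, with the constants (1 Gbps links, 1500-byte MTU, 80% and 10% utilisation) taken from the text:

```python
MTU_BITS = 1500 * 8        # 12,000 bits per packet
LINK_BPS = 1e9             # 1 Gbps links

def interarrival(load_fraction):
    """Mean packet inter-arrival time (seconds) at a given utilisation:
    one MTU-sized packet every MTU_BITS / (rate actually used)."""
    return MTU_BITS / (LINK_BPS * load_fraction)

mu = interarrival(0.80)    # heavily used link  -> Poisson mean, 1.5e-05 s
beta = interarrival(0.10)  # lightly used link  -> Weibull scale, 1.2e-04 s
```

Both values match the 0.000015 s and 0.00012 s figures quoted in the text.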
Fig. 7. S_2.1 = I_{p2}^{p1} (link p1–p2 is not heavily loaded).

Fig. 8. S_3.1 = I_{p3}^{p1} (link p1–p3 is heavily loaded).

Fig. 9. S_2.2 (S_3.2 would also look like this).

Figures 7, 8 and 9 present the variation of S_x.y for both heavily and lightly loaded links. These figures have been calculated by means of the formulas explained in Section III-A, applied to the mathematical distributions mentioned above. As explained in Tables I and II, S_2.1 = I_{p2}^{p1} and S_3.1 = I_{p3}^{p1}. We can see that the network performance affects the HRI, as was expected. Recall that the higher the HRI, the better, because it means that the peer is powerful and well connected. Also, we see that when the link is not heavily loaded, S reaches more high values, and values are more scattered across the figure. In contrast, when the link is heavily loaded, more values are grouped together at the bottom of the figure. Also, for Figure 9, S_2.2 = I_{p4}^{p2} + I_{p5}^{p2} and S_3.2 = I_{p6}^{p3} + I_{p7}^{p3}, which means that both heavily and lightly used links contribute to the calculation of S_2.2 and S_3.2.

Figures 10 and 11 show the variation of the goodness function for both neighbors of peer p1. Recall that the link between p1 and p2 is unloaded, and the link between p1 and p3 is loaded. These facts are reflected in both goodness functions: in the case of p2, the function shows higher values than for p3. It can also be seen that the function of p2 has fewer values grouped near the zero axis. To summarize, a job originating in p1 will more likely be scheduled through peer p2 than through peer p3, as expected given the link conditions.

Fig. 10. Goodness function for peer p2 (link p1–p2 unloaded).

Fig. 11. Goodness function for peer p3 (link p1–p3 loaded).

V. Conclusions and future work
The network remains an important requirement for any Grid application, as entities involved in a Grid system (such as users, services, and data) need to communicate with each other over a network. The performance of the network must therefore be considered when carrying out tasks such as scheduling, migration or monitoring of jobs. Also, inter-domain relations are key in Grid computing. We propose an extension to an existing scheduling framework to allow network-aware multi-domain scheduling based on P2P techniques. More precisely,
our proposal is based on Routing Indices (RI). This way we allow nodes to forward queries to neighbors that are more likely to have answers. If a node cannot find a suitable computing resource for a user's job within its domain, it forwards the query to a subset of its neighbors, based on its local RI, rather than by selecting neighbors at random or by flooding the network by forwarding the query to all neighbors. Our approach will be evaluated further using the GridSim simulation toolkit [23]. In this way, we will be able to study how the proposed technique behaves in complex scenarios, in a repeatable and controlled manner.

Acknowledgement

Work jointly supported by the Spanish MEC and European Commission FEDER funds under grants "Consolider Ingenio-2010 CSD2006-00046" and "TIN2006-15516-C04-02"; jointly by JCCM and Fondo Social Europeo under grant "FSE 2007-2013"; and by JCCM under grants "PBC-05-007-01" and "PBC-05-005-01".

References

[1] LCG (LHC Computing Grid) Project. Web Page, 2008. http://lcg.web.cern.ch/LCG.
[2] D. Adami et al. Design and implementation of a grid network-aware resource broker. In Intl. Conf. on Parallel and Distributed Computing and Networks, Innsbruck, Austria, 2006.
[3] R. Al-Ali et al. Network QoS Provision for Distributed Grid Applications. Intl. Journal of Simulations Systems, Science and Technology, Special Issue on Grid Performance and Dependability, 5(5), December 2004.
[4] A. Anjum, R. McClatchey, H. Stockinger, A. Ali, I. Willers, M. Thomas, M. Sagheer, K. Hasham, and O. Alvi. DIANA scheduling hierarchies for optimizing bulk job scheduling. In Second Intl. Conference on e-Science and Grid Computing, Amsterdam, Netherlands, 2006.
[5] S. Bhatti, S. Sørensen, P. Clark, and J. Crowcroft. Network QoS for Grid Systems. The Intl. Journal of High Performance Computing Applications, 17(3), 2003.
[6] A. Caminero, C. Carrión, and B. Caminero. Designing an entity to provide network QoS in a Grid system.
In 1st Iberian Grid Infrastructure Conference (IberGrid), Santiago de Compostela, Spain, 2007.
[7] A. Caminero, O. Rana, B. Caminero, and C. Carrión. An Autonomic Network-Aware Scheduling Architecture for Grid Computing. In 5th Intl. Workshop on Middleware for Grid Computing (MGC), Newport Beach, USA, 2007.
[8] A. Caminero, O. Rana, B. Caminero, and C. Carrión. Providing network QoS support in Grid systems by means of peer-to-peer techniques. Technical Report DIAB-08-01-1, Dept. of Computing Systems, Univ. of Castilla La Mancha, Spain, January 2008.
[9] J. Cao, W. Cleveland, D. Lin, and D. Sun. Nonlinear Estimation and Classification, chapter Internet traffic tends toward Poisson and independent as the load increases. Springer Verlag, New York, USA, 2002.
[10] A. Crespo and H. Garcia-Molina. Routing Indices For Peer-to-Peer Systems. In Intl. Conference on Distributed Computing Systems (ICDCS), Vienna, Austria, 2002.
[11] S. Fitzgerald, I. Foster, C. Kesselman, G. von Laszewski, W. Smith, and S. Tuecke. A directory service for configuring high-performance distributed computations. In 6th Symposium on High Performance Distributed Computing (HPDC), Portland, USA, 1997.
[12] I. Foster and C. Kesselman. The Grid 2: Blueprint for a New Computing Infrastructure. Morgan Kaufmann, 2 edition, 2003. [13] I. T. Foster. The anatomy of the Grid: Enabling scalable virtual organizations. In 1st Intl. Symposium on Cluster Computing and the Grid (CCGrid), Brisbane, Australia, 2001. [14] X. Gu and K. Nahrstedt. A Scalable QoS-Aware Service Aggregation Model for Peer-to-Peer Computing Grids. In 11th Intl. Symposium on High Performance Distributed Computing (HPDC), Edinburgh, UK, 2002. [15] F. T. Marchese and N. Brajkovska. Fostering asynchronous collaborative visualization. In 11th Intl. Conference on Information Visualization, Washington DC, USA, 2007. [16] M. L. Massie, B. N. Chun, and D. E. Culler. The Ganglia distributed monitoring system: design, implementation, and experience. Parallel Computing, 30(5-6):817–840, 2004. [17] K. McCloghrie and M. T. Rose. Management Information Base for Network Management of TCP/IP-based internets: MIB-II. Internet proposed standard RFC 1213, March 1991. [18] A. O’Brien, S. Newhouse, and J. Darlington. Mapping of Scientific Workflow within the E-Protein Project to Distributed Resources. In UK e-Science All-hands Meeting, Nottingham, UK, 2004. [19] D. Puppin, S. Moncelli, R. Baraglia, N. Tonellotto, and F. Silvestri. A Grid Information Service Based on Peer-to-Peer. In 11th Intl. Euro-Par Conference, Lisbon, Portugal, 2005. [20] Y. Rekhter, T. Li, and S. Hares. A Border Gateway Protocol 4 (BGP-4). Internet proposed standard RFC 4271, January 2006. [21] A. Roy. End-to-End Quality of Service for High-End Applications. PhD thesis, Dept. of Computer Science, Univ. of Chicago, 2001. [22] S. Sohail, K. B. Pham, R. Nguyen, and S. Jha. Bandwidth Broker Implementation: Circa-Complete and Integrable. Technical report, School of Computer Science and Engineering, The University of New South Wales, 2003. [23] A. Sulistio, G. Poduval, R. Buyya, and C.-K. Tham. 
On incorporating differentiated levels of network service into GridSim. Future Generation Computer Systems, 23(4), May 2007. [24] O. Waldrich, P. Wieder, and W. Ziegler. A Meta-scheduling Service for Co-allocating Arbitrary Types of Resources. In 6th Intl. Conference on Parallel Processing and Applied Mathematics (PPAM), Poznan, Poland, 2005. [25] D. Xu, K. Nahrstedt, and D. Wichadakul. QoS-Aware Discovery of Wide-Area Distributed Services. In 1st Intl. Symp. on Cluster Comp. and the Grid (CCGrid), Brisbane, Australia, 2001.
Using a Web-based Framework to Manage Grid Deployments
Georgios Oikonomou 1 and Theodore Apostolopoulos 1
1 Department of Informatics, Athens University of Economics and Business, Athens, Greece
Abstract - WebDMF is a Web-based framework for the management of distributed services. It is based on the Web-based Enterprise Management (WBEM) standards family and introduces a middleware layer of entities called “Representatives”. Details related to the managed application are detached from the representative logic, making the framework suitable for a variety of services. WebDMF can be integrated with existing WBEM infrastructures and is complementary to web service-based management efforts. This paper describes how the framework can be used to manage grids without modifications to existing installations. It compares the proposed solution with other research initiatives. Experiments on an emulated network topology indicate its viability. Keywords: WebDMF, Grid Management, Distributed Services Management, Web-based Enterprise Management, Common Information Model.
1 Introduction
During the past decades the computing and networking landscape has undergone revolutionary changes. From the era of single, centralised systems we are steadily moving to an era of highly decentralised, interconnected nodes that share resources in order to provide services transparently to the end user. Traditionally, legacy management approaches such as the Simple Network Management Protocol (SNMP) [1] targeted single nodes. The current paradigm presents new challenges and increases complexity in the area of network and systems management. There is a need for solutions that view a distributed deployment as a whole, instead of as a set of isolated hosts. The Web-based Distributed Management Framework (WebDMF) is the result of our work detailed in [2]. It is a framework for the management of distributed services and uses standard web technologies. Its core is based on the Web-based Enterprise Management (WBEM) family of specifications [3], [4], [5]. It is not limited to monitoring but is also capable of modifying the run-time parameters of
the managed service. Finally, it has a wide target group: it can manage a variety of distributed systems, such as distributed file systems, computer clusters and computational or data grids. However, multiprocessor, multi-core, parallel computing and similar systems are considered out of the scope of our work, even though they are very often referred to as "distributed". The main contribution of this paper is three-fold:
• We demonstrate how a WebDMF deployment can be used for the management of a grid, without any modification to existing WBEM management infrastructures.
• We provide indications for the viability of the approach through a preliminary performance evaluation.
• We show that WebDMF is not competitive to emerging Web Service-based grid management initiatives; instead, it is a step in the same direction.
Section 2 summarizes some recent approaches in the field of grid management and compares our work with those efforts. In order to familiarize the reader with some basic concepts, section 3 presents a short introduction to the WBEM family of standards. In section 4 we briefly describe WebDMF's architecture and some implementation details. In the same section we demonstrate how the framework can be used to manage grids. Finally, we discuss the relationship between WebDMF and Web Service-based management and present some preliminary evaluation results. Section 5 presents our conclusions.
2 Related Work – Motivation
In this section we aim to outline some of the research initiatives in the field of grid management. The brief review is limited to the most recent ones.
2.1 Related Work
An important approach is the one proposed by the Open Grid Forum (OGF). OGF’s Grid Monitoring Architecture (GMA) uses an event producer – event consumer model to monitor grid resources [6]. However, as the name suggests, GMA is limited to monitoring. It lacks active management and configuration capabilities.
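GMA's producer-consumer interaction can be illustrated with a minimal sketch. All class and method names below are ours, chosen for illustration; they are not part of the GMA specification:

```python
# Minimal sketch of an event producer / consumer registry in the spirit of
# GMA: producers publish monitoring events, consumers subscribe to them.
# Illustrative only; names are not taken from the GMA specification.

class Registry:
    def __init__(self):
        self.subscribers = {}  # event type -> list of consumer callbacks

    def subscribe(self, event_type, callback):
        self.subscribers.setdefault(event_type, []).append(callback)

    def publish(self, event_type, payload):
        # Deliver the event to every consumer registered for this type.
        for callback in self.subscribers.get(event_type, []):
            callback(payload)

registry = Registry()
received = []
registry.subscribe("cpu.load", received.append)            # a consumer
registry.publish("cpu.load", {"host": "n1", "load": 0.42})  # a producer
```

Note that, as the text points out, this model only covers monitoring: nothing in it changes the state of the monitored resource.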
gLite is a grid computing middleware, developed as part of the Enabling Grids for E-sciencE (EGEE) project. gLite implements an “Information and Monitoring Subsystem”, called R-GMA (Relational GMA), which is a modification of OGF’s GMA. Therefore it also only serves monitoring purposes [7]. The Unified Grid Management and Data Architecture (UGanDA) is an enterprise level workflow and grid management system [8]. It contains a grid infrastructure manager called MAGI. MAGI has many features but is limited to the management of UGanDA deployments.
MRF is a Multi-layer resource Reconfiguration Framework for grid computing [9]. It has been implemented on a grid-enabled Distributed Shared Memory (DSM) system called Teamster-G [10]. MonALISA stands for "Monitoring Agents using a Large Integrated Services Architecture". It "aims to provide a distributed service architecture which is used to collect and process monitoring information" [11]. Many Globus deployments use MonALISA to support management tasks. Again, the lack of capability to modify the running parameters of the managed resource is notable. Finally, we should mention emerging service-based management initiatives, such as the Web Services Distributed Management (WSDM) [12] standard and the Web Services for Management (WS-Man) specification [13]. Due to their importance, they are discussed in greater detail in section 4 of this paper.

2.2 Motivation

Table I compares WebDMF with the solutions presented above. For this comparison we consider three factors:
• The ability to perform monitoring.
• Whether the approach can actively modify the grid's run-time parameters.
• Whether the approach is generic or focuses on infrastructures implemented using a specific technology.

TABLE I. COMPARING WEBDMF WITH OTHER GRID MANAGEMENT SOLUTIONS.

Name               Monitoring   Set   Target Group
OGF's GMA          Y            -     Wide
gLite – R-GMA      Y            -     Focused
UGanDA – MAGI      Y            Y     Focused
MRF – Teamster-G   Y            Y     Focused
MonALISA           Y            -     Wide
WebDMF             Y            Y     Wide

Our motivation to design WebDMF was to provide a framework that is generic enough to manage grid deployments regardless of the technology used to implement their infrastructure. At the same time, it should not be limited to monitoring but should also provide "set" capabilities. Other advantages are:
• It is based on WBEM, a family of open standards. WBEM allows easy integration with web service-based management approaches, and it has been considered adequate for the management of applications, as opposed to other approaches (e.g. SNMP) that focus on the management of devices.
• It provides interoperability with existing WBEM-based management infrastructures.

3 Web-based Enterprise Management

Web-Based Enterprise Management (WBEM) is a set of specifications published by the Distributed Management Task Force (DMTF). A large number of companies are also involved in this ongoing management initiative. This section presents a brief introduction to the WBEM family of standards. Fig. 1 displays the three core WBEM components: the data model (the Common Information Model), the encoding (CIM in XML), and the transport (CIM Operations over HTTP, together known as CIM-XML).

The "Common Information Model" (CIM) is a set of specifications for the modeling of management data [3]. It is an object-oriented, platform-independent model maintained by the DMTF. It includes a "core schema" with definitions that apply to all management areas, as well as a set of "common models" that represent common management areas, such as networks, hardware, software and services. Finally, the CIM allows manufacturers to define technology-specific "extension schemas" that directly suit the management needs of their implementations.

Fig. 1. The three core WBEM components.
For the interaction between WBEM entities (clients and managed elements), WBEM uses a set of well-defined
request and response data packets. CIM elements are encoded in XML in accordance with the xmlCIM specification [4]. The resulting XML document is then transmitted over a network as the payload of an HTTP message. This transport mechanism is called "CIM Operations over HTTP" [5]. WBEM follows the client-server paradigm. The WBEM client corresponds to the term "management station" used in other management architectures. A WBEM server is made up of the components portrayed in Fig. 2.

Fig. 2. WBEM instrumentation.

The WBEM client does not have direct access to the managed resources. Instead, it sends requests to the CIM Object Manager (CIMOM), using CIM over HTTP. The CIMOM handles all communication with the client, delegates requests to the appropriate providers and returns responses. Providers act as plugins for the CIMOM: they are responsible for the actual implementation of the management operations for a managed resource and are therefore implementation-specific. The repository is the part of the WBEM server that stores the definitions of the core, common and extension CIM schemas. A significant number of vendors have started releasing WBEM products. The SBLIM open source project offers a suite of WBEM-related tools. Furthermore, OpenPegasus, OpenWBEM and WBEMServices are some noteworthy open source CIMOM implementations. There are also numerous commercial solutions.

4 WebDMF: Web-based Management of Distributed Services

In this section we introduce the reader to the concept and design of the WebDMF management framework and present some implementation details. Due to length restrictions we cannot provide deep technical design details. We explain how the framework can be used to manage grid deployments. The section continues with a discussion of the relationship between WebDMF and Web Service-based management, and concludes with a preliminary performance evaluation indicating the viability of the approach.

4.1 Design

WebDMF stands for Web-based Distributed Management Framework. It treats a distributed system as a number of host nodes that are interconnected over a network and share resources to provide services to the end user. The framework aims to provide management facilities for these nodes; through their management, we achieve the management of the entire deployment. The architecture is based on the WBEM family of technologies. Nodes function as WBEM entities (clients, servers or both, depending on their role in the deployment). The messages exchanged between nodes are CIM-XML messages.
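The CIM-XML encoding used for such exchanges can be illustrated with a rough sketch. The element layout below follows the general shape of xmlCIM [4], but it is a simplified fragment built for illustration, not a spec-complete DSP0201 document:

```python
# Build a simplified CIM-XML request payload, roughly in the shape used by
# "CIM Operations over HTTP". Illustrative sketch, not a validated xmlCIM
# document.
import xml.etree.ElementTree as ET

def enumerate_instances_request(namespace, classname):
    cim = ET.Element("CIM", CIMVERSION="2.0", DTDVERSION="2.0")
    msg = ET.SubElement(cim, "MESSAGE", ID="1", PROTOCOLVERSION="1.0")
    req = ET.SubElement(msg, "SIMPLEREQ")
    call = ET.SubElement(req, "IMETHODCALL", NAME="EnumerateInstances")
    ns = ET.SubElement(call, "LOCALNAMESPACEPATH")
    for part in namespace.split("/"):
        ET.SubElement(ns, "NAMESPACE", NAME=part)
    param = ET.SubElement(call, "IPARAMVALUE", NAME="ClassName")
    ET.SubElement(param, "CLASSNAME", NAME=classname)
    return ET.tostring(cim, encoding="unicode")

# The resulting XML would travel as the payload of an HTTP POST to a CIMOM.
payload = enumerate_instances_request("root/cimv2", "Linux_OperatingSystem")
```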
WebDMF’s design introduces a middleware layer of entities that we call “Management Representatives”. They act as peers and form a management overlay network. This new layer of nodes is integrated with the existing WBEMbased management infrastructure. Representatives act as intermediaries between existing WBEM clients and CIM Object Managers. In our work we use the terms “Management” and “Service” node when referring to those entities. This resembles the “Manager of Managers” (MoM) approach. However, in MoM there is no direct communication between domain managers. Representatives are aware of the existence of their peers. Therefore, WebDMF adopts the “Distributed Management” approach. By distributing management over several nodes throughout the network, we can increase reliability, robustness and performance, while network communication and computation costs decrease [14]. Fig. 3 displays the three management entities mentioned above, forming a very simple topology. A “Management Node” is a typical WBEM client. It is used to monitor and configure the various operational parameters of the distributed service. Any existing WBEM client software can be used without modifications. A “Service Node” is the term used when referring to any node – member of the distributed service. For instance, in the case of a data grid, the term would be used to describe a storage device. Similarly, in a computational grid, the term can describe an execution host. As stated
previously, the role of a node in a particular grid deployment does not affect the design of our framework.

Fig. 3. Management entities.

Typically, a Service Node executes an instance of the (distributed) managed service. As displayed in Fig. 4 (a), a WBEM request is received by the CIMOM on the Service Node. A provider specifically written for the service handles the execution of the management operation. The existence of such a provider is a requirement; in other words, the distributed service must be manageable through WBEM. Alternatively, a service may be manageable through SNMP, as shown in Fig. 4 (b). In such a case the node may still participate in WebDMF deployments, but some functional restrictions will apply.

Fig. 4. Service node.

The framework introduces an entity called the "Management Representative". This entity receives requests from a WBEM client and performs management actions on the relevant service nodes. After a series of message exchanges, it responds to the initial request. A representative is more than a simple "proxy" that receives and forwards requests. It performs a number of other operations, including the following:
• Exchanges messages with other representatives regarding the state of the system as a whole.
• Keeps a record of Service Nodes that participate in the deployment.
• Redirects requests to other representatives.
Fig. 5 displays the generic case of a distributed deployment. Communication between representatives is also performed over WBEM.

Fig. 5. A generic deployment.

The initial requests do not state explicitly which service nodes are involved in the management task. The decision about the destination of the intermediate message exchange is part of the functionality implemented in the representative, and the message exchange is transparent to the management node and the end user. In order to achieve this functionality, a representative is further split into building blocks, as shown in Fig. 6. It can act as a WBEM server as well as a client. Initial requests are received by the CIMOM on the representative and delegated to the WebDMF provider module for further processing. The module performs the following functions:

Fig. 6. WebDMF representative.

• Determines whether the request can be served locally.
• If the request cannot be served locally, selects the appropriate representative and forwards the request to it.
• If the request can be served locally, creates a list of service nodes that should be contacted and issues intermediate requests.
• Processes intermediate responses and generates the final response.
• Maintains information about the distributed system's topology.
In some situations, a service node does not support WBEM but is only manageable through SNMP. In this case, the representative attempts to perform the operation using SNMP methods, based on a set of WBEM to SNMP mapping rules. This has limitations, since it is not possible to map all methods; even so, the legacy service node can still participate in the deployment.
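The representative's decision logic can be condensed into a short sketch. The domain table, the per-node query helpers and all other names below are hypothetical stand-ins for the real provider module:

```python
# Sketch of a representative's request handling: serve locally by fanning
# out to service nodes (falling back to SNMP for legacy nodes), or forward
# to the representative responsible for the target domain.
# All data structures and helper names are illustrative.

def handle_request(request, my_domain, domain_table, service_nodes):
    if request["domain"] != my_domain:
        # Not ours: pick the responsible representative and forward.
        return ("forward", domain_table[request["domain"]])
    results = []
    for node in service_nodes:
        if node["wbem"]:
            results.append(node["query_wbem"](request["class"]))
        else:
            # Legacy node: apply the WBEM-to-SNMP mapping rules (a trivial
            # stand-in here) and query over SNMP instead.
            results.append(node["query_snmp"](request["class"]))
    return ("response", sum(results))

nodes = [
    {"wbem": True,  "query_wbem": lambda c: 2048, "query_snmp": None},
    {"wbem": False, "query_wbem": None, "query_snmp": lambda c: 1024},
]
table = {"domainB": "R2"}
```

A request for a foreign domain yields a forwarding decision, while a local request fans out to both node types and aggregates the results.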
In a WebDMF deployment, a representative is responsible for the management of a group of service nodes. We use the term "Domain" when referring to such groups. Domains are organized in a hierarchical structure, in which the top level (the root node of the tree) corresponds to the entire deployment. The domain hierarchy of each individual deployment can be designed according to a variety of criteria; for example, a system might be separated into domains based on the geographical location of nodes. WebDMF defines two categories of management operations: i) Horizontal (Category A) and ii) Vertical (Category B).

Horizontal operations enable management of the WebDMF overlay network itself. These functions can, for example, be used to perform topology changes. The message exchange that takes place does not involve Service Nodes; therefore, the managed service is not affected in any way.

Vertical operations, on the other hand, read and modify the CIM schema on the Service Node, thus achieving management of the target application. Typical examples include:
• Setting new values on CIM objects of many service nodes.
• Reading operational parameters from service nodes and reporting an aggregate (e.g. sum or average).
In line with the above, we have designed two CIM schemas for WebDMF: the core schema ("WebDMF_Core") and the request factory. They both reside in the representatives' repositories. The former schema models the deployment's logical topology, as discussed earlier, and corresponds to horizontal functions. The latter schema is represented by the class diagram in Fig. 7 and corresponds to vertical functions. Users can call WBEM methods on instances of this schema. In doing so, they define the management operations that they wish to perform on the target application. Each request towards the distributed deployment is treated as a managed resource itself. For example, a user can create a new request, execute it periodically and read the results; they can then modify it, re-execute it and finally delete it. Each request is mapped by the representative to intermediate WBEM requests issued to service nodes.

Fig. 7. Request Factory CIM Schema.

Request factory classes are generic. They are not related in any way to the CIM schema of the managed application. This makes WebDMF appropriate for the management of a wide variety of services; furthermore, it does not need re-configuration when the target schema is modified.

4.2 Implementation
The WebDMF representative is implemented as a single shared object library file (.so), comprising a set of WBEM providers. Each provider implements the management operations for one class of the WebDMF schemas. The interface between the CIMOM and the providers complies with the Common Manageability Programming Interface (CMPI). The providers themselves are written in C++; this does not break CIMOM independence, as described in [15]. The representative was developed on Linux 2.6.20 machines, using gcc 4.1.2 and version 2.17.50 of binutils. It has been tested with version 2.7.0 of the OpenPegasus CIMOM.
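The request-factory behaviour described earlier, where each management request is itself a managed resource that can be created, executed, read and deleted, can be sketched as a simple lifecycle. The class and helper names are hypothetical; in the real system these are CIM instances handled by the WebDMF provider:

```python
# Sketch of the request lifecycle: a management request is created as a
# resource, executed (possibly repeatedly), its results read, and finally
# the request is deleted. Illustrative only.

class ManagedRequest:
    def __init__(self, operation):
        self.operation = operation  # callable issuing intermediate requests
        self.result = None

    def execute(self):
        self.result = self.operation()
        return self.result

factory = {}

def create_request(req_id, operation):
    factory[req_id] = ManagedRequest(operation)

def delete_request(req_id):
    del factory[req_id]

# Stand-in for intermediate WBEM/SNMP queries issued to service nodes:
create_request("mem-sum", lambda: 2048 + 1024)
total = factory["mem-sum"].execute()
delete_request("mem-sum")
```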
4.3 Using WebDMF to Manage Grids
In a grid environment, a service node can potentially be an execution host, a scheduler, a meta-scheduler or a
resource allocation host; the list is not exhaustive. The role of a node does not affect the design of our framework. What we need is a CIM schema and the relevant providers that implement WBEM management for the service node. Such schemas and providers do exist. For example, an architecture for flexible monitoring of various WSRF-based grid services is presented in [16]. This architecture uses a WBEM provider that communicates with WSRF hosts and gathers status data. In a WebDMF deployment, we could have many such providers across various domains, each residing on a service node and monitoring the managed application.
4.4 WebDMF and Web Services

The grid community has been working for more than five years to transform grid computing systems into a group of web service-based building blocks. In line with this effort, management of the resulting infrastructures has also moved towards web service-based approaches. The recent OASIS Web Services Distributed Management (WSDM) [12] standard and the DMTF Web Services for Management (WS-Man) specification [13] have been considered enablers of this vision. WebDMF adopts a resource-centric approach. This may seem to be a step in the opposite direction; it is not. We consider web service-based approaches a necessary and extremely valuable effort. However, service-oriented management approaches are model-agnostic: they do not define the properties, operations, relationships, and events of managed resources [12]. Two important reasons why we chose WBEM for the resource layer are the following:
• WS-Management exposes CIM resources via web services, as defined in [17]. CIM is an inherent part of WBEM, as explained earlier in this paper.
• DMTF members are working on publishing a standard for the mapping between WS-Man operations and WBEM Generic Operations [18].
Furthermore, in order to implement a WS-Man operation, a Web Service endpoint needs to delegate requests to instrumentation that can operate on the managed resource. In current open source WS-Man implementations, management requests are eventually served by a WBEM server and the appropriate providers. WS-Man and WBEM are therefore related and complementary to each other. The WebDMF representative has been implemented as a WBEM provider; therefore, if the CIMOM operating on the representative node provides WS-Man client interfaces, the WebDMF provider will operate normally.

4.5 Performance Evaluation

In this section we present a preliminary evaluation of WebDMF's performance. The results presented here are not simulation results: they have been obtained from actual code execution and are used as an indication of the solution's viability. In order to perform measurements, we installed a testbed environment using ModelNet [19]. Our topology emulates a wide-area network. It consists of 250 virtual nodes situated in 3 LANs, each with its own gateway to the WAN. The 3 gateways are interconnected via a backbone network with high bandwidth, low delay links. We have also installed two WebDMF representatives (nodes R1 and R2). This is portrayed in Fig. 8.

Fig. 8. The emulated topology and test scenario.
We assume that for this network deployment we wish to obtain the total amount of available physical memory for the 200 nodes hosted in one of the LANs. Among those, 50 do not support basic WBEM instrumentation: they only offer SNMP-based management facilities. In this scenario, the client forms a WBEM CreateInstance() request for class WebDMF_RequestWBEM of the request factory. It is initially sent to the WebDMF representative R1, which forwards it to R2. R2 collects data from the 200 service nodes as follows:
• R2 sends intermediate requests to the 150 WBEM-enabled nodes. Those requests invoke the EnumerateInstances() operation for class Linux_OperatingSystem. Responses are sent back to R2 from the service nodes.
• As stated previously, in this scenario there are 50 SNMP-enabled nodes. R2 sends SNMP-Get packets to
those hosts, requesting the value of the hrMemorySize object. This object is part of the HOST-RESOURCES-MIB defined in RFC 1514 [20]. The transformation is based on the mapping rules mentioned in a previous section.
After collecting the responses, R2 calculates the aggregate (sum) of the reported values. This value becomes part of the response that is sent to R1, and R1 sends the final response to the client.
We repeated the above experiment 200 times. Table II summarizes the results; times are in seconds. Note that this scenario involves 204 request-response exchanges among various nodes, and that the packets crossing the network are small (a few bytes). The total execution time includes the following:
• Communication delays during request-response exchanges. This includes TCP connection setup for all WBEM message exchanges; it does not apply to the SNMP case, since SNMP uses UDP at the transport layer and no connection is set up.
• Processing overheads on R1 and R2, imposed by WebDMF's functionality.
• Processing at the service nodes to calculate the requested value and generate a response.

TABLE II. EVALUATION RESULTS.

Metrics                                 Values
N (Repetitions)                         200
Central Tendency
  Arithmetic Mean                       6.237139
  Median                                6.193212
Dispersion
  Variance                              0.015187
  Standard Deviation                    0.123237
95% Confidence Interval for the Mean
  From                                  6.220059
  To                                    6.254218

The absolute value of the average completion time may seem rather high. However, processing times are minimal compared to TCP connection setup and message exchange. With that in mind, each of the 204 request-responses completes in 30.57 milliseconds on average, which is reasonable. After 200 repetitions we observe low statistical dispersion (variance and standard deviation), indicating that the measured values are not widely spread around the mean. We draw the same conclusion by estimating a 95% confidence interval for the mean: the same experiment can be expected to complete in a similar time under similar network traffic conditions.

5 Conclusions

Ideally, a management framework should support grid deployments without the need for major modifications to the existing infrastructure. It should not be limited by the technology used to implement the grid, and it should be generic enough to support future changes. In this paper we introduced WebDMF, a Web-based Distributed Management Framework, and presented how it can be used to manage grids. We discussed its generality and demonstrated its viability through a performance evaluation. Finally, we presented its advantages compared to alternative approaches and showed how it is complementary to emerging Web Service-based management approaches.

6 References
[1] W. Stallings, SNMP, SNMPv2, SNMPv3, RMON 1 and 2. Addison Wesley, 1999.
[2] G. Oikonomou and T. Apostolopoulos, "WebDMF: A Web-based Management Framework for Distributed Services," in Proc. 2008 International Conference of Parallel and Distributed Computing (ICPDC 08), to be published.
[3] CIM Infrastructure Specification, DMTF Standard DSP0004, 2005.
[4] Representation of CIM in XML, DMTF Standard DSP0201, 2007.
[5] CIM Operations over HTTP, DMTF Standard DSP0200, 2007.
[6] A Grid Monitoring Architecture, Open Grid Forum GFD.7, 2002.
[7] A. W. Cooke, et al., "The Relational Grid Monitoring Architecture: Mediating Information about the Grid," Journal of Grid Computing, vol. 2, no. 4, pp. 323-339, 2004.
[8] K. Gor, D. Ra, S. Ali, L. Alves, N. Arurkar, I. Gupta, A. Chakrabarti, A. Sharma, and S. Sengupta, "Scalable enterprise level workflow and infrastructure management in a grid computing environment," in Proc. Fifth IEEE International Symposium on Cluster Computing and the Grid (CCGrid'05), Cardiff, UK, 2005, pp. 661-667.
[9] P.-C. Chen, J.-B. Chang, T.-Y. Liang, C.-K. Shieh, and Y.-C. Zhuang, "A multi-layer resource reconfiguration framework for grid computing," in Proc. 4th International Workshop on Middleware for Grid Computing (MGC'06), Melbourne, Australia, 2006, p. 13.
[10] T.-Y. Liang, C.-Y. Wu, J.-B. Chang, and C.-K. Shieh, "Teamster-G: a grid-enabled software DSM system," in Proc. Fifth IEEE International Symposium on Cluster Computing and the Grid (CCGrid'05), Cardiff, UK, 2005, pp. 905-912.
[11] I. C. Legrand, H. B. Newman, R. Voicu, C. Cirstoiu, C. Grigoras, M. Toarta, and C. Dobre, "MonALISA: An Agent based, Dynamic Service System to Monitor, Control and Optimize Grid based Applications," in Proc. Computing in High Energy and Nuclear Physics (CHEP), Interlaken, Switzerland, 2004.
[12] An Introduction to WSDM, OASIS committee draft, 2006.
[13] Web Services for Management (WS-Management), DMTF Preliminary Standard DSP0226, 2006.
[14] M. Kahani and P. H. W. Beadle, "Decentralised approaches for network management," ACM SIGCOMM Computer Communication Review, vol. 27, iss. 3, pp. 36-47, 1997.
[15] Common Manageability Programming Interface, The Open Group, C061, 2006.
[16] L. Peng, M. Koh, J. Song, and S. See, "Performance Monitoring for Distributed Service Oriented Grid Architecture," in Proc. 6th International Conference on Algorithms and Architectures for Parallel Processing (ICA3PP2005), 2005.
[17] WS-CIM Mapping Specification, DMTF Preliminary Standard DSP0230, 2006.
[18] WS-Management CIM Binding Specification, DMTF Preliminary Standard DSP0227, 2006.
[19] A. Vahdat, K. Yocum, K. Walsh, P. Mahadevan, D. Kostic, J. Chase, and D. Becker, "Scalability and Accuracy in a Large-Scale Network Emulator," in Proc. 5th Symposium on Operating Systems Design and Implementation (OSDI), December 2002.
[20] Host Resources MIB, IETF Request For Comments 1514, 1993.
SEMM: Scalable and Efficient Multi-Resource Management in Grids
Haiying Shen
Department of Computer Science and Computer Engineering, University of Arkansas, Fayetteville, AR 72701
Abstract - Grids connect resources to enable worldwide collaboration. Conventional centralized or hierarchical approaches to grid resource management are inefficient in large-scale grids. Distributed Hash Table (DHT) middleware overlays have been applied to grids as a mechanism for providing scalable multi-resource management. However, direct DHT overlay adoption breaks the physical locality relationship between nodes. This paper presents a Scalable and Efficient Multi-resource Management mechanism (SEMM), which collects resource information based on the physical locality relationship among resource hosts as well as the resource attributes. Simulation results demonstrate the effectiveness of SEMM in locality-awareness and overhead reduction in comparison with another approach. Keywords: Resource management, Resource discovery, Grid, Peer-to-Peer, Distributed Hash Table
1 Introduction
Grids enable the sharing, selection, and aggregation of a wide variety of resources to support worldwide collaboration. Scalable and efficient resource management is therefore vital to the performance of grids. As a successful model that achieves high scalability in distributed systems, Distributed Hash Table (DHT) middleware overlays [1, 2, 3, 4, 5] facilitate resource management in large-scale grid environments. However, direct DHT overlay adoption breaks the physical locality relationship of nodes in the underlying IP-level topology. Since resource sharing and communication among physically close nodes enhance resource management efficiency, it is desirable that the DHT middleware preserve the locality relationship of grid nodes. Most current DHT-based approaches for resource management are not sufficiently scalable and efficient: they let resources be shared at a system-wide scale, so a node may need to ask a node very far away for resources, resulting in inefficiency. Since a grid may be very large, neglecting resource host locality in resource management prevents the system from achieving higher scalability. Locality-aware resource management is thus critical to the scalability and efficiency of a grid system. To meet these requirements, we propose a Scalable and Efficient Multi-resource Management mechanism (SEMM), which is built on a DHT structure. SEMM provides locality-aware resource management by mapping physically close resource requesters and providers to each other, so that resources can be shared between physically close nodes and the efficiency of resource sharing is significantly improved. The rest of this paper is structured as follows. Section 2 presents a concise review of representative resource management approaches for grids. Section 3 introduces SEMM, focusing on its architecture and algorithms. Section 4 shows the performance of SEMM in comparison with another approach in terms of a variety of metrics. Section 5 concludes this paper.
2 Related Work
Over the past years, the immense popularity of grids has produced a significant stimulus to grid resource management approaches such as Condor-G [6], the Globus toolkit [7], Condor [8], Entropia [9], AppLeS [10], and Javelin++ [11]. However, relying on centralized or hierarchical policies, these systems have limitations in a large-scale, dynamic, multi-domain environment with varying resource availability. To cope with these problems, more and more grids resort to DHT middleware overlays for resource management. DHT overlays are an important class of peer-to-peer overlay networks that map keys to the nodes of a network based on a consistent hashing function [12]. Some DHT-based approaches adopt one DHT overlay for each resource, and process multi-resource queries in
parallel in the corresponding DHT overlays [13]. However, depending on multiple DHT overlays for multi-resource management leads to high structure maintenance overhead. Another group of approaches [14, 15, 16, 17] organizes all grid resources into one DHT overlay and assigns all information about a type of resource to one node. Such an approach results in an imbalanced load distribution among nodes, caused by information maintenance and resource scheduling. It also leads to a high cost for searching resource information among the huge volume of information held by a node. Moreover, few current approaches are able to deal with the locality feature of grids. Unlike most existing approaches, SEMM preserves the physical locality relationship between nodes in networks and achieves locality-aware resource management. This feature contributes to the high scalability and efficiency of SEMM in grid resource management.
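The key-to-node mapping that consistent hashing [12] gives a DHT can be sketched in a few lines. This is a toy identifier ring, not any particular overlay's real algorithm:

```python
# Minimal consistent-hashing sketch: keys and nodes are hashed onto the
# same identifier ring, and a key is assigned to the first node whose
# hash is >= the key's hash (wrapping around the ring).
import hashlib
from bisect import bisect_left

def h(value, bits=16):
    # Hash a string onto a 2^bits identifier space.
    return int(hashlib.sha1(value.encode()).hexdigest(), 16) % (2 ** bits)

def lookup(key, nodes):
    ring = sorted(h(n) for n in nodes)
    pos = bisect_left(ring, h(key)) % len(ring)   # wrap around the ring
    return next(n for n in nodes if h(n) == ring[pos])

nodes = ["node-a", "node-b", "node-c"]
owner = lookup("cpu-resource", nodes)
```

The attraction for dynamic grids is that when a node joins or leaves, only keys adjacent to it on the ring move.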
3 Scalable and Efficient Multi-Resource Management

3.1 Overview
SEMM is developed based on the Cycloid DHT overlay [5]. We first briefly describe the Cycloid DHT middleware overlay, followed by a high-level view of the SEMM architecture. Cycloid is a lookup-efficient constant-degree overlay with n = d · 2^d nodes, where d is its dimension. It achieves a time complexity of O(d) per lookup request by using O(1) neighbors per node. Each Cycloid node is represented by a pair of indices (k, a_{d−1} a_{d−2} . . . a_0), where k is a cyclic index and a_{d−1} a_{d−2} . . . a_0 is a cubical index. The cyclic index is an integer ranging from 0 to d − 1, and the cubical index is a binary number between 0 and 2^d − 1. The nodes with the same cubical index are ordered by their cyclic indices mod d on a small cycle, which we call a cluster. The node with the largest cyclic index in a cluster is called the primary node of the cluster. All clusters are ordered by their cubical indices mod 2^d on a large cycle. For a given key or node, its cyclic index is set to the hash value of the key or IP address modulo d, and its cubical index is set to the hash value divided by d. A key is assigned to the node whose ID is closest to the key's ID. Briefly, the cubical index represents the cluster in which a node or an object is located, and the cyclic index represents its position within the cluster. The overlay network provides two main functions, Insert(key, object) and Lookup(key), to store an object at the node responsible for the key and to retrieve the object. For more information about Cycloid, please refer to [5].
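The ID derivation just described (cyclic index = hash mod d, cubical index = hash divided by d) can be sketched in a few lines of Python. This is an illustrative sketch, not Cycloid's implementation; the dimension `D = 8` and the use of SHA-1 as the consistent hash are assumptions made here:

```python
import hashlib

D = 8  # assumed dimension for illustration; the ID space then has d * 2**d = 2048 IDs

def consistent_hash(key, d=D):
    """Hash a key or IP address string to an integer in [0, d * 2**d)."""
    digest = hashlib.sha1(key.encode()).hexdigest()
    return int(digest, 16) % (d * 2 ** d)

def cycloid_id(key, d=D):
    """Split the hash into a Cycloid ID (cyclic index, cubical index):
    cyclic = hash mod d, cubical = hash divided by d."""
    h = consistent_hash(key, d)
    return (h % d, h // d)

# hypothetical node IP, for illustration only
cyclic, cubical = cycloid_id("10.0.1.7")
```

The cyclic index then falls in [0, d) and the cubical index in [0, 2^d), matching the cluster/position interpretation above.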
3.2 Locality-aware Middleware Construction
Before we present the details of SEMM, let us introduce a landmarking method that represents node closeness on the network by indices. Landmark clustering has been widely adopted to generate proximity information [18, 19, 20, 21]. We assume m landmark nodes that are randomly scattered over the Internet. Each node measures its physical distances to the m landmarks and uses the vector of distances <d1, d2, ..., dm> as its coordinate in Cartesian space. Two physically close nodes will have similar vectors. We use space-filling curves [22], such as the Hilbert curve [19], to map m-dimensional landmark vectors to real numbers such that the closeness relationship among the nodes is preserved. We call this number the Hilbert number of the node, denoted by H; it indicates the physical closeness of nodes on the Internet. SEMM builds a locality-aware Cycloid architecture on a grid. Specifically, it uses grid node i's Hilbert number, Hi, as its cubical index, and the consistent hash value of node i's IP address as its cyclic index, to generate the node's ID. Recall that in a Cycloid ID, the cubical indices differentiate clusters and the cyclic indices differentiate node positions within a cluster. Therefore, physically close nodes with the same H will be in the same cluster, and those with similar H will be in nearby clusters in Cycloid. As a result, a locality-aware Cycloid is constructed, in which the logical proximity abstraction derived from the overlay matches the physical proximity information in reality.
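For intuition, the sketch below shows a standard two-dimensional Hilbert-curve index, i.e., it assumes m = 2 landmarks whose measured distances are quantized onto an n × n grid. It illustrates the locality-preserving mapping only; the paper's actual Hilbert numbers may come from higher-dimensional curves:

```python
def hilbert_index(n, x, y):
    """Map a point (x, y) on an n x n grid (n a power of two) to its index
    along the Hilbert curve, so physically close points tend to receive
    close indices -- the property SEMM relies on for its cubical indices."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) > 0 else 0
        ry = 1 if (y & s) > 0 else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:          # rotate the quadrant so the curve stays continuous
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        s //= 2
    return d

# Two nodes measure distances to m = 2 landmarks; the quantized vectors
# become grid points, and the Hilbert index becomes H.
h_a = hilbert_index(16, 3, 3)
h_b = hilbert_index(16, 3, 4)  # a nearby node, expected to get a nearby H
```

With such an H as the cubical index, nodes whose landmark vectors are similar tend to land in the same or adjacent Cycloid clusters.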
3.3 Resource Reporting and Query
We define resource information, represented by Ir, as the information about available resources and resource requests. It includes the information of the resource host, the resource ID (represented by IDr), etc. In DHT overlay networks, objects with the same key are stored at the same node. Based on this principle and the node ID determination policy, SEMM lets node i compute the consistent hash value of its resource r, denoted by Hr, and use (Hr, Hi) to represent IDr. The node uses the DHT overlay function Insert(IDr, Ir) to store the resource information at a node in its cluster. As a result, the information about the same type of resource held by physically close nodes is stored at the same repository node, and different nodes in one cluster are responsible for different types of resources within that cluster. Furthermore, resources whose Ir is stored in clusters near node i are located physically close to node i. A repository node periodically conducts resource scheduling between resource providers and requesters.
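The reporting scheme can be illustrated with a toy sketch in which a plain dictionary stands in for the DHT's Insert/Lookup routing. The function names, the SHA-1 hash, and the modulus used for Hr are assumptions for illustration, not SEMM's actual API:

```python
import hashlib

D = 8  # assumed Cycloid dimension, as in Section 3.1

def h(key, mod):
    """Consistent hash of a string key into [0, mod)."""
    return int(hashlib.sha1(key.encode()).hexdigest(), 16) % mod

# A dict stands in for DHT routing: ID -> list of stored resource information.
dht = {}

def report_resource(node_hilbert, resource, info):
    """A node with Hilbert number H_i reports resource r under
    ID_r = (H_r, H_i), so same-type resources of physically close
    nodes land at the same repository key."""
    id_r = (h(resource, D), node_hilbert)
    dht.setdefault(id_r, []).append(info)

def query_resource(node_hilbert, resource):
    """Lookup(H_r, H_i): retrieve information for resource r reported
    by nodes in the querier's own (physically close) cluster."""
    return dht.get((h(resource, D), node_hilbert), [])
```

Two nodes with the same Hilbert number reporting the same resource type hit the same repository entry; a node in a different cluster does not see it, which is exactly the locality behaviour described above.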
[Figure 1: (a) percentage of resource amount assigned (%) versus physical distance by hops, comparing Mercury and SEMM; (b) logical communication cost for requests versus the number of resources in each request (1 to 5), comparing SEMM and Mercury.]
Figure 1. Communication cost of different resource management approaches.
When node i queries for different resources, it sends out a request Lookup(Hr, Hi) for each resource r. Each request is forwarded to its repository node in node i's cluster, which replies to node i if it holds the information for the requested resource.
4 Performance Evaluation
We designed and implemented a simulator in Java to evaluate SEMM. We compared the performance of SEMM with Mercury [13]. Mercury uses multiple DHT overlays and makes each DHT overlay responsible for one resource. We used Chord for the attribute hubs in Mercury. We assumed that there are 11 types of resources, and used a Bounded Pareto distribution to generate the resource amounts owned and requested by a node. This distribution reflects the real world, where available resources vary by orders of magnitude. In the experiment, we generated 1000 requests and varied the number of resources in a resource request from 1 to 5 with a step size of 1. We used a transit-stub topology generated by GT-ITM [23] with approximately 5,000 nodes: "ts5k-large" has 5 transit domains, 3 transit nodes per transit domain, 5 stub domains attached to each transit node, and 60 nodes in each stub domain on average. "ts5k-large" represents a situation in which a grid consists of nodes from several big stub domains.
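A Bounded Pareto distribution can be sampled by inverse-transform sampling. The sketch below is an assumption about how such amounts might be drawn (the paper does not give its exact parameters); the shape parameter and range are hypothetical:

```python
import random

def bounded_pareto(alpha, lo, hi):
    """Draw one sample from a Bounded Pareto(alpha) on [lo, hi] by
    inverting its CDF, so generated resource amounts span several
    orders of magnitude, as in the experimental setup."""
    u = random.random()
    return lo / (1.0 - u * (1.0 - (lo / hi) ** alpha)) ** (1.0 / alpha)

# hypothetical parameters: shape 1.0, amounts between 1 and 10,000 units
random.seed(1)
amounts = [bounded_pareto(1.0, 1, 10_000) for _ in range(1000)]
```

By construction, u = 0 yields the lower bound and u → 1 approaches the upper bound, so every sample stays inside [lo, hi] while most mass concentrates near lo.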
4.1 Efficiency of Resource Management
In this experiment, we measured the cumulative distribution function (CDF) of the percentage of allocated resources. It reflects the effectiveness of a resource management mechanism at mapping physically close resource requesters and providers. We randomly generated 5000
resource requests, and recorded the distance between the resource provider and the resource requester of each request. Figure 1(a) shows the CDF of the percentage of allocated resources versus these distances for the different resource management approaches in "ts5k-large". We can see that SEMM is able to locate 97% of the total resources requested within 11 hops, while Mercury locates only about 15% within 10 hops. Almost all allocated resources are located within 15 hops of the requesters in SEMM, compared with 19 hops in Mercury. The results show that SEMM allocates most resources within short distances of the requesters, whereas Mercury allocates most resources at long distances from them. The more resources are located at short distances, the higher the locality-aware performance of a resource management mechanism. Using resources physically close to itself, a node can achieve higher efficiency in distributed operations such as distributed computing and data sharing. In addition, communicating with physically close nodes for resources reduces node communication cost. The results indicate that SEMM outperforms Mercury in terms of locality-aware resource management, which helps to achieve higher efficiency and scalability in a grid system.

A resource node needs to communicate with repository nodes for requested resources. Its request is forwarded over a number of hops determined by the DHT overlay routing algorithm. Thus, communication cost constitutes a major part of the resource management cost. In this test, we evaluated the communication cost of resource requesting. We define the logical communication cost as the product of the message size and the logical path length, in hops, that the message travelled. It represents resource management efficiency in terms of the number of messages and the number of nodes involved in forwarding resource queries. The size of a message is assumed to be 1 unit.

[Figure 2: average number of outlinks maintained per node versus the number of nodes (100 to 4100), comparing Mercury and SEMM.]
Figure 2. Overhead of different resource management approaches.

Figure 1(b) plots the logical communication cost versus the number of resource types in a request. In the experiment, resource searching stops once the requested resources are discovered. We can observe that SEMM incurs less cost than Mercury. The lookup path length is O(log n) in Chord, which is longer than the O(d) lookup path length in Cycloid. A request with m resources needs m lookups, which amplifies the difference in communication cost between Mercury and SEMM. Hence, by relying on the Cycloid DHT as the underlying structure for resource management, SEMM greatly reduces the node communication cost of resource management in a grid system.
4.2 Overhead of Resource Management
Since the resource management mechanisms depend on DHT overlays as middleware for resource management in grids, the maintenance overhead of the DHT overlays constitutes a major part of the overhead of resource management. In a DHT overlay, a node needs to maintain its neighbors in its routing table. The neighbors play an important role in guaranteeing successful message routing. We define the number of outlinks of a node as the number of the node's neighbors in its routing table, i.e., the routing table size maintained by the node. In this experiment, we measured the number of outlinks per node, which represents the overhead of maintaining the DHT resource management middleware architecture. Figure 2 plots the average number of outlinks maintained by a node in the different resource management approaches. The results show that each node in Mercury maintains dramatically more outlinks than a node in SEMM. Recall that Mercury has multiple DHTs, with each DHT overlay responsible for one resource. Therefore, a node has a routing table for each DHT overlay, and its number of outlinks equals the product of the routing table size and the number of DHT overlays. The results show that SEMM incurs less maintenance overhead than Mercury, which implies that SEMM scales well, with low DHT overlay maintenance cost in a large-scale grid.
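The outlink argument admits a back-of-the-envelope sketch. The log2(n) finger-table size comes from Chord's design and the constant 7 from Cycloid's constant-degree design; both are assumptions here, not measurements from the paper's simulator:

```python
import math

def mercury_outlinks(n_nodes, n_resources):
    """Mercury keeps one Chord-like overlay per resource, so a node's
    total outlinks are (finger-table size) x (number of overlays)."""
    return n_resources * int(math.log2(n_nodes))

def semm_outlinks():
    """Cycloid is constant-degree: each node keeps O(1) neighbors
    regardless of system size (7 in the Cycloid design)."""
    return 7
```

With 4096 nodes and 11 resource types this gives 132 outlinks per Mercury node versus a small constant for SEMM, consistent with the trend shown in Figure 2.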
5 Conclusions
The rapid development of grids requires a scalable and efficient resource management approach for high performance. This paper presents a Scalable and Efficient Multi-resource Management mechanism (SEMM) built on a DHT overlay. SEMM maps physically close resource requesters and providers to achieve locality-aware resource management, in which resource allocation and node communication are conducted among physically close nodes, leading to higher scalability and efficiency. Simulation results show the superiority of SEMM over another resource management approach in terms of locality-aware resource management, node communication cost, and maintenance cost of the underlying DHT structure.

Acknowledgements
This research was supported in part by the Acxiom Corporation.
References
[1] I. Stoica, R. Morris, D. Liben-Nowell, D. R. Karger, M. F. Kaashoek, F. Dabek, and H. Balakrishnan. Chord: A scalable peer-to-peer lookup protocol for Internet applications. IEEE/ACM Transactions on Networking, 11(1):17–32, 2003.
[2] S. Ratnasamy, P. Francis, M. Handley, R. Karp, and S. Shenker. A scalable content-addressable network. In Proc. of ACM SIGCOMM, 2001.
[3] A. Rowstron and P. Druschel. Pastry: Scalable, decentralized object location and routing for large-scale peer-to-peer systems. In Proc. of the IFIP/ACM International Conference on Distributed Systems Platforms (Middleware), pages 329–350, 2001.
[4] B. Y. Zhao, L. Huang, J. Stribling, S. C. Rhea, A. D. Joseph, and J. Kubiatowicz. Tapestry: An infrastructure for fault-tolerant wide-area location and routing. IEEE Journal on Selected Areas in Communications, 22(1):41–53, 2004.
[5] H. Shen, C. Xu, and G. Chen. Cycloid: A scalable constant-degree P2P overlay network. Performance Evaluation, 63(3):195–216, 2006. An early version appeared in Proc. of the International Parallel and Distributed Processing Symposium (IPDPS), 2004.
[6] J. Frey, T. Tannenbaum, I. Foster, M. Livny, and S. Tuecke. Condor-G: A computation management agent for multi-institutional grids. In Proc. of the 10th IEEE Symposium on High Performance Distributed Computing, 2001.
[7] I. Foster and C. Kesselman. Globus: A metacomputing infrastructure toolkit. International Journal of High Performance Computing Applications, 11(2):115–128, 1997.
[8] M. Mutka and M. Livny. Scheduling remote processing capacity in a workstation-processing bank computing system. In Proc. of the 7th International Conference on Distributed Computing Systems, September 1987.
[9] A. Chien, B. Calder, S. Elbert, and K. Bhatia. Entropia: Architecture and performance of an enterprise desktop grid system. Journal of Parallel and Distributed Computing, 63(5), May 2003.
[10] F. Berman, R. Wolski, H. Casanova, W. Cirne, H. Dail, M. Faerman, S. Figueira, J. Hayes, G. Obertelli, J. Schopf, G. Shao, S. Smallen, N. Spring, A. Su, et al. Adaptive computing on the Grid using AppLeS. IEEE Transactions on Parallel and Distributed Systems, 14(4), April 2003.
[11] M. O. Neary, S. P. Brydon, P. Kmiec, S. Rollins, and P. Cappello. Javelin++: Scalability issues in global computing. Future Generation Computer Systems, 15(5-6):659–674, 1999.
[12] D. Karger, E. Lehman, T. Leighton, M. Levine, D. Lewin, and R. Panigrahy. Consistent hashing and random trees: Distributed caching protocols for relieving hot spots on the World Wide Web. In Proc. of the 29th Annual ACM Symposium on Theory of Computing (STOC), pages 654–663, 1997.
[13] A. R. Bharambe, M. Agrawal, and S. Seshan. Mercury: Supporting scalable multi-attribute range queries. In Proc. of ACM SIGCOMM, pages 353–366, 2004.
[14] M. Cai, M. Frank, J. Chen, and P. Szekely. MAAN: A multi-attribute addressable network for grid information services. Journal of Grid Computing, 2004. An early version appeared in Proc. of GRID'03.
[15] A. Andrzejak and Z. Xu. Scalable, efficient range queries for grid information services. In Proc. of the 2nd International Conference on Peer-to-Peer Computing (P2P), pages 33–40, 2002.
[16] M. Cai and K. Hwang. Distributed aggregation algorithms with load-balancing for scalable grid resource monitoring. In Proc. of the IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2007.
[17] S. Suri, C. Tóth, and Y. Zhou. Uncoordinated load balancing and congestion games in P2P systems. In Proc. of the Third International Workshop on Peer-to-Peer Systems, 2004.
[18] S. Ratnasamy, M. Handley, R. Karp, and S. Shenker. Topologically-aware overlay construction and server selection. In Proc. of IEEE Conference on Computer Communications (INFOCOM), 2002.
[19] Z. Xu, M. Mahalingam, and M. Karlsson. Turning heterogeneity into an advantage in overlay routing. In Proc. of IEEE Conference on Computer Communications (INFOCOM), 2003.
[20] H. Shen and C.-Z. Xu. Hash-based proximity clustering for efficient load balancing in heterogeneous DHT networks. Journal of Parallel and Distributed Computing, 2008.
[21] H. Shen and C. Xu. Locality-aware and churn-resilient load balancing algorithms in structured peer-to-peer networks. IEEE Transactions on Parallel and Distributed Systems, 2007. An early version appeared in Proc. of ICPP'05.
[22] T. Asano, D. Ranjan, T. Roos, E. Welzl, and P. Widmayer. Space filling curves and their use in the design of geometric data structures. Theoretical Computer Science, 181(1):3–15, 1997.
[23] E. Zegura, K. Calvert, and S. Bhattacharjee. How to model an internetwork. In Proc. of IEEE Conference on Computer Communications (INFOCOM), 1996.
A QoS Guided Workflow Scheduling Algorithm for the Grid Fangpeng Dong and Selim G. Akl School of Computing, Queen’s University Kingston, ON Canada K7L3N6 {dong, akl}@cs.queensu.ca
Abstract - Resource performance in the Computational Grid is not only heterogeneous but also changes dynamically. However, scheduling algorithms designed for traditional parallel and distributed systems, such as clusters, consider only the heterogeneity of the resources. In this paper, a workflow scheduling algorithm is proposed. The algorithm uses given QoS guidance to allocate tasks to distributed computing resources whose performance is subject to dynamic change and can be described by predefined probability distribution functions. The new algorithm works in an offline way, which allows it to be set up and run easily and at low cost. Simulations have been conducted to test its performance under different workflow and resource settings. The algorithm can also easily be expanded to accommodate Service Level Agreement (SLA) based workflows. Keywords: Workflow scheduling algorithm, Grid, Quality of Service, stochastic
1 Introduction
The development of Grid infrastructures now enables workflow submission and execution on remote computational resources. To exploit the non-trivial power of Grid resources, effective task scheduling approaches are necessary. In this paper, we consider the scheduling problem for workflows that can be represented by directed acyclic graphs (DAGs) in the Grid. The objective function of the scheduling is to minimize the total completion time of all tasks (also known as the makespan) of a workflow. As the performance of Grid resources fluctuates dynamically, it is difficult to apply scheduling algorithms that were designed for traditional systems and treat performance as a known static parameter. Therefore, countermeasures have been introduced that capture relevant information about resource performance fluctuation (e.g., performance prediction [1]) or try to provide some guaranteed performance to users (e.g., resource reservation [2]). These approaches make it possible for Grid schedulers to get relatively accurate resource information. In this paper, we assume that resource performance can be described by some
probability mass functions (PMFs) which can be derived from past task execution records (e.g., log files) [8]. Since the performance information is not deterministic, the proposed algorithm takes an input parameter as a resource selection criterion (QoS). The algorithm is a list heuristic and consists of two phases: the task ranking phase and the task-to-resource mapping phase. In the ranking phase, tasks are ordered according to their priorities. The rank of a task is determined by the task's computational demand, the mathematical expectation of resource performance, and the communication cost of data transfer. In the mapping phase, the scheduler picks up unscheduled tasks in the order of their priorities and assigns them to resources according to the performance objective and the QoS guidance. The rest of this paper is organized as follows: in Section 2, related work is introduced; Section 3 presents the application and resource models used by the proposed algorithm; Section 4 describes the algorithm in detail; Section 5 presents simulation results and analysis; finally, conclusions are given in Section 6.
2 Related Work
The DAG-based task graph scheduling problem in parallel and distributed computing systems is an interesting research area, and algorithms for this problem keep evolving with computational platforms, from the age of homogeneous systems, to heterogeneous systems and today's computational Grids [6]. Due to its NP-complete nature [7], most algorithms are heuristic based, such as the widely cited HEFT algorithm [3]. In [9], the authors proposed a list DAG scheduling algorithm based on deterministic resource performance prediction. In [12], a robust resource allocation algorithm is proposed that uses the same resource performance model as this paper does; however, it only considers the scheduling of independent tasks. In [11], an SLA-based workflow scheduling algorithm is proposed. However, that algorithm does not use resource performance modeling explicitly and works in an online manner, which means it has to monitor the execution of tasks in a workflow continuously.
3 Models and Definitions

The target system of this paper consists of a group of computational nodes r1, ..., rn distributed in a Grid. Two nodes ri and rj can communicate with each other via a network connection wi,j. It is assumed that the available performance of both computational nodes and network connections is stochastic and follows some probability mass function (PMF). Fig. 1 presents an example of a PMF. Such a PMF can be obtained by investigating historical application running records using statistical measures; in this paper, it is assumed that these functions are already known by the scheduler. The PMF of the performance Pi of a computational node ri is denoted by PPi(x), and the PMF of the performance Bi,j of a network connection wi,j between ri and rj is denoted by PBi,j(x). It is assumed that for all 1 ≤ i, j ≤ n, the random variables Pi and Bi,j are independent.

[Figure 1: a bar chart of probability (0 to 1) versus available performance (roughly 4 to 25).]
Fig. 1: Performance probability mass function of a computational resource in the Grid.

In this paper, a workflow is assumed to be represented by a DAG G. An example DAG is presented in Fig. 2. A circular node ti in G represents a task, where 1 ≤ i ≤ v, and v is the number of tasks in the workflow. The number qi (1 ≤ i ≤ v) shown below ti represents the computational power consumed to finish ti. For example, in Fig. 2, q1 = 5. An edge e(i, j) from ti to tj means that tj needs an intermediate result from ti, so that tj ∈ succ(ti), where succ(ti) is the set of all immediate successor tasks of ti. Similarly, ti ∈ pred(tj), where pred(tj) is the set of immediate predecessors of tj. The weight of e(i, j) gives the size of the intermediate result (or communication volume, for simplicity) transferred from ti to tj. For example, the communication volume from t1 to t2 is one in Fig. 2.

[Figure 2: an example DAG with tasks t1 (q1 = 5), t2 (q2 = 8), t3 (q3 = 10), t4 (q4 = 5), t5 (q5 = 10), t6 (q6 = 8), t7 (q7 = 5), and t8 (q8 = 5), connected by directed edges labelled with communication volumes.]
Fig. 2: A DAG depicting a workflow, in which a node represents a task and a labelled directed edge represents a precedence order with a certain volume of intermediate result transportation.

Following the above performance and workflow models, the completion time Ci,k of task ti on resource rk and the communication cost Di,j,k,l for data transfer between task ti on resource rk and task tj on resource rl are also random variables, denoted by Eq. (1) and Eq. (2) respectively:

Ci,k = qi / Pk    (1)

Di,j,k,l = e(i, j) / Bk,l    (2)

According to (1) and (2), Ci,k and Di,j,k,l are independent random variables, as Pk and Bk,l are independent, and the PMFs of Ci,k and Di,j,k,l, PCi,k(x) and PDi,j,k,l(x), can easily be obtained from PPk(x) and PBk,l(x).
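Because performance is given as a PMF, Eq. (1) simply turns each performance value of a resource into a completion-time value with the same probability. A minimal sketch, assuming PMFs are represented as value → probability dictionaries (an illustrative representation, not the paper's):

```python
def completion_time_pmf(q, perf_pmf):
    """Derive the PMF of C = q / P (Eq. (1)) from the PMF of a
    resource's performance P, merging values that coincide."""
    pmf = {}
    for p, prob in perf_pmf.items():
        c = q / p
        pmf[c] = pmf.get(c, 0.0) + prob
    return pmf
```

For a task with q = 10 on a resource that runs at performance 5 or 10 with equal probability, the completion time is 2 or 1 with equal probability. The PMF of Di,j,k,l follows from Eq. (2) in exactly the same way, with e(i, j) in place of q and Bk,l in place of P.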
4 The QoS Guided Scheduling Algorithm
The primary objective of the proposed algorithm is to map tasks to proper computational resources such that the makespan of the whole workflow is as short as possible. As an instance of list heuristics, the proposed algorithm has two phases: the task ranking phase and the task-to-resource mapping phase. In the task ranking phase, the priority of a task ti in a given DAG is computed iteratively from the exit node of the DAG upwards to the entry node, as Eq. (3) shows:

rank(ti) = avg_ci + max_{tj ∈ succ(ti)} ( e(i, j) × avg_di,j + rank(tj) )    (3)

avg_ci = Avg_k( E(Ci,k) )

avg_di,j = Avg_{k,l}( E(Di,j,k,l) )
In Eq. (3), avg_ci is the average over the computational resources rk of the mathematical expectation of Ci,k (denoted E(Ci,k)). Similarly, avg_di,j is the average of the mathematical expectation of Di,j,k,l over every pair of resources rk and rl to which ti and tj could be mapped, respectively. According to Eq. (3), the priority value of a task is actually an estimate of the time from the start of ti to the completion of the whole workflow, based on the average expected performance of computational resources and network connections. Once the priorities of the tasks are known, the scheduler puts the tasks in a queue in non-ascending order (ties are broken randomly). In the mapping phase, the scheduler will fetch
unscheduled tasks from the head of the priority queue formed in the ranking phase and map each to a selected resource. Since the priorities are computed upwards, a task is guaranteed to have a higher priority than all of its successors, and is therefore mapped before any of them. This ordering eliminates the case in which a successor task occupies a resource while its predecessor is waiting for that resource, so deadlocks are avoided. For tasks that are not related to each other, this approach lets those farther from the exit task get resource allocations earlier, which in turn gives them a better chance of starting earlier and producing a shorter makespan. If resource performance is deterministic, a popular and easy way to schedule a task in a heterogeneous environment is to choose the resource that can complete the task the earliest, as the HEFT and PFAS [9] algorithms do. In the resource model used here, however, the performance is not deterministic. If only the best performance of a resource is considered, the schedule may suffer a long makespan in the real world because of the small probability of that performance being achieved. To overcome this difficulty, the mathematical expectation of the random performance could be used, as in the ranking phase. However, in a non-deterministic system, providing only an estimated mean value might not be sufficient, because the real situation might be quite different. From the users' point of view, the possibility of achieving a certain performance may matter more than a given static mean value. Therefore, this paper uses a flexible and adaptive approach that allows the user to provide a QoS requirement to guide the resource selection phase. To simplify the presentation, a binary mapping function M(ti, rk) is defined in Eq. (4):

M(ti, rk) = 1 if ti is mapped to rk, and 0 otherwise    (4)

In a workflow, the earliest start time EST(ti, rk) of a task ti on a resource rk depends on the data ready time DRT(ti, rk) and the resource ready time RRT(ti, rk):

EST(ti, rk) = max( DRT(ti, rk), RRT(ti, rk) )    (5)

The data ready time DRT(ti, rk) is determined by the time when ti receives its last input data from its predecessors. It can be expressed by Eq. (6):

DRT(ti, rk) = max_{tj ∈ pred(ti)} RT(tj, rk)    (6)
RT(tj, rk) = CT(tj) + Dj,i,l,k | M(tj, rl) = 1    (7)

RT(tj, rk) is the ready time of the intermediate data from predecessor tj of ti. CT(tj) is the completion time of tj, and Dj,i,l,k is the transfer time of the intermediate result from rl to rk, where rl is the resource that tj is mapped to. As all tasks mapped to the same resource are executed sequentially, the resource ready time RRT(ti, rk) is determined by the completion time of the last task in the job queue of rk. Let t′q be the last task currently in rk's job queue; then RRT(ti, rk) can be written as

RRT(ti, rk) = CT(t′q)    (8)

Finally, the estimated completion time ECT(ti, rk) of ti on rk is given by

ECT(ti, rk) = EST(ti, rk) + Ci,k    (9)

To meet a QoS requirement, we need to know the PMF of ECT(ti, rk), which depends on the PMFs of CT(tj), RT(tj, rk) and DRT(ti, rk). Since all predecessors of ti have been scheduled by the time ti is being scheduled, the PMF of CT(tj) is already known (see Eq. (17)), and so is the PMF of RRT(ti, rk). According to probability theory, the PMF of the sum of two independent discrete random variables is the discrete convolution of their PMFs. Therefore, according to Eq. (7), the PMF of RT(tj, rk) can be expressed as:

PRTj,k(x) = Σ_{i=0}^{x} PCTj(i) · PDj,i,l,k(x − i)    (10)
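The discrete convolution in Eq. (10) (and later in Eq. (15)) can be sketched with the same value → probability dictionary representation assumed earlier; this is an illustration, not the paper's implementation:

```python
def convolve(pmf_a, pmf_b):
    """PMF of the sum of two independent discrete random variables:
    the discrete convolution of their PMFs, as used in Eq. (10)."""
    out = {}
    for a, pa in pmf_a.items():
        for b, pb in pmf_b.items():
            out[a + b] = out.get(a + b, 0.0) + pa * pb
    return out
```

For example, the sum of two independent variables that are each 0 or 1 with equal probability takes values 0, 1, 2 with probabilities 0.25, 0.5, 0.25.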
Again, by probability theory, the probability distribution function (also known as the cumulative distribution function, CDF) of the maximum of a set of independent variables is the product of the distribution functions of those variables. Let FESTi,k, FDRTi,k and FRRTi,k be the CDFs of EST(ti, rk), DRT(ti, rk), and RRT(ti, rk) respectively. The following equation can be obtained according to Eq. (5):

FESTi,k(x) = FDRTi,k(x) · FRRTi,k(x)    (11)

Similarly, FDRTi,k can be obtained from Eq. (12):

FDRTi,k(x) = FRT(t′1, rk)(x) · · · FRT(t′m, rk)(x), where t′1, ..., t′m ∈ pred(ti)    (12)

For a discrete random variable X, its CDF F(x) can be obtained from its PMF P(x) by Eq. (13):

F(x) = Pr(X ≤ x) = Σ_{xi ≤ x} P(xi)    (13)

Conversely, if F(x) is known, the PMF P(x) can be obtained as

P(xi) = F(xi) − F(xi−1)    (14)

By Eq. (13), the CDF of RT(tj, rk) can be acquired using the result of Eq. (10), and so can the CDF of RRT(ti, rk). The PMF of DRT(ti, rk) can then be obtained from Eq. (10), Eq. (12) and Eq. (14). Following the same procedure, the PMF of EST(ti, rk) can be obtained, which is denoted PESTi,k(x). According to Eq. (9), the PMF of ECT(ti, rk) can then be expressed as:

PECTi,k(x) = Σ_{i=0}^{x} PESTi,k(i) · PCi,k(x − i)    (15)

Let FECTi,k be the CDF of ECT(ti, rk); it can be obtained from the PMF given in Eq. (15). Now, given a QoS requirement Q as a percentage, the scheduler first finds a value T in the CDF of ECT whose cumulative probability is greater than Q (Fig. 3). Let ect(ti, rk)l be the l-th possible value of ECT and pl be its probability; then the mathematical expectation of the values to the left of T (including T itself), denoted R(ti, rk), is given by Eq. (16). By this means, it covers at least the lower Q percent of the ECT value distribution.
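Eqs. (11)–(13) can be sketched directly: accumulate a PMF into a CDF, then take the product of step-function CDFs over the union of their supports. The dictionary representation is an assumption for illustration:

```python
import math

def pmf_to_cdf(pmf):
    """Eq. (13): accumulate a PMF into a CDF over its sorted support."""
    cdf, acc = {}, 0.0
    for x in sorted(pmf):
        acc += pmf[x]
        cdf[x] = acc
    return cdf

def cdf_of_max(cdfs):
    """Eqs. (11)/(12): the CDF of the max of independent variables is
    the product of their CDFs, evaluated on the union of supports."""
    xs = sorted(set().union(*map(set, cdfs)))
    def step(cdf, x):  # value of a right-continuous step CDF at x
        v = 0.0
        for xi in sorted(cdf):
            if xi <= x:
                v = cdf[xi]
        return v
    return {x: math.prod(step(c, x) for c in cdfs) for x in xs}
```

For the max of two independent fair 0/1 variables, the resulting CDF is 0.25 at 0 and 1.0 at 1, i.e., the max is 1 with probability 0.75, exactly as Eq. (14) would recover from the product CDF.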
R(ti, rk) = Σ_{ect(ti,rk)l ≤ T} pl · ect(ti, rk)l    (16)
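The ceiling-point selection and Eq. (16) can be sketched as follows, following the text literally in using a strict "cumulative probability greater than Q" test (the dictionary PMF representation remains an illustrative assumption):

```python
def qos_guided_estimate(ect_pmf, Q):
    """Eq. (16): find the first ECT value T whose cumulative probability
    exceeds Q, then return the expectation of the ECT values up to and
    including T. The resource minimizing this value is selected."""
    acc, T = 0.0, max(ect_pmf)
    for x in sorted(ect_pmf):
        acc += ect_pmf[x]
        if acc > Q:
            T = x
            break
    return sum(p * x for x, p in ect_pmf.items() if x <= T)
```

With ECT values 1, 2, 3 having probabilities 0.3, 0.3, 0.4 and Q = 50%, the ceiling point is T = 2 (cumulative 0.6 > 0.5), so R = 0.3·1 + 0.3·2 = 0.9; with Q = 95% the whole distribution is covered and R equals the full expectation 2.1.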
Fig. 3: An example of the CDF (A) and PMF (B) of EST. Given the QoS requirement Q, the ceiling point is the left end of the first CDF interval above Q. Only the ECT instances and their probabilities to the left of the ceiling point (shaded bars in (B)) are considered when the scheduler selects a resource for the current task.

Input: G, Q, PMFs of ri and wi,j, 1 ≤ i, j ≤ n.
Output: a schedule of G onto r1, ..., rn.
1. Compute the rank of each task in G using Eq. (3), and order the tasks into a queue J in non-ascending order of their ranks.
2. while (J is not empty) do
3.   Pop the first task t from J;
4.   for every resource r
5.     Compute the PMF of RRT(t, r); // Eq. (8)
6.     for every t′ ∈ pred(t)
7.       Compute the PMF of RT(t′, r); // Eq. (10)
8.     end for
9.     Compute the PMF of DRT(t, r); // use results of Line 7, Eq. (12) and Eq. (14)
10.    Compute the PMF of ECT(t, r); // Eq. (15)
11.    Compute R(t, r), using Q and the PMF of ECT(t, r); // Eq. (16)
12.  end for
13.  Find the resource r′ with R(t, r′) = min_r R(t, r) and insert t into the job queue of r′;
14. end while
Fig. 4: Pseudo code of the QoS guided workflow scheduling algorithm.
The scheduler then chooses the lowest of all the R(ti, rk) values, say R(ti, rk′), and maps task ti to rk′. At this point, the PMF of CT(ti) is known:

PCTi(x) = PECTi,k′(x)    (17)

When the exit node tv of the whole graph has been scheduled, the algorithm stops. From the PMF PCTv(x), we can tell the probability of different makespans of the workflow. The pseudocode of the above procedure is given in Fig. 4.
5 Experiments
In the simulation, the basic resource topology is generated by a toolkit called GridG 1.0 [4], which can generate heterogeneous computational resources and network connections. Based on this basic resource setup, two groups of experiments are run: one assumes that resource performance follows a uniform distribution, and the other assumes it follows a discrete normal distribution. The workflow graphs used in the simulation are generated by a program called Task Graphs For Free (TGFF) [5]. TGFF can generate a variety of task graphs according to different configuration parameters, e.g., the number of task nodes in each graph. Two groups of graphs are used in the simulation. One group is randomly generated, and the other is generated to simulate real workflow applications in the Grid, such as BLAST [10], which has balanced parallel task chains in its DAGs. The performance of the algorithms tested is measured by the scheduled length ratio (SLR), which is a normalized value of the makespan:

SLR = Real Makespan / Estimated Minimum Critical Path Length

In each experiment, five groups of task graphs are used, with 40, 80, 160, 320 and 640 task nodes, respectively. On each task group, the HEFT algorithm and the QoS guided algorithms are tested. For the HEFT algorithm, the mathematical expectation of the resource performance is applied. For the QoS algorithm, two QoS values are applied, 80% and 50%, which are denoted as QoS 1 and QoS 2 respectively in Fig. 5~Fig. 8. The results of the experiments on randomly generated graphs are presented in Fig. 5 and Fig. 6. In Fig. 5, the resource performance follows a uniform distribution. It can be observed that all algorithms perform worse as the number of tasks in a workflow increases.
Due to the nature of these heuristic algorithms, the longer the critical path in a graph, the more cumulative error they introduce when computing task priorities, and the higher the probability that they choose sub-optimal mappings. In Fig. 5, QoS 1 achieves the best performance among the three strategies. The HEFT algorithm trails QoS 1 by a small margin and is closely followed by QoS 2, which uses 50% as the selection criterion. In Fig. 6, HEFT and QoS 1 obtain almost the same results, while the performance of QoS 2 is significantly degraded. By filtering the 20% worst performance cases out of a uniform distribution, the expected performance can be improved noticeably, and the resources selected in this way have a good chance (a probability of 80%) of achieving better performance than the mean value used by HEFT. This explains why QoS 1 can perform better than HEFT in Fig. 5. On the other hand, since QoS 2 sets the selection criterion at 50%, it may cut away too much of the random domain and therefore suffers a higher probability of inaccurate estimates in practice. The shortcoming of a too optimistic criterion (a low QoS value) is even more obvious in Fig. 6, where the resource performance follows a normal distribution. In such a distribution, the mean value of a PMF happens to be the value with the highest probability; therefore, the HEFT algorithm performs well in this situation.

Fig. 5: Simulation result for a uniform performance distribution and randomly generated graphs (SLR vs. number of task nodes: 40, 80, 160, 320, 640; curves for HEFT, QoS 1, QoS 2).

Fig. 6: Simulation result for a discrete normal distribution of performance and randomly generated graphs.

Fig. 7 and Fig. 8 show the results for task graphs with multiple balanced parallel task chains. The three scheduling approaches behave as they did in the previous experiments: the HEFT algorithm still performs best under the normal distribution, QoS 1 performs close to HEFT, and QoS 2 suffers from its too optimistic resource selection criterion. It is worth noting that, in all cases, the SLR is lower than in Fig. 5 and Fig. 6. This is due to the balanced structure of the task graphs, which makes the lengths of all paths from the start task to the exit task roughly identical, so the possibility of sub-optimal task ranking decreases.

Fig. 7: Simulation result for a uniform performance distribution and multi-parallel-way graphs.

Fig. 8: Simulation result for a discrete normal distribution of performance and multi-parallel-way graphs.
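The effect of the QoS criterion can be illustrated numerically. The sketch below contrasts a mean-based estimate (as HEFT uses) with a percentile-based one; the PMF values and the mapping of the QoS percentage to a cumulative-probability cutoff are illustrative assumptions, not the paper's exact procedure.

```python
# Mean-based vs. QoS-percentile estimates of a resource's execution time
# from a discrete PMF (hypothetical numbers, for illustration).

def mean_estimate(pmf):
    # HEFT-style: use the mathematical expectation of the PMF.
    return sum(t * p for t, p in pmf.items())

def qos_estimate(pmf, qos):
    # QoS-style: smallest time t with cumulative probability >= qos,
    # i.e. the resource finishes by t with probability at least qos.
    acc = 0.0
    for t in sorted(pmf):
        acc += pmf[t]
        if acc >= qos:
            return t
    return max(pmf)

pmf = {10: 0.25, 20: 0.25, 30: 0.25, 40: 0.25}   # uniform over 4 values
mean_estimate(pmf)        # 25.0
qos_estimate(pmf, 0.8)    # 40 -> bound that holds with 80% confidence
```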
6 Conclusions
In this paper, a QoS guided workflow scheduling algorithm is proposed and tested by simulation. The algorithm can be applied in Grid computing scenarios in which resource performance is not deterministic but changes according to certain random probability mass functions. The contribution of this work is twofold. Firstly, the procedures for obtaining the PMF of the makespan of a workflow are presented in detail. Since the probabilities of different completion times are known, more sophisticated algorithms can easily be developed (although this is not covered in this paper). For example, if the deadline for finishing a workflow is given by the user, the scheduler will be able to tell the probability of meeting the deadline under different schedules and react accordingly. This is very important, as SLAs are becoming a popular way of allocating resources in the Grid. Secondly, the proposed algorithm uses QoS guidance to find the task-to-resource mapping, and the effects of different QoS settings under different resource performance distributions are tested. Our future work includes developing new algorithms that consider SLA scenarios and testing the QoS guided method with more probability distribution functions.
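The deadline use-case mentioned above reduces to summing the makespan PMF up to the deadline. A minimal sketch with hypothetical numbers:

```python
# Once the workflow makespan PMF is known, the probability of meeting a
# user-given deadline is the cumulative probability up to that deadline.

def prob_meets_deadline(pmf, deadline):
    # pmf: {makespan: probability}
    return sum(p for t, p in pmf.items() if t <= deadline)

pmf = {90: 0.2, 100: 0.5, 120: 0.3}   # illustrative makespan PMF
prob_meets_deadline(pmf, 100)         # 0.2 + 0.5 = 0.7
```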
References:

[1] L. Yang, J. M. Schopf and I. Foster. "Conservative Scheduling: Using Predicted Variance to Improve Scheduling Decisions in Dynamic Environments". Proc. of Supercomputing 2003, pp. 31-46, November 2003.
[2] G. Mateescu. "Quality of Service on the Grid via Metascheduling with Resource Co-Scheduling and Co-Reservation". International Journal of High Performance Computing Applications, Vol. 17, No. 3, pp. 209-218, 2003.
[3] H. Topcuoglu, S. Hariri and M.Y. Wu. "Performance-Effective and Low-Complexity Task Scheduling for Heterogeneous Computing". IEEE Trans. on Parallel and Distributed Systems, Vol. 13, No. 3, pp. 260-274, 2002.
[4] D. Lu and P. Dinda. "Synthesizing Realistic Computational Grids". Proc. of ACM/IEEE Supercomputing 2003, Phoenix, AZ, USA, 2003.
[5] R.P. Dick, D.L. Rhodes and W. Wolf. "TGFF: Task Graphs for Free". Proc. of the 6th International Workshop on Hardware/Software Co-design, 1998.
[6] F. Dong and S. G. Akl. "Grid Application Scheduling Algorithms: State of the Art and Open Problems". Technical Report No. 2006-504, School of Computing, Queen's University, Canada, Jan. 2006.
[7] H. El-Rewini, T. Lewis, and H. Ali. Task Scheduling in Parallel and Distributed Systems. PTR Prentice Hall, ISBN 0130992356, 1994.
[8] H. Li, D. Groep, L. Wolters. "Mining Performance Data for Metascheduling Decision Support in the Grid". Future Generation Computer Systems, Vol. 23, pp. 92-99, Elsevier, 2007.
[9] F. Dong and S. G. Akl. "PFAS: A Resource-Performance-Fluctuation-Aware Workflow Scheduling Algorithm for Grid Computing". Proc. of the 16th Heterogeneous Computing Workshop, International Parallel and Distributed Processing Symposium (IPDPS), Long Beach, CA, USA, March 2007.
[10] D. Sulakhe, A. Rodriguez, et al. "GNARE: an Environment for Grid-Based High-Throughput Genome Analysis". Proc. of CCGrid'05, pp. 455-462, Cardiff, UK, May 2005.
[11] D.M. Quan and J. Altmann. "Mapping of SLA-based Workflows with Light Communication onto Grid Resources". Proc. of the 4th International Conference on Grid Service Engineering and Management, Leipzig, Germany, Sept. 2007.
[12] V. Shestak, J. Smith, et al. "A Stochastic Approach to Measuring the Robustness of Resource Allocations in Distributed Systems". Proc. of the International Conference on Parallel Processing, pp. 459-470, Columbus, OH, USA, Aug. 2006.
A Grid Resource Broker with a Dynamic Loading Prediction Scheduling Algorithm in a Grid Computing Environment

Yi-Lun Pan 1, Chang-Hsing Wu 1, and Weicheng Huang 1
1 National Center for High-Performance Computing, Hsinchu, Taiwan

Abstract - A Grid Computing environment raises various important issues, including information security, resource management, routing, and fault tolerance. Among these, job scheduling is a major problem, since it is a fundamental and crucial step in achieving high performance. Scheduling in a Grid environment can be regarded as an extension of the scheduling problem on local parallel systems. The NCHC research team developed a resource broker with the proposed scheduling algorithm, which uses adaptive resource allocation and dynamic loading prediction functions to improve the performance of dynamic scheduling. The resource broker is based on the framework of GridWay, a meta-scheduler for the Grid, so it can be deployed in a real Grid Computing environment. Furthermore, the algorithm takes into account several properties of the Grid Computing environment, including heterogeneity, dynamic adaptation, and dependencies between jobs.

Keywords: Job Scheduling, Resource Broker, Metascheduler, Grid Computing.
1 Introduction
In the early 1990s, Internet technology achieved a brilliant success with the birth of a new computing environment, the Grid, which is composed of heterogeneous platforms, geographically distributed resources, and dynamic networked resources [1]. The infrastructure of Grid Computing is intended to offer seamless and unified access to geographically distributed resources connected via the Internet, so that facilities can be utilized more efficiently to help application scientists and engineers perform their work, such as the so-called "grand challenge problems". The distributed resources involved in a Grid environment are loosely coupled to form a virtual machine, and each resource is managed by its own local authority independently, rather than being centrally controlled. The ultimate target is to provide a mechanism such that, once users specify their resource requirements, the virtual computing sites will allocate the most appropriate physical computing sites to carry out the execution of the application.
Therefore, the research team developed a Grid Resource Broker (GRB) based on the framework of GridWay [2], a meta-scheduler for the Grid; GridWay in turn drives the Globus Toolkit [3] to enable actual Grid Computing. A scheduling algorithm and its implementation, the Scheduling Module, were developed. According to the criteria required by each job, the proposed scheduling algorithm uses dynamic loading prediction and adaptive resource allocation functions to meet users' requirements. The major tasks of the proposed Grid Resource Broker are to dynamically identify and characterize the available resources and to correctly monitor the queue status of each local scheduler. Finally, the presented scheduling algorithm helps to select and allocate the most appropriate resources for each given job. Its aim is to minimize the total time to delivery under each user's individual requirements and criteria. These issues motivated our development. Specifically, the design and implementation of the proposed resource broker include dynamic loading prediction and adaptive resource selection functions. Moreover, the scheduling algorithm considers not only the general features of a resource broker but also its "dynamic" feature: the proposed resource broker monitors the status of each local queue and provides resources dynamically according to users' criteria. Therefore, it can select resources efficiently and automatically. Further, an effective job scheduling algorithm can also improve performance and integrate resources to serve remote users efficiently in the heterogeneous and dynamic Grid Computing environment [4].
2 State of the Art

2.1 Grid computing
The Grid is a computing infrastructure that gives users the ability to dynamically combine distributed resources into an ensemble to support the execution of applications that need resources at large scale. As Ian Foster redefined it, "Grid Computing is concerned with coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organizations" [5]. The key concept of grid computing is its ability to deal with resource sharing and resource management [6]. This technology can deliver powerful computing ability by integrating computing resources from all over the world. The Grid software is not a single, monolithic package, but rather a collection of interoperating software packages. The characteristics of the Grid computing environment are: 1) the environment is heterogeneous, and applications submitted by all kinds of users and web sites can share the resources of the entire Grid computing environment; 2) the resources are dynamic and complex, which makes scheduling a major issue in the environment. This paper focuses on the scheduling of dependent jobs, i.e., jobs that may affect or be correlated with one another through their execution order. Scheduling helps Grid Computing increase and integrate the utilization of local and remote computing resources, and can therefore improve performance in the dynamic grid computing environment.
2.2 Existing Resource Brokers and Scheduling Algorithms
Grid scheduling for a resource broker is defined as the process of making scheduling decisions involving resources over multiple administrative domains [7]. Previous research identifies three important features of a grid resource broker: resource discovery, resource selection, and job execution. Much research on resource brokers is ongoing to provide access to resources for different applications, for example Condor-G, the EDG Resource Broker, AppLeS, and so on [8], [9]. These resource brokers can also monitor computing resource information and perform resource selection. Nevertheless, they neither monitor dynamic queuing information nor use it to derive precise and effective scheduling policies. On the scheduling algorithm side, dynamic job scheduling is a crucial and fundamental issue in a Grid Computing environment. The purpose of job scheduling is to find a dynamic and optimal method of resource allocation. Mostly, researchers apply traditional job scheduling strategies that allocate computing resources statically, such as list heuristics and list scheduling (LS) [10]. These algorithms focus on the static allocation of machines, whereas the presented research focuses on dynamic job scheduling of each job according to users' requirements and criteria.

Definition 1: The list heuristic scheduling algorithms have three variants. First-Come-First-Serve (FCFS): the scheduler starts jobs in the order of their arrival time. Random: the next job to be scheduled is randomly selected among all jobs; no job is preferred. Backfilling [11]: an out-of-order variant of FCFS scheduling that tries to prevent unnecessary idle time; there are two kinds of backfilling, EASY backfilling and conservative backfilling.

Furthermore, most existing research assumes that jobs are executed independently and statically [12], [13], [14]. In fact, these assumptions are not appropriate in a Grid Computing environment, since jobs are often dependent and dynamic. In this research, the proposed algorithm is designed to schedule dependent jobs dynamically. The Performance Evaluation section below tests it with dependent jobs, namely Computational Fluid Dynamics (CFD) Message Passing Interface (MPI) programs.
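The first two variants in Definition 1 can be sketched in a few lines; the job tuples are hypothetical data, and a full backfilling scheduler is omitted for brevity.

```python
import random

# Toy illustration of the list-heuristic variants in Definition 1.
# Jobs are (arrival_time, name) tuples; the data is hypothetical.

def fcfs_order(jobs):
    # FCFS: start jobs in the order of their arrival time.
    return [name for _, name in sorted(jobs)]

def random_order(jobs, seed=0):
    # Random: the next job is picked uniformly among all waiting jobs.
    rng = random.Random(seed)
    names = [name for _, name in jobs]
    rng.shuffle(names)
    return names

jobs = [(3, "C"), (1, "A"), (2, "B")]
fcfs_order(jobs)   # ['A', 'B', 'C']
```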
3 Proposed Scheduling Algorithm of the Grid Resource Broker (GRB)

3.1 Research Objective
The heterogeneous and dynamic properties of the Grid are considered when designing the job scheduling algorithm, so that the proposed algorithm can schedule jobs to achieve the minimum makespan (defined in Definition 2), i.e., to minimize the total time to delivery under each user's individual requirements and criteria. The main contribution of this work is to present a Grid Resource Broker (GRB) with the designed job scheduling algorithm. The GRB provides a reliable mechanism such that, once users specify their resource requirements, the virtual computing sites allocate the most appropriate physical computing sites to carry out the execution of the application.

Definition 2: The completion time is defined as the time from when a job is assigned to a machine until the time the job finishes. The completion time is also called the makespan.
3.2 Model and Architecture
In our simplified model of Grid Computing, the distributed computing resources are connected by high-speed networks. One important middleware component, the resource broker, plays an essential role in such an environment. The responsibilities of a resource broker are to discover where computing resources are located, store that information, and satisfy users' requirements for computing resources. The dynamic loading prediction job scheduling algorithm should therefore utilize the available computational resources fairly and efficiently in the Grid. The proposed algorithm is designed for resource brokers across the whole Grid, and its purpose is to help improve the performance of job scheduling. The proposed scheduling algorithm preserves good scheduling sequences, i.e., optimal or near-optimal solutions that yield the best or better host candidates, while the presented Scheduling Module discards unfit scheduling sequences during the search process. The resource broker architecture implemented at NCHC is shown in FIG. 1.
FIG. 1 System Architecture

Grid users submit jobs through the Grid Resource Broker via the Grid Portal, as shown in FIG. 1. The System Architecture contains five functional components and two databases. First, the Grid Portal serves as the interface to the users: Grid jobs are submitted via the Grid Portal, which passes them to the Resource Broker to drive the backend resources. Resource and job status are displayed back to the portal via the Resource Monitor and Job Monitor modules. Second, the Resource Monitor (RM) queries the status of resources from the InformationDB and displays the results on the Grid Portal, keeping users' knowledge about resource status up to date. Third, the Job Monitor is similar to the Resource Monitor: it accesses job information from the JobsDB, which maintains up-to-date information about Grid jobs. Fourth, the Grid Resource Broker (GRB) is the core of the system architecture. It takes the requirements of Grid jobs and compares them with the resource information provided by the Information System, so that it can automatically select the most appropriate resources to meet the jobs' requirements; it then dispatches jobs to local schedulers for processing. The mechanism of the GRB is explained in detail later. The last component, the Information System, periodically collects dynamic information about local computing resources and updates their status in the InformationDB, which in turn is queried by the RM to serve the users. The Information System also responds to queries from the GRB and provides the resource status. The dynamic information about local resources, in XML format, is queried from Ganglia, the Network Weather Service (NWS), MDS2, MDS4, and so on. The first database is the JobsDB, which is updated by the Job Monitor module at a pre-defined time interval; this information is used by the Job Monitor component to respond to queries from the GRB. The second database is the InformationDB, which stores all data provided by the Information System. To handle job submission, the GRB adopts the presented Scheduling Module to find an appropriate scheduling sequence and then dispatches jobs to the local schedulers. The most important part, the core of the resource broker, is the proposed new Scheduling Module illustrated in FIG. 2. Its major characteristic is the presented scheduling policy, the Dynamic Loading Prediction Scheduling (DLPS) algorithm, which provides dynamic loading prediction and adaptive resource allocation functions. The scheduling policy is described in a later section.
FIG. 2 Grid Resource Broker Architecture (the Resource Broker Core connects its input and output to the Grid File Transfer, Grid Execution, and Grid Information Services)
3.3 Grid Resource Broker Architecture
This section explains the overall architecture of the GRB and the function of each component. As shown in FIG. 2, upon receiving a job request from a user, the Request Manager manages the job submission with the proper job attributes; the user, in turn, retains control over the jobs submitted through the Request Manager. Following the Request Manager, the Dispatch Manager invokes the Scheduling Module to acquire the resource list, which is based on the criteria posted by the job as well as the resource status. With the suggestion from the Scheduling Module, the job is then dispatched. The Dispatch Manager also provides the ability to reschedule jobs and to report pending status back to the Request Manager. The Scheduling Module has three main functions: the Job Priority Policy, the Resource Selection, and the Dynamic Resource Selection. The Job Priority Policy is responsible for initializing the selection process with an existing policy, such as the presented DLPS policy, FCFS, Round-Robin, Backfill, Small-Job-First, Big-Job-First, and so on. The Resource Selection provides resource recommendations based on static information such as the hardware specification, specific software, and High-Performance Linpack Benchmark results. The Dynamic Resource Selection issues suggestions based on dynamic information such as the user's application requirements, the network status, and the workload of individual machines. With the combined efforts of these three components, the Scheduling Module provides automatic scheduling and a re-scheduling mechanism: automatic scheduling chooses the most appropriate resource candidate, followed by the second best choice and so on, while re-scheduling compensates for mis-selection of resources by users. Once the Scheduling Module provides the best selection of resources, the process is passed to the Execution Manager, the Transfer Manager, and the Information Manager, which drive the Middleware Access Driver (MAD) layer to submit, control, and monitor the resource fabric layer of the Grid Computing environment.
3.4 Scheduling Policy - Dynamic Loading Prediction Scheduling Algorithm (DLPS)

The proposed job scheduling algorithm of the Scheduling Policy in the GRB is called Dynamic Loading Prediction Scheduling (DLPS). The objective function of DLPS is to achieve the minimized makespan, described by the following equation (1):

M* = Min[ max(d_k) - min(s_k) ]   (1)

Definition 3: M* denotes the minimized makespan. To predict precisely, the equation uses d_k and s_k. d_k is the maximum job ending time of the k-th job, i.e., the time at which the job completes, and s_k is the minimum job submitting time of the k-th job, i.e., the time stamp at which the user submits the k-th job.

The inputs capturing users' requirements come from the Resource Specification Language (RSL), job templates, or the Grid Portal form. The operations of DLPS are to select, schedule, reschedule, and submit jobs to the most appropriate resources. The logical flow of DLPS is illustrated in FIG. 3. First, DLPS retrieves resource information from the Information System and filters out unsuitable resources with the adaptive resource allocation function. Afterwards, DLPS compares the free nodes with the required nodes. If the current free nodes are sufficient, DLPS assigns a higher weight (defined in Definition 4). Otherwise, DLPS enters the dynamic loading prediction function with the EstBacklog and minimum Job Expansion Factor methods (defined in Definition 5 and Definition 6) to predict which computing resource will respond to and execute the job most quickly, and then calculates the weight. DLPS finally ranks all available resources and selects the most appropriate ones for dispatching the job.
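Equation (1) and Definition 3 can be made concrete with a short helper; the job list below is hypothetical.

```python
# Sketch of the objective in equation (1): the makespan is the span
# between the earliest submission s_k and the latest completion d_k.

def makespan(jobs):
    # jobs: list of (submit_time s_k, end_time d_k) pairs
    return max(d for _, d in jobs) - min(s for s, _ in jobs)

jobs = [(0, 50), (5, 40), (10, 65)]
makespan(jobs)   # 65 - 0 = 65
```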
FIG. 3 The Logical Flow Chart of DLPS

Each step of the DLPS algorithm is described as follows. Step 1: Process the user's criteria and job requirements from the RSL, job template, or Portal form specification, including the High-Performance Linpack Benchmark, data set, execution programs, queue type, etc. Step 2: Communicate with the GRIS of each resource to obtain static and dynamic resource information. Step 3: (1) Store the features and status of each cluster in the InformationDB through the Information System; (2) filter out unsuitable resources with the adaptive resource allocation function. Step 4: Compare the free nodes with the required nodes. If the current free nodes are sufficient, DLPS calls the weighted function to calculate weight_a and weight_b (defined in Definition 4).
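Steps 3-4 can be condensed into a sketch; the resource records are hypothetical stand-ins, and treating a larger weight as "preferred" is our assumption for illustration, not a statement of the authors' implementation.

```python
# Condensed sketch of the DLPS filtering and weighting flow (Steps 3-4).
# Resource dicts and the ranking direction are illustrative assumptions.

def dlps_filter_and_weigh(resources, required_nodes):
    candidates = []
    for r in resources:
        if not r["suitable"]:                     # Step 3(2): adaptive filtering
            continue
        if r["free_nodes"] >= required_nodes:     # Step 4: enough free nodes
            w = required_nodes * r["capability"] / r["free_nodes"]  # Definition 4
            candidates.append((w, r["name"]))
    # Rank candidates by weight, best first (assumed: higher = preferred).
    return [name for w, name in sorted(candidates, reverse=True)]

resources = [
    {"name": "nacona",  "suitable": True,  "free_nodes": 8,  "capability": 2.9},
    {"name": "opteron", "suitable": True,  "free_nodes": 8,  "capability": 2.1},
    {"name": "formosa", "suitable": False, "free_nodes": 11, "capability": 3.4},
]
dlps_filter_and_weigh(resources, 4)   # ['nacona', 'opteron']
```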
Definition 4: When the free nodes fulfill the required nodes, the designed weighted function is

weight_k = R_n × M_capability / f_n

where R_n is the number of required nodes, f_n the number of free nodes, and M_capability the capability of each computing resource.

Step 5: If the current free nodes are not sufficient, DLPS calls the dynamic loading prediction function with two methods to calculate Weight_c (defined in Definition 7): the EstBacklog and minimum Job Expansion Factor methods.

Definition 5: EstBacklog is the estimated backlog of queued work in hours. Its general form is shown in equation (2):

EBL*_i = (QueuePS × CPUAccuracy / TotalJobsCompleted) × (TotalProcHours × 3600 × AvailableProcHours / DedicatedProcHours)   (2)

EBL*_i is the i-th EstBacklog time. QueuePS is the idle time of queued jobs, CPUAccuracy is the actual run time of a job, TotalJobsCompleted is the number of jobs completed, TotalProcHours is the total number of proc-hours required to complete running jobs, AvailableProcHours is the total proc-hours available to the scheduler, and DedicatedProcHours is the total proc-hours made available to jobs. Some of these values come from the historical statistics of the queuing system loading, and the others from the real-time queuing situation. The output is divided into two categories, Running and Completed. The Running statistics include information about jobs that are currently running; the Completed statistics are compiled using historical information from both running and completed jobs. Hence, EBL*_i can forecast the backlog of each computing site from this information.

Definition 6: The job expansion factor subcomponent has an effect similar to the queue time factor but favors shorter jobs based on their requested wallclock run time. In its canonical form, the job expansion factor metric is calculated from local queuing system information as described in equation (3):

JEF_i = (QueuedTime + RunTime) / WallClockLimit   (3)

Definition 7: After obtaining the EstBacklog and the job expansion factor, the Weight_c metric is calculated by equation (4):

Weight_c = λ × JEF_i / Σ_{i=1..n} JEF_i + (1 - λ) × EBL*_i / Σ_{i=1..n} EBL*_i   (4)

where λ is a system-tuned parameter obtained from numerous trials. Since the EstBacklog generally better reflects the dynamic situation of the queuing system, a higher λ value is normally used.

Consequently, Step 6: Calculate the minimum total delivery or response time.

4 Performance Evaluation

4.1 Experimental Environment

In order to test the efficiency of the presented GRB with the developed scheduling algorithm, we execute Computational Fluid Dynamics (CFD) Message Passing Interface (MPI) programs on a heterogeneous research testbed: the NCHC testbed, comprising the Nacona, Opteron, and Formosa Extension 2 (Formosa Ext. 2) clusters. The main environment characteristics are summarized in Table 1. We also measure the High-Performance Linpack Benchmark of these clusters: Rmax is the largest performance in Gflop/s achieved for a problem run on a machine, and Rpeak is the theoretical peak performance in Gflop/s of the machine. Hence, according to the information in Table 2, users can choose a higher Rmax value or set criteria on computing power when submitting jobs.

Table 1 Summary Environment Characteristics of NCHC Grid Resources

Resource       | CPU Model                     | Memory (GB) | CPU Speed (MHz) | #CPUs | Nodes | Job Manager
Nacona         | Intel(R) Xeon(TM) CPU 3.20GHz | 4           | 3200            | 16    | 8     | Torque
Opteron        | AMD Opteron(tm) Processor 248 | 4           | 2200            | 16    | 8     | Moab
Formosa Ext. 2 | Intel(R) Xeon(TM) CPU 3.06GHz | 4           | 3060            | 22    | 11    | Maui

Table 2 High-Performance Linpack Benchmark of NCHC Grid Resources

                                 | Nacona Cluster | Opteron Cluster | Formosa Ext. 2 Cluster
Rmax (Gflops)                    | 46.791424      | 34.08           | 75.447
Rpeak (Gflops)                   | 102            | 70              | 134.64
Number of CPUs                   | 16             | 16              | 22
Efficiency of CPU (Gflops/CPU)   | 2.924          | 2.13            | 3.429
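Definitions 4, 6, and 7 translate directly into code. The sketch below is illustrative: the EBL*_i values are assumed to be already computed (equation (2) depends on queuing-system statistics not modeled here), and the sample inputs are hypothetical.

```python
# Sketch of the DLPS weighting formulas in Definitions 4, 6 and 7.
# All sample inputs are hypothetical; EBL*_i values are assumed given.

def weight_free(required_nodes, free_nodes, capability):
    # Definition 4: used when enough free nodes are available.
    return required_nodes * capability / free_nodes

def jef(queued_time, run_time, wallclock_limit):
    # Equation (3): job expansion factor, favoring shorter jobs.
    return (queued_time + run_time) / wallclock_limit

def weight_c(jef_i, jefs, ebl_i, ebls, lam=0.7):
    # Equation (4): blend of the normalized JEF and EstBacklog terms;
    # lam is the system-tuned lambda from Definition 7.
    return lam * jef_i / sum(jefs) + (1 - lam) * ebl_i / sum(ebls)

weight_free(4, 8, 2.0)                               # 1.0
jef(10, 20, 60)                                      # 0.5
weight_c(1.0, [1.0, 1.0], 2.0, [2.0, 2.0], lam=0.5)  # 0.5
```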
Experimental Scenario
The preliminaries of experiment are needed to set up, including the start time of jobs, the convergence of MPI matrix, and the number of required computing CPUs. The above preliminaries are generated by normal distribution. Therefore, we generate several execution MPI programs with 2, 4, and 8 CPUs randomly. The evaluation also has been performed on three clusters with three experiment models, including 4096*4096 matrix which required 2 CPUs to compute, 4096*4096 matrix which required 4 CPUs to compute, and finally the last model is the 8192*8192 matrix which required 8 CPUs to compute.
Sec
The following carry on experiment is to compare the performance of DLPS job scheduling algorithm with several algorithms, such as Round-Robin, Short-Job-First (SJF), Big-Job-First (BJF), and First-Come-First-Serve (FCFS). We submitted testing jobs which were generated randomly with the synthetic models as the FIG. 4 is shown. The vertical axle is the value of min makespan (seconds) and the horizontal axle is the number of jobs. The makespan of presented GRB with DLPS job scheduling algorithm is notably less than other algorithms; especially the huge job numbers are submitted. Therefore, the objective function of DLPS approaches the minimized makespan. The dynamic loading prediction characteristic of presented GRB is proved be better under this experiment.
MakespanA lg orithm - MakespanDLPS Makespan A lg orithm
(5)
When the small numbers of jobs are submitted, the efficiency of DLPS may be worse than other algorithms, especially for SJF and FCFS. This situation is reasonable very well, because small jobs are easy consumed by SJF and FCFS. When the number of jobs is increasing, the developed DLPS is absolutely better than SJF and FCFS, because the notable drawback of SJF and FCFS is happened, which the large numbers of jobs are starvation as the FIG. 5 and FIG. 8. Comprehensively discussed the above efficiency figures, the best efficiency of DLPS is occurred extreme full usage of each cluster. 60 40 20 0 -20
1
6
11
16
21
26
31
36
41
-40 # of Jobs DLPS and SJF
FIG. 5 Compare the Efficiency of DLPS with SJF 60 50 40 30 20 10 0 -10 1
900 800 700 600 500 400 300 200 100 0
´ 100%
The equation (5) means each algorithm is compared with DLPS. If the efficiency is positive, it means DLPS more efficient. Otherwise, the DLPS is non-efficient.
6
11
16
21
26
31
36
41
# ofJob DLPS and Round-Robin
FIG. 6 Compare the Efficiency of DLPS with RoundRobin DLPS
100
Round-Robin 1
4
7 10 13 16 19 22 25 28 31 34 37 40 43 # of Jobs
SJF BJF FCFS
FIG. 4 Compare Makespan of DLPS with Other Algorithms Finally, the last experiment is discussing about the efficiency of all algorithms (defined at Definition 8) in FIG. 5, FIG. 6, FIG. 7, and FIG. 8.
Efficiency (%)
4.2
75.447 134.64
Definition 8: The efficiency of all algorithms is measured by the following equation, as in (5) :
[Table: High-Performance Linpack Benchmark results, Rmax (Gflops) and Rpeak (Gflops), for the Nacona, Formosa, and Opteron Ext.2 clusters.]
FIG. 7 Compare the Efficiency of DLPS with BJF
Int'l Conf. Grid Computing and Applications | GCA'08 |
[5] I. Foster and C. Kesselman, The Grid 2: Blueprint for a New Computing Infrastructure, Morgan Kaufmann, San Francisco, CA, 2004 .
FIG. 8 Compare the Efficiency of DLPS with FCFS
5 Conclusion and Future Work
The Grid Resource Broker takes a step further in the direction of establishing virtual computing sites for the Computing Grid service. The presented GRB can satisfy users' requirements, including the hardware specification, specific software, and the High-Performance Linpack Benchmark results, and can then automatically select the most appropriate physical computing resource. With the automatic scheduling and dynamic loading prediction provided by the Scheduling Module, Grid users are no longer required to select the execution resource for a computing job. Instead, the Grid Resource Broker provides an automatic selection mechanism which integrates both static and dynamic resource information to meet the demands of Grid jobs. According to the previous experiments, the dynamic loading prediction job scheduling achieves better efficiency and performance than the other algorithms, especially when large numbers of jobs are submitted to the Grid computing sites. Ultimately, we obtain an important property: the algorithm is well suited to handling large amounts of jobs in a Grid computing environment.
6 References
[1] R. Al-Khannak and B. Bitzer, "Load Balancing for Distributed and Integrated Power Systems using Grid Computing," International Conference on Clean Electrical Power (ICCEP), 22-26 May 2007, pp. 123-127.
[2] http://www.gridway.org/
[3] http://www.globus.org/
[4] V. Hamscher, U. Schwiegelshohn, A. Streit, and R. Yahyapour, "Evaluation of Job-Scheduling Strategies for Grid Computing," Proceedings of the 1st IEEE/ACM International Workshop on Grid Computing (Grid 2000) at the 7th International Conference on High Performance Computing (HiPC-2000), Bangalore, India, LNCS 1971, 2000, pp. 191-202.
[6] Yi-Lun Pan, Yuehching Lee, and Fan Wu, "Job Scheduling of Savant for Grid Computing on RFID EPC Network," IEEE International Conference on Services Computing (SCC), July 2005, pp. 75-84.
[7] J. M. Alonso, V. Hernandez, and G. Molto, "Towards On-Demand Ubiquitous Metascheduling on Computational Grids," 15th Euromicro Conference on Parallel, Distributed and Network-based Processing (PDP), February 2007, p. 5.
[8] J. Schopf, "A General Architecture for Scheduling on the Grid," Journal of Parallel and Distributed Computing, special issue, April 2002, p. 17.
[9] A. Othman, P. Dew, K. Djemame, and I. Gourlay, "Toward an Interactive Grid Adaptive Resource Broker," Proceedings of the UK e-Science All Hands Meeting, Nottingham, UK, September 2003, p. 4.
[10] M. Grajcar, "Strengths and Weaknesses of Genetic List Scheduling for Heterogeneous Systems," Proceedings of the 2001 International Conference on Application of Concurrency to System Design, 25-29 June 2001, pp. 123-132.
[11] Barry G. Lawson and E. Smirni, "Multiple-queue Backfilling Scheduling with Priorities and Reservations for Parallel Systems," ACM SIGMETRICS Performance Evaluation Review, vol. 29, issue 4, March 2002, pp. 40-47.
[12] N. Fujimoto and K. Hagihara, "A Comparison among Grid Scheduling Algorithms for Independent Coarse-Grained Tasks," Symposium on Applications and the Internet - Workshops, Tokyo, Japan, 26-30 January 2004, p. 674.
[13] N. Fujimoto and K. Hagihara, "Near-optimal Dynamic Task Scheduling of Independent Coarse-grained Tasks onto a Computational Grid," 32nd Annual International Conference on Parallel Processing (ICPP-03), 2003, pp. 107-131.
[14] P. Lampsas, T. Loukopoulos, F. Dimopoulos, and M. Athanasiou, "Scheduling Independent Tasks in Heterogeneous Environments under Communication Constraints," 7th International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT), 4-7 December 2006, Taipei, Taiwan.
Resource Consumption in Heterogeneous Environments

Niko Zenker
Business Informatics, Department of Technical and Business Information Systems
Otto-von-Guericke Universität Magdeburg
P.O. Box 4120, D-39016 Magdeburg, Germany
+49 (0)391 67 18012
[email protected] − magdeburg.de

Martin Kunz, Steffen Mencke
Software Engineering, Department of Distributed Systems
Otto-von-Guericke Universität Magdeburg
P.O. Box 4120, D-39016 Magdeburg, Germany
+49 (0)391 67 12662
[makunz, mencke]@ivs.cs.uni-magdeburg.de
Abstract—Depending on the server, services need different resources. This paper tries to identify measurable resources and the dimensions these resources need to have. Once they are measurable, a guideline to measure and evaluate these services is given.

Keywords: Resources, Heterogeneous Environment
I. INTRODUCTION
The international discussion about global warming and CO2 reduction has led to a trend called GreenIT. Each individual computer center can contribute to this trend, but to do so it is necessary to measure the resource consumption of the computer center. The sum of all the services executed in the computer center is responsible for the overall resource consumption. It is therefore necessary to take a closer look at each individual service and its resource consumption. This helps to redesign the computer center into a resource-saving environment.

How much CO2 is consumed by a Google search? Several figures are discussed on the internet. Let us assume that the overall consumption of the data centers, the human resources necessary for the operation of these data centers, the network infrastructure and of course the individual computer used for the Google search sums up to 5 grams1 of CO2. Once the user hits the "Search" button, the Google servers start working and present a result within seconds. Given the individual bandwidth of each user, it is probably not necessary to get the result as fast as possible. Performing the search on a "slower" node2 in the data center can reduce the overall CO2 emission. The result is presented to the user and, due to the lower bandwidth, the different search method is not even noticed. Many other scenarios are imaginable where a different node can reduce the overall CO2 consumption of the data center. But for all these scenarios it is necessary to measure the available resources. This paper breaks the data center down to a place where different services are executed. The authors
1 Due to insufficient proof of the numbers presented by http://blogs.sun.com/rolfk/entry/your co2 footprint when using a smaller amount is assumed
2 Assuming that a slower node uses less power and therefore causes less CO2
assume that each node in the data center is able to execute all services, so that different results are comparable. The paper is structured as follows. At the beginning, different resources are defined and an effort to describe these resources in standardized terms is shown. An approach to measure the resources is presented. Once measured, all the values have to be summarized for the whole chain of executed services; therefore a formula to create a resource-concerned model is motivated. Finally, the outlook presents ideas and projects where the service measurement is needed, and the conclusion ends the paper. II. RESOURCES The term resource is used throughout every field of the scientific community. The W3C sees resources as entities3, but gives no further description. Following the intentions of the Resource Description Framework presented by the W3C, resources are websites or, in general, information on the internet. In the same manner, the Uniform Resource Identifier (URI) is just a description of the place where a single website can be found. An economist describes resources as current assets, a sociologist identifies the social status of a person as a resource, and a psychologist enriches the term resource with the social skills and talents of a patient. For this paper a resource is defined as a physical entity that is measurable. In addition, only resources in a data center are taken into consideration; human resources are not considered. Before the measurement can start, it is necessary to define the dimension of each resource. Current attempts to compare different systems and environments, like [ZBP06] and [BBS07], deliver only limited information such as CPU time (measured in MIPS) or disk performance (measured in MB/s). In particular, [BBS07] suggests an overall resource classification for IT, making it possible to compare different environments.
The authors assume that a multi-dimensional index filled with different measurable values is able to compare these environments. Therefore each dimension is described with 3 http://www.w3.org/TR/rdf-mt/
multiple values. The described dimension points can be used as a basic structure for measurement, but also as a base for comparing a single service between different servers. A non-exhaustive list of values for that index is shown below:

RAM
  String Name = "RAM"
  String Type = "RDRAM" | "SDRAM"
  Int Size = 1024
  String Unit = "B" | "MB" | "GB"

HD
  String Name = "HD"
  Int Size = 30
  String Unit = "MB" | "GB"
  Int Partition = 1
  String PartitionType = "NTFS" | "EXT3"

CPU
  String Name = "CPU"
  String Type = "Pentium4" | "AMD64-X2"
  Float MHZ = 4.2
  Int Flops = 54638255
  Int NumberOfCores = 2
  Int CurrentCore = 0

After identifying the base-measures (the single values derived from each dimension), the goal should be to create a combined metric, or a set of metrics. This is especially important when this resource metric is used to compare and evaluate services between different server environments. Such an evaluation also demands thresholds and quality models for an analysis of the measured values. A single metric is much easier to compare than several different values derived from the single dimensions. Nevertheless, it is very important to save all dimensional values in order to take a closer look at individual services if necessary. For the combined metric it is important to assess the base-measures as to their priority for the metric. Such an assessment should be transparent, to keep the information entropy and to avoid errors due to wrong priority weighting. Furthermore, research has to take a closer look at a run-time based metric with elements from the list above. It may be possible that there is also a correlation between source code, design, or structural description and values derived from a metric. Performance properties are measured and compared in [RKSD07]. This approach can deliver good results, especially for the first way of measurement presented later in this paper. Besides the definition of metrics and indicators, another important issue is the evaluation and analysis of the measured values.
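The multi-dimensional index described above could be sketched as simple records, one per dimension. The `Dimension` type and the concrete field values below are assumptions chosen to mirror the illustrative list, not a normative schema.

```python
# Hypothetical sketch of the multi-dimensional resource index:
# each dimension keeps its name plus a dictionary of base-measures.
from dataclasses import dataclass


@dataclass
class Dimension:
    name: str        # e.g. "RAM", "HD", "CPU"
    attributes: dict  # base-measures of this dimension

# One node's index; values mirror the example list above.
node_index = [
    Dimension("RAM", {"Type": "SDRAM", "Size": 1024, "Unit": "MB"}),
    Dimension("HD", {"Size": 30, "Unit": "GB", "Partition": 1, "PartitionType": "EXT3"}),
    Dimension("CPU", {"Type": "AMD64-X2", "MHZ": 4.2, "NumberOfCores": 2}),
]

# All per-dimension values stay available for detailed inspection,
# while a combined metric can later be derived from them.
print([d.name for d in node_index])  # ['RAM', 'HD', 'CPU']
```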
Therefore an adaptation of the web service measurement presented in [Sch04] is proposed. The values are measured in a distinct time frame, so a diagram of all values can be created; the quality of the service is directly viewable, and actions on a service, like restarting or rearranging it, can take place. Figure 1 gives an example of service measurement over this distinct time frame. In a service-oriented architecture not every service is utilized in the same way as others. The definition of the time frame
Fig. 1. Diagram of Ping and Invocation Times
is difficult, because it needs a detailed look at each service to determine its utilization span. Some services are used every other minute (e.g. sending a document to a printer), but some are utilized just once a year (e.g. the creation of the annual financial statements). Both services are vital to the profit of the company, but the second is only necessary once a year. The time frame for the measurement has to be adapted to get an informative result, useful for evaluation and comparison between different servers. Using techniques like BPEL it is easy to define certain services and transactions in a process. The resources are directly connected to each step in the process. After the knowledge about the resources is extracted, a combination of all resources used in the process is possible. This means that at design time the estimated amount of resources is predictable. Using the resource information at run-time, several scenarios for rearranging services are imaginable. In this manner the resource information can support virtualisation, in a way that the virtual environment can be configured precisely according to the resource consumption of the service that will run in that environment. III. MEASUREMENT OF RESOURCES Identifying a server in a computer center is an easy task. The same applies to measuring the CPU load on this server. The result is a number showing an administrator how well or poorly the server is utilized. It is also easy to look at the capacity of the hard disks and to predict how long the remaining space will last before maintenance has to take place. Breaking it down to an individual service requires a more complex measurement scenario. This derives not only from the more detailed view on the services; it is also motivated by the commingling of the services themselves. This means that a service is built up from sub-services, each with special resource demands needed for its execution.
In order to evaluate the whole service, all these values have to be considered. The measurement itself can take place at the underlying operating system, but a measurement service needs to know
which services should be measured. Values will be collected according to the demands of the metric. Each individual value will then be integrated into the measurement result provided to the measurement service. Working with interfaces makes it possible to implement different measurement methods, e.g. for different operating systems. The measurement service is currently under development, therefore no explicit operating scheme can be presented yet. The class diagram presented in figure 2 is based on [Mar03] and shows a class diagram for an ontology for software metrics and indicators. Martin identifies in this class diagram objects that are measurable, but it also defines how the objects will be measured, who is measuring them and, of course, the place where the results are stored. The adaptation from software metrics toward hardware metrics is not yet done; for now the authors assume that a transformation of all hardware metrics is possible. Before the creation of a metric it is necessary to define three stages of service measurement. Each stage has a different priority. The first is a measurement of the service with randomly selected values, done by the developer of the service. The second is a measurement in a special testing mode at runtime, and the third is an outside view of the service at runtime. For the quality of Web Services these measurement methods are already discussed and motivated, and this approach can be adopted for the measurement of resources. [SB03] describes all three methods. [Rud05] describes the testing mode and the outside view in greater detail. In [RKSD07] the simulation approach is described. All methods produce values suitable for an estimate of resource consumption, but in order to generate reliable values all three methods have to be combined in a measurement procedure. This procedure uses the values derived from the initial simulation done by the developer. In this way a starting value for the service is created.
Once installed in the desired computer environment, the test mode will produce new values suitable for a more distinguished proposition. The third method will then refine the simulated and tested values. This ensures an acceptable measurement with proper values. The simulation approach is necessary to have a starting value for a new environment. A developer can influence the results of the measurement: using special values for the simulation phase produces a first estimate of resource consumption, with little reliability for an environment at runtime. Nevertheless these values are critical for a method to compare actual resource consumption with the maximum consumption of a single service. The test mode in the distinguished environment refines the initial values with a detailed immersion of each individual node existing in the computer center. The test mode will enumerate all computers in the environment suitable for the service. The service is computed on each computer under different circumstances, like a heavy load on the CPU or a congested network interface. In the end a detailed map of each node and its resource consumption for the service is created.
TABLE I
PRIORITIES FOR THE THREE STAGES OF SERVICE MEASUREMENT

Stage:    I     II    III
Priority: 0.33  0.5   1
This information is also necessary for an estimation method of resource consumption and current load. The priority, and for that matter the importance, of these values is higher than that of the values recovered in the simulation phase. Using test values selected by the developer, both for the first stage and for the special test mode in the actual data center, can be dangerous to the output of the measurement process. Well-chosen test values can create a measurement result that flatters the service4. Such values are of course not desired, and they badly influence the overall outcome of the measurement process. Therefore another stage of measurement needs to be considered: the authors demand a stage that measures values while the service is in productive mode at runtime. The measurement itself is either a collection of the needed information from the service itself or an outside view on the service. The outside view is more complex, because the influences of other services have to be considered. These influences are welcome, in order to create a complex view on resource consumption correlating with other services. Due to constant changes in the environment, each value creates an immersion for the resource measurement useful in an adaptive computer center. In figure 3 all three stages are shown. The values derived from all three stages have to be part of the metric. As described above, the priority of each stage is not the same. Finding the right coefficient for each stage is a hard task; depending on the service these coefficients can vary, and practical research has to refine them. For now the authors assume that the highest priority has to be appointed to values derived from stage III, and that values from stage II are better than values from stage I. Therefore the priorities (prior) shown in table I are assumed. For the overall value used in the metric, the sum of all values has to be created, with each value multiplied by its priority.
$$\mathit{Value}_{\mathrm{Service}} = \sum_{\mathrm{Stage}} \mathit{prior} \cdot \mathit{Value}_{\mathrm{Stage}}$$

This value is a combination of three values measured in different environments. Each environment is different from the others; therefore heterogeneous systems influence the metric for a service. All three stages should be used for the measurement, but it is still necessary to integrate the time and other circumstances into the values. A service executed on a server under heavy load reacts differently, and the performance metric is not as good as
4 A service may be designed to create a good output with regard to performance. This can be done by special switches inside the source code.
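A minimal sketch of this combined metric, assuming the stage priorities of Table I. The per-stage measured values below are invented for illustration.

```python
# Stage priorities from Table I (I = simulation, II = test mode,
# III = runtime observation).
STAGE_PRIORITY = {"I": 0.33, "II": 0.5, "III": 1.0}


def service_value(stage_values: dict) -> float:
    """Value_Service = sum over stages of prior * Value_Stage."""
    return sum(STAGE_PRIORITY[stage] * value
               for stage, value in stage_values.items())


# With equal measured values, stage III dominates the combined metric.
print(round(service_value({"I": 10.0, "II": 10.0, "III": 10.0}), 2))  # 18.3
```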
Fig. 2. A UML diagram of the ontology for software (and web) metrics and indicators [Mar03]
on a server with no load. Especially with the third stage of measurement, these circumstances are accessible. IV. AGGREGATION OF SERVICES In a heterogeneous environment many services are combined to fulfill the desired goal of the computer system. In order to measure and estimate all these services, their aggregation has to be considered. Especially in a service-oriented environment many heterogeneous services are executed, and each service contains sub-services necessary for its execution. Depending on the orchestration of the (sub-)services, the overall consumption of resources has to be calculated in different ways. Four types of orchestration methods are described in [CDPEV05]; the functions there are used for quality of service parameters, suitable for different projects. These aggregation functions are listed for the different ways of orchestration in table II and table III.
TABLE II
AGGREGATION FUNCTIONS FOR QoS PARAMETERS FOR SEQUENCES AND SWITCHES [CDPEV05]

QoS Attr.        | Sequence                  | Switch
Time (T)         | $\sum_{i=1}^{m} T(t_i)$   | $\sum_{i=1}^{n} pa_i \cdot T(t_i)$
Cost (C)         | $\sum_{i=1}^{m} C(t_i)$   | $\sum_{i=1}^{n} pa_i \cdot C(t_i)$
Availability (A) | $\prod_{i=1}^{m} A(t_i)$  | $\sum_{i=1}^{n} pa_i \cdot A(t_i)$
Reliability (R)  | $\prod_{i=1}^{m} R(t_i)$  | $\sum_{i=1}^{n} pa_i \cdot R(t_i)$
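The aggregation functions of tables II and III can be sketched directly. The snippet below implements the Time and Availability rows for illustration; the step durations and availabilities are invented values, not data from any real composition.

```python
# Sketch of the [CDPEV05] aggregation functions for the four
# orchestration patterns, shown for the Time and Availability attributes.
# Cost follows the same sum / weighted-sum shape as Time; Reliability
# follows the same product shape as Availability.
from math import prod


def time_sequence(times):            # sequence: sum of step times
    return sum(times)


def time_switch(probs, times):       # switch: probability-weighted branches
    return sum(p * t for p, t in zip(probs, times))


def time_flow(times):                # flow: parallel branches, slowest wins
    return max(times)


def time_loop(k, t):                 # loop: k iterations of the same task
    return k * t


def availability_sequence(avails):   # product of step availabilities
    return prod(avails)


print(time_sequence([1.0, 2.0, 3.0]))                 # 6.0
print(time_flow([1.0, 2.0, 3.0]))                     # 3.0
print(time_loop(4, 2.0))                              # 8.0
print(round(availability_sequence([0.9, 0.9]), 2))    # 0.81
```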
It is of course possible that each individual service is executed on a different server. The orchestration is then influenced by the delay the network produces. Therefore not only
Fig. 3. Three stages of service measurement
TABLE III
AGGREGATION FUNCTIONS FOR QoS PARAMETERS FOR FLOWS AND LOOPS [CDPEV05]

QoS Attr.        | Flow                               | Loop
Time (T)         | $\max\{T(t_i)\}_{i \in \{1..p\}}$  | $k \cdot T(t)$
Cost (C)         | $\sum_{i=1}^{p} C(t_i)$            | $k \cdot C(t)$
Availability (A) | $\prod_{i=1}^{p} A(t_i)$           | $A(t)^k$
Reliability (R)  | $\prod_{i=1}^{p} R(t_i)$           | $R(t)^k$
one value has to be taken into account as a base-measurement. In the section "Resources" a motivation for a multi-dimensional index was given; the inclusion of the network delay fits exactly this idea. The rearrangement of services is also possible when they are executed in a virtual environment. The performance of a service is then of course influenced by the virtual environment and its resource consumption. In [MEN05] concepts, problems, and metrics for virtualization are discussed. V. OUTLOOK The expressiveness of the metric, motivated by figure 2, is still unclear. Further research will extend the metric with more results and, of course, the correlating influences. A current project is developing and implementing the infrastructure to measure services within a SOA. The results of the measurement will be published to prove that there is a difference in resource consumption depending on the server used for execution. For an automation of service resource management, the described approach contains a semantic description of the defined metrics. Ontologies have long possessed the capability to describe information in a machine-accessible manner. Existing solutions in this area, for example the ontology for object-oriented metrics presented in [KKDS06], form a framework for an ontology about resource metrics for services. Once the automated resource management exists, a framework can be created that works in a heterogeneous environment. This environment is equipped with different servers, most likely running different operating systems. All services can be executed within the environment, and the framework can rearrange these services to fulfill the desired outcome. One of these outcomes is CO2 reduction, but it is also imaginable to rearrange the services according to other demands like cost saving, better performance, or secure execution according to the demands of the customer. This leads toward an automatic data center as motivated in [BEME05]. Furthermore, the desired resource measurement should be implemented in service development and maintenance process standards to ensure low resource consumption throughout the service life cycle. An adaptation of the measurement service to international standards like CMMI or Six Sigma is desired. The desired framework is flexible for all current operating systems, and no expensive hardware has to be acquired. The usage of the "old" equipment saves costs, and due to a better distribution of services the overall performance of the system will rise.
REFERENCES
[BBS07] R. Brandl, M. Bichler, and M. Ströbel. Cost Accounting for Shared IT Infrastructures - Estimating Resource Utilization in Distributed IT Architectures. Wirtschaftsinformatik, 49(2):83-94, 2007.
[BEME05] M. Bennani and D. Menasce. Resource Allocation for Autonomic Data Centers using Analytic Performance Models. Proceedings of the 2005 IEEE International Conference on Autonomic Computing, 2005.
[CDPEV05] G. Canfora, M. Di Penta, R. Esposito, and M.L. Villani. An approach for QoS-aware service composition based on genetic algorithms. Proceedings of the 2005 conference on Genetic and evolutionary computation, pages 1069-1075, 2005.
[KKDS06] M. Kunz, S. Kernchen, R. Dumke, and A. Schmietendorf. Ontology-based web-service for object-oriented metrics. Proceedings of the International Workshop on Software Measurement and DASMA Software Metrik Kongress, pages 99-106, 2006.
[Mar03] Maria de los Angeles Martin and Luis Olsina. Towards an Ontology for Software Metrics and Indicators as the Foundation for a Cataloging Web System. Proceedings of the First Latin American Web Congress (LA-Web 2003), pages 103-113, 2003.
[MEN05] D.A. Menasce. Virtualization: Concepts, applications, and performance modeling. Proceedings of the 31st Int. Computer Measurement Group Conf., pages 407-414, 2005.
[RKSD07] D. Rud, M. Kunz, A. Schmietendorf, and R. Dumke. Performance Analysis in WS-BPEL-Based Infrastructures. Proceedings of the 23rd Annual UK Performance Engineering Workshop (UKPEW 2007), pages 130-141, 2007.
[Rud05] D. Rud. Qualität von Web Services. VDM Verlag, 2005.
[SB03] S. Battle. Boxes: black, white, grey and glass box views of web-services. Technical Report HPL-2003-30, HP Laboratories Bristol, 2003.
[Sch04] A. Schmietendorf, R. Dumke, and D. Rud. Ein Measurement Service zur Bewertung der Qualitätseigenschaften von im Internet angebotenen Web Services. MMB-Mitteilungen Nr. 45, pages 6-16, 2004.
[ZBP06] Rüdiger Zarnekow, Walter Brenner, and Uwe Pilgram. Integrated Information Management - Applying Successful Industrial Concepts in IT. Springer Berlin Heidelberg, 2006.
Experience in testing the Grid based Workload Management System of a LHC experiment

V. Miccio^{a,b,c} (Corresponding/Contact Author), A. Fanfani^{c,d}, D. Spiga^{e,b}, M. Cinquilli^{e}, G. Codispoti^{d}, F. Fanzago^{b,c}, F. Farina^{f,b}, S. Lacaprara^{g}, E. Vaandering^{h}, A. Sciabà^{b}, S. Belforte^{i}

a CERN, BAT.28-1-019, 1211 Geneve 23, [email protected], Office: +41 (0)22 76 77215, Fax: +41 (0)22 76 79330
b CERN, Geneva, Switzerland
c INFN/CNAF, Bologna, Italy
d University Bologna, Italy
e University and INFN, Perugia, Italy
f INFN, Milano-Bicocca, Italy
g INFN, Legnaro, Italy
h FNAL, Batavia, Illinois, USA
i INFN, Trieste, Italy
The computational problem of large-scale distributed collaborative scientific simulation and analysis of experiments is one of the many challenges presented by the construction of the Large Hadron Collider (LHC) at the European Laboratory for Particle Physics (CERN). The main motivation for LHC to use the Grid is that CERN alone can supply only part of the needed computational resources; the rest has to be provided by tens of institutions and sites. The key issue of coordinating and integrating such spread-out resources leads to the building of the largest computing Grid on the planet. Within such a complex infrastructure, testing activities represent one of the major critical factors in deploying the whole system. Here we focus on the computing model of the Compact Muon Solenoid (CMS) experiment, one of the four main experiments that will run on the LHC, and give an account of our testing experience concerning the analysis job workflow.

Keywords: Grid Computing, Distributed Computing, Grid Application Deployment, High Energy Physics Computing
Conference: GCA'08 - The 2008 International Conference on Grid Computing and Applications
1. Introduction
CMS is one of the four particle physics experiments that will collect data at the LHC, starting in 2008 at CERN, and one of the two largest collaborations. The outstanding amount of produced data - about 2 PB per year - should be available for analysis to world-wide distributed physicists.
The CMS computing system itself relies on geographically distributed resources, interconnected via high-throughput networks and controlled by means of Grid services and toolkits whose building blocks are provided by the Worldwide LHC Computing Grid (WLCG, [1]). CMS builds application layers able to interface with several different Grid flavors (LCG-2, Grid-3, EGEE, NorduGrid, OSG).

Figure 1: CMS computing model and its tiers hierarchy

A WLCG-enabled hierarchy of computing tiers is depicted in the CMS computing model [2], and their roles, required functionality and responsibilities are specified (see Figure 1). CERN constitutes the so-called Tier-0 center: here data from the detector will be collected, the first processing and storage of the data will take place, and raw/reconstructed data will be transferred to Tier-1 centers. Besides the Tier-0 center, CERN will also host the CMS Analysis Facility (CAF), which will have access to the full raw data and will be focused on latency-critical detector, trigger and calibration activities. It will also provide some CMS central services, like the storage of conditions data and calibrations. There are then two levels of tiers for quite different purposes: organized mass data processing and custodial storage are performed at about seven Tier-1 centers located at large regional computing centres, while a larger number of Tier-2 sites are dedicated to user computing. Concerning custodial storage, Tier-1 centers receive simulated data produced within the Tier-2 centers, and will receive reconstructed data together with the corresponding raw data from the Tier-0. Regarding organized mass data processing activities, Tier-1 centers will be in charge of calibration, re-processing, data skimming and other organized intensive analysis tasks. The Tier-2 centres are essentially devoted to the production of simulated data and to the user distributed analysis of data imported from Tier-1 centers. In this sense, Tier-2 activities are much more "chaotic" with respect to the higher tiers: analyses are not centrally planned, and resource utilization decisions are closer to the end users, who can leverage a wider set of resources. The claim for high flexibility and robustness of the workflow management thus leads to an extremely compound infrastructure, which inevitably entails a large effort in the testing and deployment phase.

2. Workload Management System for Analysis
The Workload and Data Management Systems have been designed to make use of the existing Grid services as much as possible, building CMS-specific services on top of them. In particular, the CMS Workload Management System (WMS) relies on the Grid WMS provided by the WLCG project for job submission and scheduling onto resources, according to the CMS Virtual Organization (VO) policy and priorities. Using the Grid Information System, it knows the available resources and their usage. It performs matchmaking to determine the best site to run a job and submits it to the Computing Element
(CE) of the selected site, which in turn schedules it in the local batch system. The Worker Node (WN) machines where jobs run have POSIX-IO-like access to the data stored in the local Storage Element (SE). On top of the Grid WMS, CMS has built the CMS Remote Analysis Builder (CRAB, [3-5]), an advanced client-server architecture for the workflow management of CMS-software (CMSSW) analysis jobs. It is based on independent components communicating through an asynchronous and persistent message service, which can provide the required extreme flexibility. The server is placed between the user and the Grid to perform a set of actions on the user's behalf, through a delegation service which handles the users' proxy certificates. The main goal of such an intermediate server is to automate the whole distributed analysis workflow as much as possible, and to improve the scalability of the system in order to fulfill the target rate of more than 100 thousand jobs handled per day when the LHC is fully operational. In any case, the client-server implementation is transparent to the end users: from this point of view CRAB simply looks like a dedicated front end for CMSSW analysis jobs. It enables the user to process datasets and Monte Carlo (MC) samples, taking care of CMSSW-specific features and requirements; it provides the user with a simple and easy-to-use interface, hides the direct interaction with the Grid, and reduces the user load by automating most of the actions and looking after error handling and resubmissions. CRAB's own functionalities and its integrated interaction with the underlying Grid environment need a dedicated test and deployment activity.
3. Test Experiences
The CMS experiment is getting ready for real LHC data handling by building and testing its Computing Model, through daily experience on production-quality operations as well as in challenges of increasing complexity. The capability to simultaneously address both these complex tasks relies on the quality of the developed tools and related know-how, and on the capability to manage switches between testbed-like and production-like infrastructures. Such intermediate activities between development and production are crucial, given the large number of different services, running on different layers and using different technologies, within the multiple operational scenarios that will be in place when the LHC works at full rate. Past experience [6,7] has shown that this kind of activity is one of the biggest challenges for such a system. The main issues concern both functionality tests and the scalability of the whole infrastructure, which must maintain the needed performance and robustness up to the expected full job-flow rates and under realistic usage conditions. Workflow tests were performed both at the Grid level and at the more CMS-specific CRAB level.

3.1. Grid Workflow
WMS testing was aimed at probing the load limits of the service, from both the hardware and the software point of view. A quasi-automated test suite was set up to steadily submit jobs at an adjustable rate. A few tightly controlled instances of the WMS were continuously tested, patched and re-deployed, with a tight feedback loop between testers and developers. The tests involved the submission of large numbers of jobs to the WLCG production infrastructure, using both simple "hello world" scripts and real experiment applications. Problems encountered were reported to the developers, who provided bug fixes, in an iterative process. Acceptance criteria were defined to assess the compliance of the WMS with the requirements of the CMS and ATLAS experiments and of WLCG operations: uninterrupted submission of at least 10^4 jobs per day for a period of at least five days; no service restart required during this period; and no degradation in performance at the end of this period, with a number of stale jobs below 1% of the total at the end of the test. A successful test on the gLite 3.1 middleware fully met the acceptance criteria: 115,000 jobs were submitted over 7 days (about 16,000 jobs/day), with 320 (0.3%) jobs aborted due to WMS problems and a negligible delay between job submission and arrival at the CE. Further tests proved that the gLite WMS is able to sustain a higher rate of 20,000 jobs/day for a week, with no degradation in performance and no stale jobs (see Figure 2).

Figure 2: The Grid WMS is capable of sustaining a rate of about 20 kjobs/day for several days

The performance and reliability of the CE were tested by submitting at a well-specified rate (10k jobs per day), so as to always keep at least 5k jobs active in the CE, according to the criteria defined for the CE acceptance tests. Figure 3 shows the results of a first 5-day non-stop submission test to verify the acceptance criteria: about 60k jobs were submitted; only 119 jobs aborted (< 0.2%), and not due to a CE error; no performance degradation was observed and the CE service was never restarted.

Figure 3: 5-day non-stop 10 kjobs/day submission on a Grid CE, with 5k jobs always active
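The acceptance criteria quoted above lend themselves to a mechanical check. The following sketch encodes them; the inputs (a per-day submission count and a stale-job count) are assumptions, not the actual test-suite log format:

```python
# Hedged sketch: checking a run against the WMS acceptance criteria above
# (>= 10^4 jobs/day for >= 5 days, no more than 1% stale jobs).
def meets_acceptance(jobs_per_day, stale_jobs, total_jobs,
                     min_rate=10_000, min_days=5, max_stale_frac=0.01):
    """At least min_rate jobs/day sustained for min_days days, stale jobs < 1%."""
    sustained = len(jobs_per_day) >= min_days and all(r >= min_rate for r in jobs_per_day)
    stale_ok = total_jobs > 0 and stale_jobs / total_jobs < max_stale_frac
    return sustained and stale_ok

# The gLite 3.1 run: 115,000 jobs over 7 days (about 16,000 jobs/day), no stale jobs.
print(meets_acceptance([16_000] * 7, stale_jobs=0, total_jobs=115_000))  # True
```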
3.2. CRAB Workflow
Testing the CRAB infrastructure is a more structured task, since the goal is to probe not only the sustainability of a high job-submission rate, but also the reliability of the whole system of services and functionalities. For this purpose the so-called JobRobot was developed: an agent- and dropbox-based automated expert system which makes it possible to simulate a massive and complete user analysis activity, from job creation through submission and monitoring up to output retrieval. The goal is to understand and solve race-condition criticalities that only realistic, "chaotic" usage can expose. First exploratory rounds of tests were performed, using a preliminary version of the CRAB server attached to a dedicated Grid WMS machine. Passing through the CRAB server, the JobRobot continuously spread collections of CMSSW jobs over about 30 different sites, using realistic requirements, with a growing submission rate and over a growing number of days. An initial submission rate of 10k jobs per day was quite easily sustained for 4 days, showing no overload both on the Grid
Figure 4: Starting a stress test for the CRAB server workflow at a rate of 18 kjobs/day
WMS side and on the CRAB server side. In a more stress-oriented test, a rate of 18k jobs per day was maintained for the first 24 hours and then raised to 30k jobs per day for the succeeding 24 hours. As Figure 4 shows, at the lower rate jobs complete their workflow (green line) at the same rate at which they are submitted (black line). The high submission rate was then held for even longer (see Figure 5), but the single Grid WMS instance was not able to efficiently handle the overall job flow generated in this way: jobs were dispatched more and more slowly to sites, piling up in the WMS queues and bringing a rapid degradation of its performance and, as a consequence, of the performance of the whole system (e.g. the lowering of the black curve in Figure 5). The expected message is that further increasing the scale requires additional dedicated WMSs. By contrast, the load on the CRAB server machine was very reasonable, proving that a single server can still handle such a high submission rate fairly well. Moreover, these first tests already gave the developers valuable feedback concerning improvements in some CRAB server components. They also show that the server can serve as a further testing instrument for the underlying Grid WMS services, allowing a fine tuning of its configuration parameters in a way much more tailored to CMS-specific use cases.

Figure 5: The bottleneck of having only one Grid WMS

The next steps already planned involve the set-up of the forthcoming major release of the CRAB server: it includes important development upgrades (a client-server re-engineering and a refactoring of the Grid interaction framework) and needs an early test of the overall functionality. A new scale test, with the server pointing at more than one Grid WMS, is then scheduled.

4. Conclusion
The Grid infrastructure is already working at production level, and everyday CMS activity can no longer do without it. So far, testing and integration activities have represented a central part of the work needed to bring the workload management infrastructure to a quality level which lets users take full advantage of it in a production context. Challenges remain, however, in order to be ready for when the LHC is fully operational. This testing process is still ongoing, and the improvements achieved during these months have already had a big impact on the amount of effort required to run the workflow services in production activities.

REFERENCES
1. LHC Computing Grid project: http://lcg.web.cern.ch/LCG/
2. CMS Computing Technical Design Report, CERN-LHCC-2005-023, June 2005.
3. D. Spiga, S. Lacaprara, W. Bacchi, M. Cinquilli, G. Codispoti, M. Corvo, A. Dorigo, F. Fanzago, F. Farina, O. Gutsche, C. Kavka, M. Merlo, L. Servoli, A. Fanfani. CRAB: the CMS distributed analysis tool development and design. Nuclear Physics B - Proceedings Supplements, Hadron Collider Physics Symposium 2007, La Biodola, Isola d'Elba, Italy, 20-26 May 2007, vol. 177-178C, pp. 267-268.
4. A. Fanfani, D. Spiga, S. Lacaprara, W. Bacchi, M. Cinquilli, G. Codispoti, M. Corvo, A. Dorigo, F. Fanzago, F. Farina, M. Merlo, O. Gutsche, L. Servoli, C. Kavka. The CMS Remote Analysis Builder (CRAB). Lecture Notes in Computer Science, High Performance Computing - HiPC 2007, 14th International Conference, Goa, India, 18-21 December 2007, vol. 4873, pp. 580-586.
5. F. Farina, S. Lacaprara, W. Bacchi, M. Cinquilli, G. Codispoti, M. Corvo, A. Dorigo, A. Fanfani, F. Fanzago, O. Gutsche, C. Kavka, M. Merlo, L. Servoli, D. Spiga. Status and evolution of CRAB. PoS Proceedings of Science, XI International Workshop on Advanced Computing and Analysis Techniques in Physics Research (ACAT07), Amsterdam, 23-27 April 2007, pp. ACAT020.
6. A. Sciabà, S. Campana, A. Di Girolamo, E. Lanciotti, N. Magini, P. M. Lorenzo, V. Miccio, R. Santinelli. Testing and integrating the WLCG/EGEE middleware in the LHC computing. International Conference on Computing in High Energy and Nuclear Physics (CHEP07), Victoria BC, Canada, 2-7 September 2007.
7. V. Miccio, S. Campana, A. Sciabà. Experience in testing the gLite workload management system and the CREAM computing element. EGEE'07 International Conference, 1-5 October 2007.
A Rate Based Auction Algorithm for Optimum Resource Allocation using Grouping of Gridlets

G. T. Dhadiwal1, G. P. Bhole1, S. A. Patekar2
1 Computer Technology Department, VJTI, Mumbai-19, India.
2 Vidyalankar Institute of Technology, Mumbai-37, India.
Abstract - The problem of allocating resources to a set of independent subtasks (gridlets) under constraints of time and cost has attracted a great deal of attention in grid environments. This paper proposes criteria for resource allocation based on a resource's rate (cost divided by MIPS) and on grouping of the gridlets. The rate minimizes the cost, while the grouping reduces communication overheads by fully utilizing a resource in one go. A comprehensive balance is thus achieved between cost and time within the framework of the grid economy model. The proposed algorithm is compared with the single-round First-price sealed and Classified algorithms from the literature. The results obtained using the GridSim Toolkit 4.0 demonstrate that the proposed algorithm has merit over them. Keywords: Rate based algorithm, Auction algorithm, Gridlets, GridSim.
1 Introduction and Related Work
Allocating independent subtasks (gridlets) to resources which are geographically distributed, heterogeneous, dynamic, owned by various agencies, and thus of differing costs and capabilities, is one of the key problems in grid environments addressed by various researchers [1-4]. The Grid Classified algorithm [5] and the First-price sealed algorithm [1, 3], which are based on market mechanisms, focus on single-round auctions. The objective of the latter algorithm is to obtain the smallest makespan (the time required to complete the task), but it neglects the user's grade-of-service demand. The Classified algorithm [5] proposes an optimized scheduling algorithm under limitations of time and cost; however, it attempts to complete the task as quickly as possible up to the granularity time (the user's expected completion time), and after it lapses resorts to cost minimization, looking at these aspects independently. The proposed algorithm allocates resources based on rate, i.e. the cost-by-MIPS ratio, and identifies groups of gridlets which are submitted as a single bunch, keeping the allocated resource engaged up to the granularity time and thus reducing communication overhead. The proposed algorithm thereby balances time and cost comprehensively.
2 The Basis of the Proposed Rate Based Algorithm
The Rate based algorithm considers the rate of a resource, i.e. the ratio of the cost of the resource to its MIPS. Resources are sorted in increasing order of rate. From the sorted list, the resource with the least rate that is within the user's budget is selected. For the specified granularity time, the million instructions (MI) that the selected resource can perform are computed. A group of gridlets (assumed to be independent of each other) is then identified such that its computing requirement matches the selected resource, and is allocated to it. By doing this we do not revisit the allocated resource within the granularity time, thereby substantially reducing the communication overhead. The next resource with minimum rate from the sorted list is then considered, and the above procedure is repeated. If during the process all resources are exhausted and some gridlets still remain to be processed, we resort to a trade-off by allocating all the remaining gridlets to the resource with the minimum rate.
2.1 Algorithmic Steps
Let n be the total number of gridlets, together termed a task; MIi the Million Instructions (MI) of gridleti; GranTime the user's expected time to complete processing of all gridlets, in seconds; overhead the communication time for each allocation of a resource to gridlets; budget the user's budget in Indian Rupees (INR) for the task; and m the total number of resources. Rj is resource j, RMIPSj is the number of MI processed by resource j in one second, and RCostj is the cost in INR which the resource charges on a per-second basis.

Step 1: Read resource information
  For each resource Rj, j < m
    Get the following data of the available resource: Rj, RMIPSj, RCostj
  End for

Step 2: Selection of resources based on rate and budget
  For each resource Rj
    Rratej = RCostj / RMIPSj
  End for
  Sort Rratej in increasing order.
  // Find the required MI for the entire task
  For each gridleti
    Req_MIPS = Req_MIPS + MIi
  End for
  While Req_MIPS > 0 and budget > 0
    Let j be the next resource from the sorted list (j = 1, 2, ..., m)
    Req_time = (Req_MIPS / RMIPSj) rounded to the next integer
    If (Req_time <= GranTime and budget > (Req_time * RCostj))
      Select resource j; add j to selected_resource_list
      budget = budget - (Req_time * RCostj)
      Req_MIPS = Req_MIPS - (Req_time * RMIPSj)
    Else if (budget > (GranTime * RCostj))
      Select resource j; add j to selected_resource_list
      budget = budget - (GranTime * RCostj)
      Req_MIPS = Req_MIPS - (GranTime * RMIPSj)
    End if
  End while
  At the end of Step 2, a list of selected resources has been formed.

Step 3: Grouping of gridlets
  For each gridleti, i < n
    Gridlet_sent[i] = false
  End for
  While (all gridlets are not sent)
    Let k be the group number (k = 1, 2, ...)
    If Flag_GranTime_over = false
      Select the next resource j from selected_resource_list
    Else  // granularity time over, hence select the least-rate resource
      Select the first resource (j = 1) from selected_resource_list
    End if
    Total_MIj = RMIPSj * GranTime
    MIofGroupk = 0
    For each gridleti, i < n
      If (Gridlet_sent[i] == false and (Total_MIj > MIofGroupk + MIi))
        MIofGroupk = MIofGroupk + MIi
        Add gridleti to group k
        Gridlet_sent[i] = true
      End if
    End for
    Send group k to resource j
    k = k + 1
    j = j + 1
    // When all gridlets are not executed within the granularity time,
    // the remaining gridlets are allocated to the least-rate resource
    If resource list exhausted
      Flag_GranTime_over = true
    End if
  End while
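The steps above can be sketched as a runnable program. Integer MI and MIPS values are assumed; the resource field names and the returned grouping structure are ours, not GridSim's:

```python
# Runnable sketch of the rate-based allocation above (integer MI/MIPS assumed).
def allocate(gridlets_mi, resources, gran_time, budget):
    """gridlets_mi: MI of each gridlet; resources: dicts with 'name', 'mips'
    (MI per second) and 'cost' (INR per second). Returns {name: [gridlet ids]}."""
    # Step 2: rank resources by rate = cost / MIPS and select within budget.
    ranked = sorted(resources, key=lambda r: r["cost"] / r["mips"])
    selected, req_mi = [], sum(gridlets_mi)
    for r in ranked:
        if req_mi <= 0 or budget <= 0:
            break
        req_time = -(-req_mi // r["mips"])          # ceiling division
        if req_time <= gran_time and budget >= req_time * r["cost"]:
            selected.append(r)
            budget -= req_time * r["cost"]
            req_mi -= req_time * r["mips"]
        elif budget >= gran_time * r["cost"]:
            selected.append(r)
            budget -= gran_time * r["cost"]
            req_mi -= gran_time * r["mips"]
    # Step 3: fill one group per resource, sized to its capacity in gran_time.
    sent = [False] * len(gridlets_mi)
    groups = {r["name"]: [] for r in selected}
    for r in selected:
        cap, used = r["mips"] * gran_time, 0
        for i, mi in enumerate(gridlets_mi):
            if not sent[i] and used + mi <= cap:
                groups[r["name"]].append(i)
                used += mi
                sent[i] = True
    if selected:   # remaining gridlets go to the least-rate resource
        groups[selected[0]["name"]] += [i for i, s in enumerate(sent) if not s]
    return groups

# Task4 from the illustrative example below: 20 gridlets of 200 MI each.
table1 = [{"name": n, "mips": m, "cost": c} for n, m, c in [
    ("R1", 42, 100), ("R2", 180, 200), ("R3", 256, 300), ("R4", 225, 250),
    ("R5", 384, 400), ("R6", 39, 50), ("R7", 66, 60), ("R8", 450, 500)]]
groups = allocate([200] * 20, table1, gran_time=5, budget=7000)
```

With these inputs the sketch selects R7, R5, R2 and R4 in rate order, fills each with a group that fits its granularity-time capacity, and sends the one leftover gridlet back to the least-rate resource.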
3 Illustrative Example - Comparison of the Proposed Algorithm with the Others [1, 5]
The problem is to execute, say, Task1 in a grid environment using the GridSim toolkit 4.0 [6, 7], consisting of 5 gridlets each requiring 200 MI. The communication overhead is 0.2 sec, the budget is 7000 INR and the granularity time is 5 sec. The information regarding the resources used is listed in Table 1.

Table 1: Detailed Information of Resources
Name  MIPS  Operating System  Cost (INR)
R1    42    UNIX              100
R2    180   LINUX             200
R3    256   LINUX             300
R4    225   WINDOWS           250
R5    384   LINUX             400
R6    39    WINDOWS           50
R7    66    LINUX             60
R8    450   WINDOWS           500
Similarly, Task2 to Task8 are considered for execution. The proposed rate based algorithm, the First-price sealed algorithm and the Classified algorithm are employed to solve the problem. The results are tabulated in Table 2 and represented graphically in Fig 1 and Fig 2. Resources are allocated during the execution of each algorithm. Table 3 shows the final allocation of resources to the gridlets for all three algorithms; the grouping of gridlets under the proposed algorithm can also be seen there. For brevity only Task4 is considered.
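As a quick cross-check of this set-up, the rate ordering implied by Table 1 can be computed directly (a sketch; the tuple layout is ours):

```python
# Rate ordering implied by Table 1 (rate = cost / MIPS); the tie among
# R2, R4 and R8 (all 10/9 INR per MIPS) is broken by listing order.
resources = {"R1": (42, 100), "R2": (180, 200), "R3": (256, 300),
             "R4": (225, 250), "R5": (384, 400), "R6": (39, 50),
             "R7": (66, 60), "R8": (450, 500)}      # name: (MIPS, cost)
order = sorted(resources, key=lambda n: resources[n][1] / resources[n][0])
print(order)   # ['R7', 'R5', 'R2', 'R4', 'R8', 'R3', 'R6', 'R1']
```

R7 (60/66, about 0.91 INR per MIPS) is the cheapest per unit of computing power and R1 (100/42, about 2.38) the dearest, so the rate based algorithm fills R7 first.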
Table 2: Comparison of the Proposed algorithm with the First-price and Classified Algorithms

                  Proposed Rate Based      Classified               First-price
Task  Gridlets   Makespan   Cost (INR)    Makespan   Cost (INR)    Makespan   Cost (INR)
                 (sec)                    (sec)                    (sec)
1     5          3.23       1440          1.09       1950          3.22       1500
2     10         4.89       2240          1.96       3950          6.44       5000
3     15         6.46       3480          3.92       4990          9.67       7500
4     20         5.73       4740          8.83       5480          12.89      10000
5     25         4.89       5990          18.51      6350          16.11      12500
6     30         5.09       6990          159.85     9000          19.33      15000
7     35         12.16      8710          196.49     10500         22.56      17500
8     40         16.78      10040         217.13     12000         25.78      20000

Table 3: Allocation of resources for Task4 (resources R1-R9 against the gridlet groups assigned by the Rate Based, Classified and First-price algorithms)
Fig 1: Comparison of the time required for task completion by all three algorithms

Fig 2: Comparison of the cost required for task completion by all three algorithms

Remarks: the following observations, with reasoning, are made on the results obtained in Table 2.
1) Irrespective of the magnitude of the task, the cost required by the proposed algorithm is the minimum, because rate based selection of resources is employed.
2) As the magnitude of the task increases (see after Task3 in Table 2), the time required to complete the task also reduces. This is due to the reduced communication overhead resulting from the grouping of gridlets.
The superiority of the proposed method is thus apparent.

4 Conclusion
The paper proposes an algorithm for resource allocation in a grid environment with a single-round auction. The proposed algorithm considers the ratio of cost to MIPS, i.e. the rate of a resource, in its allocation, and also incorporates appropriate grouping of gridlets to minimize the transition time of each gridlet to the resource, thus balancing time and cost comprehensively. The rate based algorithm is compared with the First-price and Classified algorithms from the literature using the GridSim Toolkit 4.0, and the results demonstrate that the proposed algorithm has merit.
5 References
[1] Daniel Grosu and Anubhav Das. Auction-Based Resource Allocation Protocols in Grids. In Proc. of the 16th IASTED International Conference on Parallel and Distributed Computing and Systems (PDCS 2004), November 9-11, 2004, MIT, Cambridge, Massachusetts, USA, pp. 20-27.
[2] Mathias Dalheimer, Franz-Josef Pfreundt and Peter Merz. Agent-based Grid Scheduling with Calana. In Parallel Processing and Applied Mathematics (PPAM 2005), vol. 3911, Lecture Notes in Computer Science, Springer, pp. 741-750, 2005.
[3] Marcos Dias de Assunção and Rajkumar Buyya. An Evaluation of Communication Demand of Auction Protocols in Grid Environments. In Proceedings of the 3rd International Workshop on Grid Economics & Business (GECON 2006), World Scientific Press, May 16, 2006, Singapore.
[4] Nithiapidary Muthuvelu, Junyang Liu and Nay Lin Soe. Task Scheduling for Coarse-Grained Grid Applications. Grid Computing and Distributed Systems Laboratory, Department of Computer Science and Software Engineering, The University of Melbourne, Australia.
[5] Li Kenli, Tang Xiaoyong and Zhaohuan. Grid Classified Optimization Scheduling Algorithm under the Limitation of Cost and Time. In Proceedings of the Second International Conference on Embedded Software and Systems (ICESS'05), IEEE, Dec. 2005, pp. 496-500.
[6] URL of the GridSim simulator: http://sourceforge.net/projects/gridsim
[7] Anthony Sulistio, Uros Cibej, Srikumar Venugopal, Borut Robic and Rajkumar Buyya. A Toolkit for Modeling and Simulating Data Grids: An Extension to GridSim. Concurrency and Computation: Practice and Experience (CCPE), Wiley Press, New York, USA (in press, accepted on Dec. 3, 2007).

Appendix
The GridSim toolkit 4.0 is used. The First-price sealed, Classified and proposed rate based algorithms are implemented using the flow chart given in Fig 3.

Fig 3: Process flow used in the GridSim simulation for implementing the algorithms.
A Coordination Framework for Sharing Grid Resources

D. R. Aremu1, M. O. Adigun2
Department of Computer Science, University of Zululand, KwaDlangezwa, KwaZulu-Natal, South Africa

ABSTRACT: Distributed market based models are frequently used for resource allocation on the computational grid. But, as the grid size grows, it becomes more difficult for a customer to directly negotiate with all the grid resource providers. Middle agents are introduced to mediate between the providers and customers so as to facilitate an effective resource allocation process. This paper presents a new market based model called the Cooperative Modeler for mediating resource negotiation. By pooling resources together in a cooperative manner, the resource providers can combine sales returns and operating expenses, and efficiently distribute sales among members in proportion to the volume of contribution over a specified time. The paper discusses the framework designed for implementing the Cooperative Modeler.

1. Introduction
Utility Grid Computing is emerging as a new paradigm for solving large-scale problems in science, engineering, and commerce [1][2][3]. The Grid technologies enable the creation of Virtual Enterprises (VE) for resource sharing by logically coupling millions of geographically distributed resources across multiple organizations, administrative domains, and policies. The Grid comprises heterogeneous resources (PCs, workstations, clusters, and supercomputers), fabric management systems (single system image OS, queuing systems, etc.) and policies, and applications (scientific, engineering, and commercial) with varied CPU, I/O, memory, and/or network intensive requirements. The users, producers (also called resource owners) and consumers have different goals, objectives, strategies, and demand patterns. More importantly, both resources and end-users are geographically distributed, in different time zones [3]. These factors lead to a complex control and coordination problem, which is frequently solved with an economic model based on real or virtual currency. The nature of economic models, which imply self-interested participants with some level of autonomy, makes an agent-based approach the preferred choice for this study. This paper presents a new middleware agent called the Cooperative Modeler for mediating resource negotiation in grid computing environments, and discusses the architecture designed for its implementation. The presented model, enabled by the Utility Grid Computing infrastructure, allows providers of resources to come together in a cooperative
relationship to do business. A cooperative is an autonomous association of agents united voluntarily to meet their common economic, social, and cultural needs and aspirations through a jointly owned and democratically controlled enterprise. A cooperative business is a business owned and controlled by the people who use its services. By pooling resources together in a cooperative manner, providers of resources combine sales returns and operating expenses, pro-rating or distributing sales among members in proportion to the volume each provides through the cooperative over a specified time. A cooperative may operate a single resource pool or multiple resource pools. In a pool operation, members bear the risks and gains of changes in market prices. The advantages of pooling are: (i) it spreads market risks; (ii) it permits management to merchandise resources according to a program it deems most desirable and one that can be planned with considerable precision in advance; and (iii) it permits management to use caution in placing and timing shipments according to market demands and in developing new markets (i.e., orderly marketing); it also helps finance the cooperative. The rest of the paper is organized as follows: section 2 discusses related work; section 3 presents the Cooperative Modeler and the architecture designed for its implementation; section 4 presents the software specification/requirements and design used for implementing the Cooperative Modeler; section 5 concludes the paper.

2.0 Related Work
Economics-based models have been used for coordinating grid resource allocation in the grid computing environment. The economic based approach provides a fair basis for successfully coordinating the decentralization and heterogeneity that are present in grid economies. The models investigated are classified into auction models, bilateral negotiation, and other negotiation models.

2.1 Auction Models
Auctions are highly structured forms of negotiation between several agents. An auction is a market institution with an explicit set of rules determining resource allocation and prices on the basis of bids from the market participants. Auctions enable the sale of resources without a fixed price or a standard value. The typical purpose of an auction is for the seller to obtain a price which lies as close as possible to the highest valuation among potential buyers. The auction model supports one-to-many negotiation, between a grid resource provider (seller) and many grid resource consumers (buyers), and reduces negotiation to a single value (i.e. price). The three key players involved in an auction are the grid resource owners, the auctioneer (mediator) and the buyers. In a Grid environment, providers can use an auction protocol for deciding service value/price. The steps involved in the auction process are: (i) a Grid Service Provider announces its services and invites bids; (ii) brokers offer their bids, and can see what other consumers offer depending on whether the auction protocol is open or closed; (iii) the broker and Grid Service Provider communicate privately and use the resource (R). The contents of the deal template used for work announcements in an auction include the addresses of the users, the eligibility requirements specifications, the task/service abstraction, an optional price that the user is willing to invest, the bid specification (what the offer should contain) and the expiration time (the deadline for receiving bids). From a Grid Service Provider's perspective, the process in an auction is: (i) receive tender announcements/advertisements (in the Grid Market Directory); (ii) evaluate the service capability; (iii) respond with a bid; (iv) deliver the service if the bid is accepted; (v) report the result and bill the user as per the usage and agreed bid.

2.2 Bilateral Negotiation Model
Unlike the auction models, which support one-to-many negotiation between a grid service provider (seller) and many grid service consumers (buyers) and reduce negotiation to a single issue value (i.e. price), the bilateral negotiation model involves two parties and a multiple-issue value scoring model, as discussed in [4]. In a two-party negotiation sequence called a negotiation thread, offers and counter-offers are generated by linear combinations of simple functions called tactics. Tactics generate an offer or counter-offer considering multiple criteria such as price, quantity, quality of service, delivery time, etc. To achieve flexibility in negotiation, agents may wish to change their ratings of the importance of the different criteria, and their tactics may vary over time. Strategy is the term used to denote the way in which an agent changes the weights of tactics over time; a strategy combines tactics depending on the negotiation history.

2.3 Tender/Contract-Net Model
The Tender/Contract-Net model is one of the most widely used models for service negotiation in distributed problem solving environments [5]. It is modeled on the contracting mechanism used by businesses to govern the exchange of goods and services, and helps in finding an appropriate service provider to work on a given task. A user/resource broker asking for a task to be solved is called the manager, and a resource that might be able to solve the task is called a potential contractor. From a manager's perspective, the process in the Tender/Contract-Net model is: (i) the consumer (broker) announces its requirement (using a deal template) and invites bids from Grid Service Providers; (ii) interested Grid Service Providers evaluate the announcement and respond by submitting their bids; (iii) the broker evaluates and awards the contract to the most appropriate Grid Service Provider(s); (iv) step (ii) goes on until no one is willing to bid a higher price, or the auctioneer stops if the minimum price line is not met; (v) the Grid Service Provider offers the service to the winner; (vi) the consumer uses the resource.

2.4 Bid-Based Proportional Resource Sharing
Market-based proportional sharing systems are quite popular in cooperative problem-solving environments such as clusters (in a single administrative domain). In this model, the percentage of the resource share allocated to a user application is proportional to that user's bid value in comparison to other users' bids. Users are allocated credits or tokens, which they can use to gain access to resources. The value of each credit depends on the resource demand and on the value that other users place on the resource at the time of usage. Consider two users wishing to access a resource with similar requirements: the first user is willing to spend 2 tokens and the second user is willing to spend 4 tokens. In this case the first user gets one third of the resource share, whereas the second user gets twice as much as the first (i.e. two thirds of the resource share), which is proportional to the value that both users place on the resource for executing their applications. This strategy is a good way of managing a large shared resource in an
Int'l Conf. Grid Computing and Applications | GCA'08 |
organization or resource owned by multiple individuals can have a credit allocation mechanism depending on
53
3.0 Proposed Cooperative Middleware for Resource Negotiation
the investment they made. They can specify how much credit they are willing to offer for running their
This section discussed the Cooperative Modeler, and
application on the resource.
the architecture designed for its implementation. The Cooperative Modeler adopts the concept of Utility grid
2.5.3 Monopoly/Oligopoly Unlike the previously discussed auction models which assumed a competitive market where several Grid Service Providers and brokers/consumers determine the market price, there exist cases, where a single Grid Service Provider dominates the market and is therefore the single provider of a particular service. In economic theory this model is known as monopoly. Users cannot influence prices of services and have to choose the service at the price given by the single Grid Service Provider who monopolizes the Grid marketplace. An example is where a single site puts the prices into the Grid Market Directory or information services and brokers consult it without any possibility of negotiating prices. A monopoly’s offer of resources is usually decoupled from the price at which it acquired the resource. The classical problem of a monopoly is that it sets higher price than marginal cost and this distorts the trade-off in the grid economy, and moves it away from
computing, which is tied to utility computing where users can request for resources when ever needed (i.e. on-demand) and only be charged for the amount being used. Individual members of a Cooperative group need not own or purchase expensive grid resources for a specific project but can instead choose to “rent” or share
among
trusted
parties,
members
of
the
cooperative. The Cooperative Modeler is a dynamic alliance of autonomous resource providers distributed across organizations and administrative domains that bring in their complementary competencies and resources that are collectively available to each other through a virtualized pool called Cooperative Market, with the objective to deliver products or services to the market as a collective effort. The Cooperative Modeler adopt bilateral negotiation model to enable the consumers of services to negotiate for services in real time.
3.1 Architecture of the Cooperative Modeler
Pareto efficiency. The fact that a monopoly does not face the discipline of competition means that a monopoly may operate inefficiently without being corrected by the grid marketplace. The competitive markets are one extreme and monopolies are the other extreme. In most of the cases, the market situation is an oligopoly which is in between these two extreme cases: a small number of Grid Service Providers dominate the market and set the prices.
The architecture of the Cooperative Modeller is made up of three components namely: the Client component, the Cooperative Middleware component, and the Providers component. The architecture promotes a business situation involving three stakeholders with three major business roles: (i) End-user - The End-user role is played by a stakeholder (client) who consumes services, (ii) Mediator - The Mediator role is the key player and this role is played by the Cooperative Middleware agents, (iii) Service Provider - The service Provider role is played by a service owner who offers
his services to the end user (the client). The Client Component is made up of a set of n clients; each client has at least one task, of varying length and resource requirements, to execute. The Provider Component is composed of a set of m Providers, who form the cooperative group. The dynamic nature of this model makes it possible for a provider of resources to be, at the same time, a client requesting resources. The Cooperative Middleware Component is made up of five interacting agents charged with the responsibility of coordinating resource-sharing negotiation at the grid resource pool. These agents are the Liaising-Agent, the Information-Agent, the Marketing-Agent, the Resource-Control-Agent, and the Execution-Agent. Table I gives the description of these agents.

Table I: Description of the Agents at the Cooperative resource pool

Client-Agent: This agent acts on behalf of the clients (the end users) to negotiate for resources.
Liaising-Agent: This agent acts as the controller, a manager of managers over the other agents at the resource pool. Requests for resources are always directed to it; it liaises between the clients, the providers, and the agents at the resource pool.
Information-Agent: This agent is the manager in charge of the resource pool. It maintains the knowledge base of the resources at the pool, and interacts closely with the pool and the Marketing-Agent.
Marketing-Agent: The Marketing-Agent is the expert (manager) in charge of resource negotiation. It is equipped with negotiation tactics so as to optimize the objectives of the resource providers.
Resource-Control-Agent: This agent is responsible for sales recording/documentation and the issuance of resource ids; it is the manager (accountant) in charge of record keeping at the resource pool.
Execution-Agent: This agent is responsible for the execution of the clients' tasks.

4. Implementation Design for the Proposed Cooperative Modeler

Table II gives the specification of the negotiation protocol between the Marketing-Agent and the Client-Agent of the Cooperative Modeler. The role players in this protocol are the Client-Agent, who plays the role of a buyer of resources, and the Marketing-Agent, who plays the role of a seller of resources. The communication between the two agents consists of an exchange of messages: buyers query sellers for resources that meet their task requirements, query for the prices of the available resources, and exchange proposals, followed by proposal-accepted or counter-proposal messages. If the two agents agree on a deal, a sale-accepted message must be reported, followed by sale-confirmed. The execution protocol (Table III) likewise consists of an exchange of requests/responses between the Client and the Execution-Agent over task execution; the Execution-Agent acts in the role of task executor. The protocol allows the Client to request the status of the task execution during the execution period. The end of task execution is reported immediately after execution finishes.
Table II: Specification of the Negotiation Protocol between the Client-Agent and the Marketing-Agent.

Roles: Client-Agent as Buyer; Marketing-Agent as Seller
Messages:
  ResourceQuery (Client-Agent → Marketing-Agent)
  PriceQuery (Client-Agent → Marketing-Agent)
  PriceOffer (Client-Agent ← Marketing-Agent)
  NoOffer (Client-Agent ← Marketing-Agent)
  SaleAccept (Client-Agent → Marketing-Agent)
  CounterOffer (Client-Agent → Marketing-Agent)
  TerminateNegotiation (Client-Agent → Marketing-Agent)
  SaleConfirm (Client-Agent → Marketing-Agent)
Contract: SaleAccept must be followed by SaleConfirm.

Table III: Specification of the Execution Protocol

Roles: Client as Client; Execution-Agent as Executor
Messages:
  ExecutionRequest (Client → Execution-Agent)
  ExecutionAccept (Client ← Execution-Agent)
  ExecutionReject (Client ← Execution-Agent)
  ExecutionFinished (Client ← Execution-Agent)
  ExecutionQuery (Client → Execution-Agent)
  ExecutionStatus (Client ← Execution-Agent)
Contract: Execution must be started immediately if the resource id is valid; the end of execution must be reported to the Client immediately.
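As an illustration only, the seller side of the Table II message flow can be sketched as a small state machine. The class name, the price payloads, and the meet-halfway counter-offer tactic below are our assumptions; the paper does not give an implementation.

```python
from enum import Enum, auto

class Phase(Enum):
    BROWSING = auto()
    OFFERED = auto()
    ACCEPTED = auto()
    CONFIRMED = auto()
    TERMINATED = auto()

class MarketingAgent:
    """Seller side of the negotiation protocol in Table II (illustration only)."""
    def __init__(self, price_list):
        self.price_list = price_list      # resource name -> asking price
        self.phase = Phase.BROWSING

    def handle(self, message, payload=None):
        if message == "ResourceQuery":    # buyer asks what is available
            return list(self.price_list) or "NoOffer"
        if message == "PriceQuery":       # buyer asks the price of a resource
            if payload not in self.price_list:
                return "NoOffer"
            self.phase = Phase.OFFERED
            return ("PriceOffer", self.price_list[payload])
        if message == "CounterOffer":     # buyer proposes its own price
            resource, bid = payload
            # toy tactic (our assumption): meet the buyer halfway each round
            self.price_list[resource] = (self.price_list[resource] + bid) / 2
            return ("PriceOffer", self.price_list[resource])
        if message == "SaleAccept":       # buyer accepts the current offer
            self.phase = Phase.ACCEPTED
            return "SaleAccepted"
        if message == "SaleConfirm":      # Table II contract: only after SaleAccept
            assert self.phase == Phase.ACCEPTED, "SaleConfirm before SaleAccept"
            self.phase = Phase.CONFIRMED
            return "SaleConfirmed"
        if message == "TerminateNegotiation":
            self.phase = Phase.TERMINATED
            return "Terminated"
        raise ValueError(f"unknown message {message!r}")

seller = MarketingAgent({"cluster-A/32cpu": 100.0})
assert seller.handle("PriceQuery", "cluster-A/32cpu") == ("PriceOffer", 100.0)
assert seller.handle("CounterOffer", ("cluster-A/32cpu", 80.0)) == ("PriceOffer", 90.0)
assert seller.handle("SaleAccept") == "SaleAccepted"
assert seller.handle("SaleConfirm") == "SaleConfirmed"
```

The `assert` on SaleConfirm encodes the Contract row of Table II: a confirmation is only legal once a sale has been accepted.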
Figure 1 (class diagram) lists the operations exposed by each agent:

Client-Agent: resourceQuery(resource : String) : String; negotiate(resource : String) : String
Liaising-Agent: CoordinateRequest(request : String) : String
Information-Agent: process(request : String) : String; matchTask_1WithResource(req : String) : String … matchTask_nWithResource(req : String) : String
Marketing-Agent: process(request : String) : String; negotiate(resource : String) : String
Resource-Control-Agent: documentSale(request : String) : String; IssueResourceID(resource : String) : String
Execution-Agent: execute(request : String) : String; executeStatus(JobStatusRequest : String) : String

Figure 1: The class diagram describing the interaction pattern for the implementation of the Cooperative Modeler
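The call pattern the class diagram suggests can be rendered roughly as follows. This is a sketch under our own naming and wiring assumptions (snake_case method names, string payloads, and the way the Liaising-Agent chains the other agents); it mirrors the diagram's interfaces, not the authors' code.

```python
class InformationAgent:
    """Keeps the knowledge base of pooled resources and matches tasks to them."""
    def __init__(self, pool):
        self.pool = pool  # assumed shape: {"task kind": "resource name"}

    def match_task_with_resource(self, request: str) -> str:
        return self.pool.get(request, "no-match")

class MarketingAgent:
    """Negotiates on behalf of the providers (tactics elided in this sketch)."""
    def negotiate(self, resource: str) -> str:
        return f"deal:{resource}"

class ResourceControlAgent:
    """Records sales and issues resource ids."""
    def __init__(self):
        self.next_id = 0

    def issue_resource_id(self, resource: str) -> str:
        self.next_id += 1
        return f"rid-{self.next_id}"

class ExecutionAgent:
    """Runs the client's task on the negotiated resource."""
    def execute(self, job: str) -> str:
        return f"done:{job}"

class LiaisingAgent:
    """Controller ('manager of managers'); all resource requests go through it."""
    def __init__(self, information, marketing, control, execution):
        self.information = information
        self.marketing = marketing
        self.control = control
        self.execution = execution

    def coordinate_request(self, request: str) -> str:
        # match -> negotiate -> record/issue id -> execute, as in Figure 1
        resource = self.information.match_task_with_resource(request)
        price = self.marketing.negotiate(resource)
        resource_id = self.control.issue_resource_id(resource)
        return self.execution.execute(f"{request}@{resource}#{resource_id}({price})")

liaison = LiaisingAgent(InformationAgent({"render": "cluster-A"}),
                        MarketingAgent(), ResourceControlAgent(), ExecutionAgent())
result = liaison.coordinate_request("render")
```

Here `result` is `"done:render@cluster-A#rid-1(deal:cluster-A)"`, showing one request flowing through all five agents in the order the diagram implies.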
5.0 Conclusion

In this paper, a survey of economic models for negotiating grid resources was carried out. The paper discussed the architecture of a new middleware called the Cooperative Modeler, whose purpose is to mediate resource negotiation between providers and consumers of resources. The paper also presented the design framework for the implementation of the Cooperative Modeler. The presented model, enabled by the Utility Grid computing infrastructure, allows the providers of grid resources to collaborate by coming together in a cooperative relationship and to contribute their core competences and share resources such as information, knowledge, and market access in order to exploit fast-changing market opportunities.
A Scheduling Algorithm Using Static Information of Grid Resources
Oh-Han Kang(1), Sang-Seong Kang(2)
(1) Dept. of Computer Education, Andong National University, Andong, Kyungbuk, Korea
(2) Dept. of Educational Technology, Andong National University, Andong, Kyungbuk, Korea
Abstract - In this paper, we propose a new algorithm that revises the logic of the WQR (Workqueue Replication) algorithm to reflect static information, and we show through simulation that it provides better performance than the existing method.

Keywords: Grid, Scheduling, Workqueue replication, Static information

1 Introduction
Scheduling algorithms for grid systems can be classified into a number of groups according to the characteristics of resources and tasks, the scheduling time, and the intended goal. We focus our study on algorithms that aim to minimize the completion time of mutually independent tasks in batch mode. With this goal, related work can be sorted by the type of information each algorithm uses.

1.1 Algorithms using information for performance evaluation
Algorithms that utilize information about the length of assigned tasks, the preparation time of resources, and processing capacity date back to conventional parallel and distributed environments. Typical algorithms range from Min-Min to Max-Min. From the unassigned tasks, the Min(Max)-Min algorithm selects the task with the minimum (maximum) completion time and assigns it to the resource expected to yield that minimum completion time. Since Min(Max)-Min can be simply implemented, applications of these algorithms are easily found in other settings. He [6] suggested a modification of Min-Min to complete tasks that require QoS in grid computing in the shortest amount of time, and Wu [7] separated sorted tasks into segments before applying Min-Min. In contrast to Min(Max)-Min, algorithms developed specifically for grid systems were introduced by Buyya [1] and Muthuvelu [2]. Buyya suggested an algorithmic model that searches for the optimal combination of resources within a budget, since it is important to consider the financial cost of resources in a grid system. Because his model includes cost optimization, time optimization, and cost-time optimization, an appropriate algorithm can be chosen for the particular significance of the tasks and the specific budget of grid users. Muthuvelu proposed a strategy that groups tasks composed of many small fragments and distributes resources in specific amounts.

1.2 Algorithms not using information for performance evaluation

Measuring the capacity of a resource and the exact length of a task to be allocated, in consideration of the current load, is not an easy problem. Due to the characteristics of the grid system, it is hard to maintain capacity and load information for each resource in real time, and completion times are therefore difficult to predict, so the results of such evaluation are often ineffectual. Even if it were possible to evaluate with complete precision, the relatively long processing time of tasks in a grid system leads to continual change in the load of resources, so the scope of application remains limited. For these reasons, grid scheduling algorithms that do not use information for performance evaluation have been suggested [3, 4]. The simplest such algorithm is Workqueue. Workqueue allocates tasks to resources one by one, in order, and immediately assigns another task to each resource that returns a result. Because more tasks end up allocated to fast resources and fewer to slow ones, this algorithm keeps the total task completion time small. Subramani [4] allocated each task to a number of resources repeatedly and cancelled the processing of the remaining replicas when the task completed on one of the resources; because a task waiting queue is located at each resource, the utilization of resources is enhanced and the total completion time is decreased as well. The two algorithms above are simple to implement. However, excessively slow resources, or resources that become unusable for various reasons, can frequently cause extremely long total completion times. To solve this problem, the WQR (Workqueue Replication) algorithm [3] was suggested. WQR is similar to Workqueue in its basic distribution method, but does not stop at distributing each task once: until all tasks are completed, WQR distributes incomplete tasks repeatedly, within a certain limit. It can therefore maintain stability under excessive load or faults of particular resources. However, when many users utilize the same scheduling strategy on the grid, overhead can arise from running the same replicated task on a number of resources.
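For contrast with these information-free strategies, the Min-Min heuristic of Section 1.1 can be sketched as follows. The data model (task lengths in instructions, capacities in instructions per second) is our assumption for illustration.

```python
def min_min(task_lengths, capacities):
    """Min-Min: repeatedly pick the unassigned task with the smallest
    minimum completion time and assign it to the resource achieving it.
    task_lengths: instructions per task; capacities: instructions/sec."""
    ready = [0.0] * len(capacities)   # time at which each resource becomes free
    schedule = []                     # list of (task, resource) assignments
    unassigned = list(range(len(task_lengths)))
    while unassigned:
        # smallest completion time over all (task, resource) pairs
        finish, t, r = min(
            (ready[r] + task_lengths[t] / capacities[r], t, r)
            for t in unassigned for r in range(len(capacities))
        )
        ready[r] = finish
        schedule.append((t, r))
        unassigned.remove(t)
    return schedule, max(ready)

# three tasks, a fast resource (300) and a slow one (100)
schedule, makespan = min_min([1200, 600, 300], [300, 100])
```

With these numbers the shortest tasks are placed first and everything lands on the fast resource, giving a makespan of 7.0 seconds.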
2 Simulation environment

WGridSP [4], a tool for the performance evaluation and comparison of algorithms that employs GridSim (a Java-based simulation toolkit) as its engine, constitutes the grid by selecting resource features randomly from bounded distributions. The maximum load of a resource can be designated by the user; at the minimum load of 0%, the full processing capacity can be utilized. The weighting factor, indexed by time period, can also be changed for each period. For the simulation, the number of resources is set to 50, and the minimum load of each resource is chosen to be 0%. The maximum load is varied over 10%, 30%, 50%, 70%, and 90%. The reason for comparing algorithms according to load is that load is considered the most critical factor in turnaround time in a grid system: when the load of a resource remains excessively high, the resource is either being used by its local system or has a considerable number of tasks concentrated on it, and it can sometimes be viewed as temporarily unusable. The load of each resource varies from its minimum to its maximum; the maximum load is multiplied by the time-period weighting factor, which changes the current load of each resource. To prevent every resource from simultaneously having the same load, the GMT-based time period of each resource is randomly assigned from the uniform distribution [0, 23].

The application to be processed consists of 200 tasks, whose lengths are chosen from the uniform distribution [1,000,000, 5,000,000]. As a result, when a task is assigned to a resource with 0% load, the processing time of a single task ranges from 2,000 seconds to 5,000 seconds. So as to ignore the communication time used to distribute a task to a resource and return the result, the input and output data of each task are set to 0. Every simulation is performed 10 times under the same conditions, and the arithmetic mean is used as the total makespan.

To analyze the efficacy of the information used by each algorithm, two types of simulation are performed. The first fixes the CPU processing capacity to 300 and the number of CPUs to 1. The number expressing CPU processing capacity is the number of instructions processed in one second; for example, a task of length 1200 can be processed in 4 seconds by a resource with 1 CPU of capacity 300.
In the other simulation, the CPU processing capacity and the number of CPUs of each resource are selected arbitrarily, so each resource has a different capacity and CPU count. The CPU processing capacity is selected randomly from the uniform distribution [100, 500], and the number of CPUs from the uniform distribution [1, 8].
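The processing-time rule above (time = task length / capacity, e.g. 1200 / 300 = 4 s) can be sketched as follows. How load scales effective capacity, and the handling of multiple CPUs, are our simplifying assumptions rather than the simulator's exact model.

```python
def processing_time(task_length, cpu_capacity, num_cpus=1, load=0.0):
    """Seconds to run a task of `task_length` instructions on a resource with
    `num_cpus` CPUs of `cpu_capacity` instructions/sec each, of which a
    fraction `load` is consumed by local work. Treating the CPUs as fully
    usable by one task is a simplification made for this illustration."""
    effective = cpu_capacity * num_cpus * (1.0 - load)
    if effective <= 0:
        raise ValueError("resource is fully loaded")
    return task_length / effective

print(processing_time(1200, 300))            # the worked example from the text
print(processing_time(1200, 300, load=0.5))  # half the capacity, twice the time
```

Under this model, raising the load of a resource from 0% toward 100% stretches every task's processing time toward infinity, which is why the paper treats load as the most critical factor in turnaround time.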
3 Suggesting a new algorithm

3.1 Points to improve in existing algorithms
According to the type of information and the distribution of resources used in grid scheduling, the following points to improve in existing algorithms can be deduced.

Avoidance of overloaded or incapacitated resources. From the performance evaluation of the four algorithms, the increase in task completion time when the maximum load exceeds 70% is smallest for WQR, an advantage mostly derived from its task duplication strategy. A scheduling algorithm for a grid system should include a strategy to evade resources that have extremely low capacity or are in an unusable state.

The exclusion of dynamic information. The Min-Min algorithm, which utilizes real-time dynamic information, showed the worst quality on a resource set where the static information is identical. This defect is attributed to the time lag that prevents it from coping with changes in the utilization of resources. In other words, dynamic (real-time) information about resources appears to be of no help in task scheduling for a grid system; moreover, when real-time information is refreshed at each task distribution, the overhead further decreases the performance of the algorithm. Therefore, unless there exists a mechanism that can constantly monitor dynamic information and cancel or migrate running tasks accordingly, the use of dynamic information should be restrained.

The application of static information. Although the WQR algorithm showed good performance in a grid whose resources have identical static information, it performed worse in the multiprocessor setting when the maximum load was below 70%. Ultimately, the inability to account for the number of processors and the processing capacity of resources, i.e. their static information, leads to inefficient use of resources. Since static information is unlikely to change over time, actively applying it in a scheduling algorithm can help reduce the completion time.
3.2 New algorithm

In this paper, we develop WQRuSI (Workqueue Replication using Static Information), which incorporates three improvements: avoidance of overloaded or incapacitated resources, exclusion of dynamic information such as real-time load, and application of unchanging static information. To account for these three factors, the resources are first sorted by CPU processing ability, the basic static information. When tasks are distributed, the number of CPUs held by a resource determines how many tasks it receives. Because the basic allocation strategy repeatedly assigns the same task to two or more resources, the WQR method, which guards against the failure or slowing of a resource, is retained. The detailed logic of WQRuSI is described in [Figure 1].

  Sort the available resources in descending order of processor capacity;
  Duplicate all tasks MaxReplication times;
  Save the duplicated tasks to TaskManager;
  for i := 1 to ResourceList.size do
    for j := 1 to ResourceList[i].PEList.size do
      Take a task from TaskManager such that the task is not already allocated to the same resource;
      Allocate the selected task to ResourceList[i];
      if TaskManager is empty then break all loops;
    end for
  end for
  while TaskManager is not empty do
    Wait for a completed task from the resources;
    if the completed task also exists on other resources then
      Cancel those tasks;
      Take as many tasks from TaskManager as were cancelled, such that a task is not allocated to the same resource;
      Allocate the selected tasks to the cancelled resources;
    end if
    Take a task from TaskManager such that the task is not already allocated to the same resource;
    Allocate the selected task to the resource that returned the result;
  end while
  Wait for the de-allocation of tasks;

[Figure 1] WQRuSI Algorithm using Static Information of Grid Resources
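The logic of [Figure 1] can be sketched as a sequential simulation in Python. This is an illustrative re-implementation under our own data model ((capacity, CPU count) pairs, task lengths in instructions); replicas that finish after the first copy are simply discarded here, rather than actively cancelled to free the resource early as in the original.

```python
import heapq

def wqrusi(task_lengths, resources, max_replication=2):
    """Sketch of WQRuSI: sort resources by capacity, replicate each task up to
    `max_replication` times, keep only the first copy of each task to finish.
    `resources` is a list of (capacity, num_cpus); returns the makespan."""
    # 1. align resources in descending order of processing ability
    order = sorted(range(len(resources)), key=lambda r: -resources[r][0])
    # one dispatch slot per processing element (PE), i.e. per CPU
    pes = [(r, pe) for r in order for pe in range(resources[r][1])]
    queue = [t for t in range(len(task_lengths)) for _ in range(max_replication)]
    events, running, done = [], {}, set()
    clock = 0.0

    def dispatch(r, pe):
        while queue:
            t = queue.pop(0)
            # skip finished tasks and replicas already on this resource
            # (such replicas are discarded in this simplified sketch)
            if t in done or (t, r) in running.values():
                continue
            running[(r, pe)] = (t, r)
            finish = clock + task_lengths[t] / resources[r][0]
            heapq.heappush(events, (finish, r, pe, t))
            return

    for r, pe in pes:                       # initial round of allocation
        dispatch(r, pe)
    finish_times = []
    while events:                           # event loop: completions
        clock, r, pe, t = heapq.heappop(events)
        running.pop((r, pe), None)
        if t not in done:                   # first copy to finish wins
            done.add(t)
            finish_times.append(clock)
        dispatch(r, pe)                     # refill the freed PE
    return max(finish_times)

# fast resource finishes the replica of task 0 before the slow one does
print(wqrusi([300, 600], [(300, 1), (100, 1)]))
```

In the example, the fast resource (capacity 300) completes both tasks by t = 3.0, while the slow resource's replica of task 0 is discarded, so the makespan is 3.0.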
3.3 Performance of the Algorithm
To evaluate the suggested algorithm, WQRuSI was simulated against WQR and Min-Min, the algorithms that showed competence in the two situations above. The simulation environment was set up in the same way as for the earlier analysis. The results are shown in [Figure 2] and [Figure 3].
[Figure 2] Performance of WQRuSI with consistent static information

[Figure 3] Performance of WQRuSI with variable static information

While WQRuSI showed performance similar to WQR in the setting with consistent static information, it showed the best performance in the environment where the static information of resources is selected randomly. When the static information is identical, an algorithm that accounts for static information of resources cannot show much improvement; however, in the environment where the number of processors and the processing ability vary, the improvement in performance was clear.
4 Conclusion

In this paper, we proposed the WQRuSI algorithm. Simulated in the same environment as the preceding experiments, WQRuSI showed performance similar to WQR when the static information of resources is fixed. In contrast, when the static information varied, WQRuSI demonstrated
much greater improvement than the ordinary algorithms. Because resources in an actual grid setting have diverse capacities and numbers of processors, WQRuSI is expected to contribute much to reducing the task completion time.
5 References
[1] R. Buyya, "Economic-based Distributed Resource Management and Scheduling for Grid Computing", Ph.D. Thesis, Monash University, Melbourne, Australia, 2002.
[2] N. Muthuvelu, J. Liu, N. L. Soe, S. Venugopal, A. Sulistio and R. Buyya, "A Dynamic Job Grouping-Based Scheduling for Deploying Applications with Fine-Grained Tasks on Global Grids", AusGrid 2005, Vol. 44, 2005.
[3] D. P. Silva, W. Cirne and F. V. Brasileiro, "Trading Cycles for Information: Using Replication to Schedule Bag-of-Tasks Applications on Computational Grids", in Proc. of Euro-Par 2003, pp. 169-180, 2003.
[4] V. Subramani, R. Kettimuthu, S. Srinivasan and P. Sadayappan, "Distributed Job Scheduling on Computational Grids using Multiple Simultaneous Requests", in Proc. of the 11th IEEE Symposium on HPDC, pp. 359-366, 2002.
[5] O. H. Kang and S. S. Kang, "Web-based Dynamic Scheduling Platform for Grid Computing", IJCSNS International Journal of Computer Science and Network Security, Vol. 6, No. 5B, May 2006.
[6] X. He, X. Sun and G. von Laszewski, "A QoS Guided Min-Min Heuristic for Grid Task Scheduling", Journal of Computer Science and Technology, Special Issue on Grid Computing, Vol. 18, No. 4, pp. 442-451, July 2003.
[7] M. Wu, W. Shu and H. Zhang, "Segmented Min-Min: A Static Mapping Algorithm for Meta-Tasks on Heterogeneous Computing Systems", in Proc. of the 9th HCW, pp. 375-385, 2000.
e-Science Models and the Research Life Cycle: How Will It Affect the Philippine Community?

Junseok Hwang*, Emilie Sales Capellan*, Roy Rayel Consulta*
International Information Technology Policy Program (ITPP), Seoul National University, Seoul, South Korea
[email protected], {capellan, roycons2001}@tepp.snu.ac.kr
* Corresponding authors

Abstract - In digital information processes, life cycle models shape the methods and ways in which learners study. A life cycle represents a model, such as the research process, as a chain of sequentially interconnected stages or phases in which information is manipulated or produced. This paper presents and discusses the ways in which the life cycle approach offers insight into the relationships among the stages and activities of research, especially in evolving technological fields such as e-Science. It also presents how the research life cycle in e-Science will affect the Philippine community. An understanding of this viewpoint may contribute further insight into the function of e-Science in the larger picture of methodical and scientific research.

Keywords: e-Science, Life Cycle, Grid Computing, Philippines

1 Introduction

The utilization of computers generates many challenges as it expands the field of the possible in methodical and scientific research, and many of these challenges are common to researchers in diverse areas. The insights achieved in one area may catalyze change and accelerate discovery in many others. It is no longer possible to do science without doing computing [1]. Computing in the sciences and humanities has developed a great deal over the past decades. The life cycle method makes us more sensitive to possible information loss in the gaps between stages. The transition points in the life cycle are essential junctions for further important activities in research fields such as e-Science. Many issues and streams of activity flow throughout the life cycle of research, including project administration, grant procurement, data management, knowledge creation, ethical judgments, intellectual property supervision, and technology management, as e-Science is being implemented. Linking activities across stages requires harmonization, coordination, and a sense of continuity in the overall process [2]. In the Philippines, the research undertaken in the Sustainable Technologies Group of De La Salle University uses a highly interdisciplinary approach to provide effective solutions to environmental problems [3]. These problems require an intelligent, integrated approach to yield solutions that are beneficial on a life cycle basis. The group uses the life cycle framework in most of its projects, and therefore makes use of advanced computing techniques such as:
• Knowledge-based and rule-based decision support systems
• Monte Carlo methods and fuzzy sets
• Pinch analysis
• Artificial neural networks
• Swarm intelligence
To be open and responsive to e-Science, researchers must evaluate and assess the services it provides for both research outcomes and data. Given the stages of the life cycle associated with e-Science, the services to be provided by research libraries, and the partnerships required to implement and sustain those services, need to be determined. There is barely a scientist or scholar remaining who does not use a computer for research purposes. Distinctive terms are in use to denote the fields that are particularly oriented to computing in specific disciplines. In the natural and technical sciences, the term "e-Science" has recently become popular, where the "e" of course stands for "electronic" [4]. Science is increasingly done through distributed worldwide collaborations enabled by the Internet, using very large data collections, tera-scale computing resources, and high-performance visualization.
With today's technology, a very powerful infrastructure is required to support and sustain e-Science. The Grid is an architecture proposed to bring all these issues together and make such a vision for e-Science a reality. In fields of technology such as Grid computing,
architecture examines Grid technology as a standard and generic integration mechanism assembled from Grid Services (GS), an extension of Web Services (WS) that complies with additional Grid requirements. The principal extensions from WS to GS are the management of state, identification, sessions and life cycles, and the introduction of a notification mechanism in conjunction with Grid service data elements [5]. The term e-Science is intended to capture a vision of the future of scientific research based on distributed resources, especially data-gathering instruments and group research. E-Science is scientific investigation performed through distributed global collaborations between scientists and their resources, and the computing infrastructure that enables this. Scientific progress increasingly depends on pooling know-how and results; making connections between ideas, people, and data; and finding and interpreting knowledge generated by strangers in ways other than those intended at its time of collection. E-Science offers a promising vision of how computer and communication technology can support and enhance the scientific process, by enabling scientists to generate, analyze, share, and discuss their insights, experiments, and results more effectively. In the Philippines, as technology has evolved, the agency called ASTI¹, mandated to conduct scientific research and development in the advanced fields of Information and Communications Technology and Microelectronics, undertakes projects committed to the development of its people and the country as a whole. ASTI's PSIGrid program will establish the necessary infrastructure and community linkages to operate its grid facility throughout the country [6]. ASTI will deploy a reliable and secure grid management system for managing users, nodes, and software to ensure the reliability and security of the entire grid.
2 Life Cycle Model of Research
This section describes the variety of methods used in communicating and coordinating research outcomes. The research outcomes, and the data upon which they are based, collectively document the knowledge for an area of study. The life cycle model helps monitor both the digital objects bound within a stage and those that flow across stages; this is represented in the lightly shaded box around Data and Research Outcomes [7]. Figure 1 shows how the life cycle model of research knowledge is created.
¹ Advanced Science and Technology Institute, an R&D agency of the Department of Science and Technology (DOST) in the Philippines
[Figure 1 depicts the life cycle as a chain of chevrons: Study Concept & Design → Data Collection → Data Processing → Data Access & Dissemination → Data Analysis → Research Outcomes, with Data Discovery, Data Repurposing, and the KT Cycle feeding back into the chain.]
Figure 1. Life cycle model of research knowledge. Source: Humphrey, Charles; e-Science and the Life Cycle of Research, IASSIST Communiqué, 2006
Every chevron in the above model symbolizes a stage in the life cycle of research knowledge creation. The spaces between chevrons indicate the transitions between stages. These transitions tend to be vulnerable points in the documentation of a project's life cycle: when a stage is completed, its information may not be systematically preserved and may instead end up dead-ended, most often on someone's hard drive. Shifts in the responsibility for the objects of research also tend to occur at these points of transition. For example, the data collection stage passes completed interviews or questionnaires to the data processing stage; the data processing stage passes one or more clean data files to the data access and dissemination stage. In each transition, someone else usually becomes responsible for the outcomes of the previous stage. These transition points become important areas in negotiating the digital curation plan for a project, as partners in the life cycle of research identify who is responsible for the digital objects created at each stage. In e-Science, the knowledge life cycle can be seen as a set of challenges as well as a sequence of stages, for each stage has variously been seen as a bottleneck. Knowledge acquisition was one bottleneck recognized early [8], but so too are modeling, retrieval, reuse, publication, and maintenance. In this section, we examine the nature of the challenges at each stage in the knowledge life cycle and review the various methods and techniques at our disposal. Although we often suffer from a deluge of data and too much information, all too often what we have is still insufficient or too poorly specified to address our problems, goals, and objectives. In short, we have insufficient knowledge. Knowledge acquisition sets the challenge of getting hold of the information that is around and turning it into knowledge by making it functional.
This might involve, for instance, making implied knowledge explicit, identifying gaps in the knowledge already held, acquiring and integrating knowledge from multiple sources (e.g. different experts, or distributed sources on the Web), or
acquiring knowledge from unstructured media (e.g. natural language or diagrams). A variety of techniques and methods has been developed to facilitate knowledge acquisition. Much of this work has been carried out in the context of attempts to build knowledge-based or expert systems. Techniques include varieties of interview, different forms of observation of expert problem-solving, methods of building conceptual maps with experts, various forms of document and text analysis, and a range of machine learning methods [9]. Each of these techniques has been found to be suited to the elicitation of different forms of knowledge, and to have different consequences in terms of the effort required to capture and model the knowledge [10, 11]. Specific software tools have been developed to support these various techniques [12], and increasingly these are now Web-enabled [13].
However, the process of explicit knowledge acquisition from human experts remains a costly and resource-intensive exercise. Hence the increasing interest in methods that can (semi-)automatically elicit and acquire knowledge that is implicit or else distributed on the Web [14]. A variety of information extraction tools and methods is being applied to the huge body of textual documents now available [15]. Another style of automated acquisition consists of systems that observe user behavior and infer knowledge from it. Examples include recommender systems that look at the papers downloaded by a researcher and then detect themes by analyzing the papers using methods such as term frequency analysis [16]. The recommender system then searches other literature sources and suggests papers that might be relevant or of interest to the user.
Methods that can engage in this sort of background knowledge acquisition are still in their infancy, but with the proven success of pattern-directed methods in areas such as data mining, they are likely to assume a greater prominence in our attempts to overcome the knowledge acquisition blockage.
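The term frequency analysis used by such recommender systems can be sketched in a few lines. The class and method names below are illustrative only (this is not the implementation of any system cited above): each term's count in a document is divided by the total number of tokens, giving a profile that can be compared across papers to detect shared themes.

```java
import java.util.HashMap;
import java.util.Map;

/** Minimal term-frequency analysis, of the kind used by paper-recommender systems. */
public class TermFrequency {

    /** Returns each term's relative frequency in the given text. */
    public static Map<String, Double> termFrequencies(String text) {
        String[] tokens = text.toLowerCase().split("\\W+");
        Map<String, Integer> counts = new HashMap<>();
        int total = 0;
        for (String token : tokens) {
            if (token.isEmpty()) continue;   // skip artifacts of splitting
            counts.merge(token, 1, Integer::sum);
            total++;
        }
        Map<String, Double> tf = new HashMap<>();
        final int n = total;
        counts.forEach((term, count) -> tf.put(term, (double) count / n));
        return tf;
    }
}
```

A real recommender would compute such profiles for many documents and rank candidate papers by similarity to the researcher's profile, typically after removing stop words and weighting rare terms more heavily (e.g. TF-IDF).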
3
Research trends – e-Science in a transparency
The fascinating concept of e-Science illustrates the changes that information technology is bringing to the methodology of scientific research [17]. e-Science is a relatively new expression that became widely accepted after the launch of the major United Kingdom initiative [18]. It describes a new approach to science involving distributed global collaborations enabled by the Internet and using very large data collections, terascale computing resources and high-performance visualization. e-Science is about global collaboration in key areas of science, and the next generation of infrastructure, namely the Grid, that will enable it. Figure 2 summarizes the e-Scientific method.
Fig. 2. Computational science and information technology merge in e-Science
In the simplest terms, the last decade can be characterized as the decade of simulation and its integration with science and engineering – this is computational science. e-Science builds on this by adding data from all sources, together with the information technology needed to analyze the data and incorporate it into the simulations. Over the last fifty years, scientific practice has evolved to reflect the growing power of communication and the importance of collective wisdom in scientific discovery. Originally scientists collaborated by sailing ship and carrier pigeon; today, aircraft, telephone, e-mail and the Web have greatly enhanced communication and therefore the quality and real-time nature of scientific collaboration. Such cooperation can be both face-to-face and electronically enabled; see [19, 20] for early influential work on scientific collaboration.
e-Science, and hence the Grid, is the infrastructure that enables collaborative science. The Grid can provide the basic building blocks to support real-time distance interaction, which has been exploited in distance education. Particularly important is the infrastructure to support shared resources – this includes many key services such as security, scheduling and management, registration and search services, and the message-based interfaces of Web services that allow powerful sharing (collaboration) mechanisms. All of the basic Grid services and infrastructure provide a critical venue for collaboration and will be highly important to the community.
From the Philippine perspective, researchers have created what they describe as the first generic system for Grid computing that utilizes an industry-standard Web service infrastructure.
The system, called Bayanihan Computing .NET [21], is a generic Grid computing framework based on Microsoft .NET that uses Web services to harness computing resources through “volunteer” computing (similar to projects such as SETI@Home [22]), and to make those resources easily accessible through easy-to-use and interoperable computational Web services. As mentioned in the preceding section, the ASTI agency from the Philippines is managing a project called the Philippine e-Science Grid Program (PSiGrid).
This emerging computing model provides the ability to perform high-throughput computing by taking advantage of many networked computers, modelling a virtual computer architecture that can distribute process execution across a parallel infrastructure. The establishment of PSiGrid is expected to foster collaboration among local research groups as they share computing resources to further their research efforts, and to enable efficient utilization of local computing resources.
4
e-Science Practical Model Application
With the global advancement of technology, new advances in networking and computing have produced an explosive growth in networked applications and information services. Applications are becoming more complex, heterogeneous and dynamic. The recently concluded forum on a national e-Science development strategy, held on August 24 at the Westin Chosun Seoul under the supervision of KISTI and under the joint auspices of the Ministry of Science and Technology (MOST) and the Korea e-Science Forum, reported on the significance of R&D activity moving to an e-Science system. A national e-Science effort is becoming more and more important because of: a new research method that makes huge applications and research possible in limited environments; improved research productivity, enabling the use of research resources at remote places and collaboration between researchers; an educational dimension, enabling the use of diverse learning equipment in a networked studying environment; and, finally, a new growth engine for economic development through cutting-edge technology innovation [23].
One major impact has been on the medical field, for instance a reduction in the period needed for drug development. Others include enabling global research projects in fields such as aerospace development, nuclear fusion research, and tsunami and SARS prevention, and boosting national science and technology competitiveness by developing a new methodology model in which IT and science technologies converge, by securing the core technology of convergence research, and by cooperation and collaboration among nations, regions and fields, so that researchers can have access to cutting-edge equipment, data and research manpower. By means of cutting-edge technology innovation, national e-Science can serve as a new growth engine of economic development, with a potentially enormous economic ripple effect.
Aside from R&D applications, e-Science has also proven its importance through its introduction to the classroom. In the UK, a pilot project has begun to explore the potential benefits of collecting and sharing scientific data within and across schools, and of closer collaborations between schools and research scientists, with a view to running a national project involving multiple schools [24]. This pilot has begun to reveal the educational potential of teachers and students collaborating to input, manipulate and share their collected data using Grid-like technologies. A larger-scale project would have the potential to feed school-sampled local pollution data into a more significant Grid-based data set which scientists could use to build up a picture of pollution levels across the country.
Another major contribution from the UK, the first country to develop a national e-Science Grid, was in the diagnosis and treatment of breast cancer: in one of the pilot projects, a digital mammographic archive was developed together with an intelligent medical decision support system, which an individual hospital without supercomputing facilities could access through the Grid. These projects are called e-DiaMoND and Integrative Biology [25].
In Australia, the world's first degrees in e-Science have been introduced. Two Australian computer science departments, at the Australian National University (ANU) and RMIT, worked together and established a program called the “Science Lectureships Initiative”, designed to foster linkages between academia and industry with the idea of attracting students into science-related areas which would then benefit emerging industries [26]. At RMIT, the eScience Graduate Diploma started with only 10 students in the first year, but thereafter struggled to gain enough extra students to become self-sustaining as a separate program, while at ANU there was a large influx of overseas students, particularly from the Indian subcontinent and East Asia. These initiatives can provide guidance and encourage other universities to set up similar education programs.
5
Definition and Relevance of e-Science in the Philippine Perspective
As noted in the many studies that have been carried out by different authors in all parts of the world, e-Science has its own role, function, and relevance in modern society. Many developed countries have gone far in this field; in the Philippines, however, it is just at the introduction phase. Thus, e-Science could be defined as a solution that can help the Philippines, through international collaboration, improve its technological innovation in research and discovery within an applied technological approach. This paper serves as a driving force in addressing three of the most important applications of ICT: education, health and governance. With its direct connectivity to a number of international research and
education networks, such as the Asia Pacific Advanced Network (APAN) and the Trans-Eurasia Information Network 2 (TEIN2), this will benefit researchers in the academic sector who collaborate in the global research community.
6
Research Life Cycle Model in the Philippines
As shown in Fig. 3, PREGINET² will be the network backbone supporting the key players in the whole system flow of scientific research arising from the academe and the government's research and development institutes. In this research life cycle model, it serves as the highway carrying the applications, which, in this research, means e-Science. As the model shows, e-Science is the heart of these important areas of research, becoming the central application for researchers from the academe and other R&D institutions, the e-Library and distance learning. This platform will allow linkages among its partners in the network, locally and globally.
Fig. 3. Development Model and Work Processes
[Fig. 3 depicts PREGINET linking the e-Science components: Researches from the Academe, e-Science Researches from DOST R&D institutions, Distance Learning, and the E-Library with its Data Storage.]
And this will be linked to a central depository managed and controlled by a policy-making body or technical working group. As the development model and work processes show, the Department of Science and Technology (DOST) has provided funding, under its Grants-in-Aid (GIA) Program, to implement the Philippine e-Science Grid (PSiGrid) Program. The three-year program (2008-2011), which aims to establish a grid infrastructure in the Philippines to improve research collaboration among educational and research institutions, both locally and abroad, covers three (3) projects, namely: (1) Boosting Grid Computing Using Reconfigurable Hardware Technology (Grid Infrastructure); (2) Developing a Federated Geospatial Information System (FedGIS) for Hazard Mapping and Assessment; and (3) Boosting Social and Technological Capabilities for Bioinformatics Research (PBS). The Program will be implemented by the Advanced Science and Technology Institute (ASTI), an attached institute of the DOST focusing on R&D in ICT and Microelectronics, as stated in the previous section.
Moreover, the four (4) components of e-Science shown in Figure 3 emphasize the importance of the following elements in the Philippine e-Science perspective. First, Researches from the Academe stresses the collaboration between the academic and R&D institutions or sectors that may be functional within the e-Science framework. Secondly, in association with the first component, Researches from the R&D Sectors accentuates the linking of R&D to the e-Science work processes. Thirdly, given the PREGINET infrastructure, Distance Learning is important in the collaborative e-Science framework because the framework itself can be a special tool for delivering interactive and real-time education. Lastly, the e-Library takes advantage of the potential of distributed computing.
² PREGINET is a nationwide broadband research and education network that interconnects academic, research and government institutions. It is the first government-led initiative to establish a National Research and Education Network (NREN) in the country. PREGINET utilizes existing infrastructure of the Telecommunications Office (TELOF) of the Department of Transportation and Communications (DOTC) as redundant links.
7
Effect of the Research Life Cycle Utilizing Grid Technology – e-Science in the Philippines
Understanding the full research life cycle allows us to identify gaps in services, technologies and partnerships, and thereby work towards the eventual utilization of Grid technology in an e-Science framework. There is also a need to understand the process of collaboration in e-Science in order to fully and accurately define requirements for next-generation Access Grids [27]. The emergence of e-Science systems also raises challenging issues concerning the design and usability of representations of information, knowledge or expertise across the variety of potential users that could lead to a scientific discovery [28]. Discussion of e-Science frequently focuses on enormous hardware, user interfaces, storage capacity and other technical issues; in the end, however, the capability of e-Science to serve the needs of scientific research teams boils down to people: the ability of the builders of the infrastructure to communicate with its users and understand their needs and the realities of their work cultures [29]. The builders and implementers of e-Science infrastructure need to focus more on fostering the infrastructure than on merely building it. There are social features to research that must be recognized, from understanding how research teams work and interact to
realizing that research often does not involve the kinds of large, interdisciplinary projects engaged in by virtual organizations, but rather individual work and unplanned or ad-hoc, flexible forms of collaboration within wider communities. The Grid is transforming science and business; in effect, e-Science research, business and commerce will significantly benefit from Grid-based technologies, which can potentially increase capabilities, efficiency and effectiveness through leading-edge technology applications and the solution of large scientific and business computing problems. On the socio-economic side, this will demand investigation to address issues such as ethics, privacy, liability, risk and responsibility for future public policies. In addition, for the envisaged new forms of business models, economic and legal issues are also at stake, which will require interdisciplinary research.
In the long run, the lasting effects of the high-speed networks, data stores, computing systems, sensor networks, and collaborative technologies that make e-Science possible will be up to the people who create and use them. For e-Science projects like the PSiGrid program in the Philippines, the majority (if not all) of the funding is from government sources. For this cooperation to be sustainable, however, especially in commercial or government settings, participants need to have an economic incentive. Thus, as stated in the preceding sections, PSiGrid aims to establish a grid infrastructure in the Philippines that will be needed to maximize and improve the potential of research collaboration among educational and research institutions, both locally and abroad. With this start, and with the promising vision that e-Science holds, there is a great chance for the PSiGrid program to participate on the global stage and come up with technologies that would be beneficial to its citizens.
8
Conclusion
To conclude, given the above viewpoints on life cycle and e-Science models, there have been important changes in how technology, especially in scientific research, can be successfully managed. The trend in technology is towards increasingly global collaborations for scientific research. Every country that began implementing its vision for e-Science can be seen to have had its own strategy for facing the challenges, not only technical issues such as dependability, interoperability and resource management, but also people-centric issues relating to collaboration and the sharing of resources and data. For example, the United Kingdom (UK) established nine e-Science centers and eight other regional centers covering
most of the UK, which primarily aimed to allocate substantial computing and data resources and run a standard set of Grid middleware to form the basis for the construction of a UK Grid testbed, to generate a portfolio of industrial Grid middleware and tools, and lastly to disseminate information and experience of the Grid [30]. The ideas presented in this paper on both the e-Science models and the life cycle approach should give insights, directions and encouragement to policy makers, along with a valuable contribution to serving the Filipino people, especially the scientists and researchers working towards technological breakthroughs. As further work, the life cycle should be evaluated against both global trends and the Philippine perspective, and e-Science tools must become more intuitive so that communities such as the biomedical community can use them in collaborative R&D.
9
References
[1] The 2007 Microsoft e-Science Workshop at RENCI, https://www.mses07.net/main.aspx
[2] Conceptualizing the Digital Life Cycle, http://iassistblog.org/?p=26
[3] Sustainable Technologies Research Group, http://www.dlsu.edu.ph/research/centers/cesdr/strg.asp
[4] Boonstra, O.; Breure, L.; Doorn, P.; Past, Present and Future of Historical Information Science, Netherlands Institute for Scientific Information, Royal Netherlands Academy of Arts and Sciences, 2004
[5] Berman, F.; Hey, A. J. G.; Fox, G. C.; Grid Computing – Making the Global Infrastructure a Reality, Wiley Series in Communications Networking & Distributed Systems, 2003
[6] http://www.psigrid.gov.ph/index.php
[7] Humphrey, Charles; e-Science and the Life Cycle of Research, IASSIST Communiqué, 2006
[8] Hayes-Roth, F.; Waterman, D. A.; Lenat, D. B.; Building Expert Systems, Reading, MA: Addison-Wesley, 1983
[9] Shadbolt, N. R.; Burton, M., Knowledge elicitation: A systematic approach, in Wilson, J. R.; Corlett, E. N. (eds), Evaluation of Human Work: A Practical Ergonomics Methodology, London, UK: Taylor & Francis, 1995
[10] Hoffman, R.; Shadbolt, N. R.; Burton, A. M.; Klein, G., Eliciting knowledge from experts: A methodological analysis, Organizational Behavior and Human Decision Processes, 1995
[11] Shadbolt, N. R.; O'Hara, K.; Crow, L., The experimental evaluation of knowledge acquisition techniques and methods: history, problems and new directions, International Journal of Human Computer Studies, 1999
[12] Milton, N.; Shadbolt, N.; Cottam, H.; Hammersley, M., Towards a knowledge technology for knowledge management, International Journal of Human Computer Studies, 1999
[13] Shaw, M. L. G.; Gaines, B. R., WebGrid-II: Developing hierarchical knowledge structures from flat grids, Proceedings of the 11th Knowledge Acquisition Workshop (KAW '98), Banff, Canada, April 1998, http://repgrid.com/reports/KBS/WG/
[14] Crow, L.; Shadbolt, N. R., Extracting focused knowledge from the semantic web, International Journal of Human Computer Studies, 2001
[15] Ciravegna, F., Adaptive information extraction from text by rule induction and generalization, Proceedings of the 17th International Joint Conference on Artificial Intelligence (IJCAI 2001), Seattle, 2001
[16] Middleton, S. E.; De Roure, D.; Shadbolt, N. R., Capturing knowledge of user preferences: Ontologies in recommender systems, Proceedings of the First International Conference on Knowledge Capture (K-CAP 2001), New York: ACM Press, 2001
[17] Fox, G., e-Science meets computational science and information technology, Computing in Science and Engineering, 2002, http://www.computer.org/cise/cs2002/c4toc.htm
[18] Taylor, J. M., e-Science, http://www.escience.clrc.ac.uk and http://www.escience-grid.org.uk/
[19] Wulf, W., The National Collaboratory – A White Paper, in Towards a National Collaboratory, unpublished report of an NSF workshop, Rockefeller University, New York, 1989
[20] Kouzes, R. T.; Myers, J. D.; Wulf, W. A., Collaboratories: Doing science on the Internet, IEEE Computer, August 1996; IEEE Fifth Workshops on Enabling Technologies: Infrastructure for Collaborative Enterprises (WET ICE '96), 1996, http://www.emsl.pnl.gov:2080/docs/collab/presentations/papers/IEEECollaboratories.html
[21] Sarmenta, L. F. G., Bayanihan Computing .NET: Grid Computing with XML Web Services, 2002, http://bayanihancomputing.net/
[22] Search for Extraterrestrial Intelligence, http://setiathome.berkeley.edu/sah_about.php
[23] Survival Strategy for Securing International Competitiveness, Korea IT Times, 2007
[24] Tallyn, E., et al., Introducing e-Science to the Classroom, http://www.equator.ac.uk/var/uploads/Ella2004.pdf
[25] Iversen, Prof., Oxford University e-Science Open Day, 2004
[26] Gardner, H., et al., eScience curricula at two Australian universities, Australian National University and RMIT, Melbourne, Australia, 2004
[27] De Roure, D., et al., A Future e-Science Infrastructure, 2001
[28] Usability Research Challenges in e-Science, UK e-Science Usability Task Force
[29] Voss, A., Features: e-Science – It's Really About People, HPCWire – High Productivity Computing website, RENCI, 2007
[30] Hey, T.; Trefethen, A. E., The UK e-Science Core Programme and the Grid, ICCS, Springer-Verlag Berlin Heidelberg, 2002
An Agent-based Service Discovery Algorithm Using Agent Directors for Grid Computing
Leila Khatibzadeh¹, Hossein Deldari²
¹ Computer Department, Azad University, Mashhad, Iran
² Computer Department, Ferdowsi University, Mashhad, Iran
Abstract - Grid computing has emerged as a viable method of solving computational and data-intensive problems in various domains, from business computing to scientific research. However, grid environments are largely heterogeneous, distributed and dynamic, all of which increases the complexity involved in developing grid applications. Several software systems have been developed to provide programming environments that hide these complexities and simplify grid application development. Since agent technologies have been studied for more than ten years, and because of the flexibility and complexity of the grid software infrastructure, multi-agent systems are one way to overcome the challenges in grid development. In this paper, we consider the needs of programs running on the grid and present a three-layer agent-based parallel programming model for grid computing. This model is based on interactions among agents, and we have also implemented a service discovery algorithm for the application layer. To support agent-based programs we have extended the GridSim toolkit and implemented our model in it.
Keywords: Agents, Grid, Java, Parallel Programming, Service Discovery Algorithm
1
Introduction
Grid applications are the next-generation network applications for solving the world's computational and data-intensive problems. They support the integrated and secure use of a variety of shared and distributed resources, such as high-performance computers, workstations, data repositories and instruments. The heterogeneous and dynamic nature of the grid requires its applications to achieve high performance while remaining robust and fault-tolerant [1]. Grid applications run on different types of resources whose configurations may change during run-time. These dynamic configurations could be motivated by changes in the environment, e.g., performance changes, hardware failures, or the need to flexibly compose virtual organizations from available grid resources [2]. Grids are also used for large-scale, high-performance computing. High performance requires a balance of computation and communication among all resources involved. Currently this
is achieved through managing computation, communication and data locality using message passing or remote method invocation [3]. Designing and implementing applications that possess such features from the ground up is often difficult, so several programming models have been presented. In this paper, we propose a new programming model for the Grid.
Numerous research projects have already introduced class libraries or language extensions for Java to enable parallel and distributed high-level programming. There are many advantages to using Java for Grid computing, including portability, easy deployment of Java bytecode, the component architecture provided through JavaBeans, and a wide variety of class libraries that include additional functionality such as secure socket communication or complex message passing. Because of these, Java has been chosen for this model.
As agent technology promises to be of great help in pervasive computing, and since multi-agent systems in such environments pose important challenges, the use of agents has become a necessity. In agent-based software engineering, programs are written as software agents communicating with each other by exchanging messages through a communication language [4]. How the agents communicate with each other differs in each method. In this paper, a three-layer agent-based model based on interactions between the agents is presented, and a service discovery algorithm is implemented in this model.
The rest of the paper is organized as follows. Section 2 gives a brief review of related work on programming models for the Grid and a review of service discovery algorithms. Section 3 presents the proposed three-layer agent-based parallel programming model in detail and shows where the service discovery algorithm fits in the model. Simulation results are presented in Section 4. Finally, the paper is concluded in Section 5.
2
Related works
There are different kinds of programming models, each of which has been implemented in a different environment. These programming models are briefly summarized below.
Superscalar is a common concept in parallel computing [5]. Sequential applications composed of tasks of a certain granularity are automatically converted into a parallel application, and the tasks are executed in different servers of a computational Grid.
MPICH-G2 is a Grid-enabled implementation of the Message Passing Interface (MPI) [6]. MPI defines standard functions for communication between processes and groups of processes. Using the Globus Toolkit, MPICH-G2 provides extensions to MPICH. This gives users familiar with MPI an easy way of Grid-enabling their MPI applications. The following services are provided by the MPICH-G2 system: co-allocation, security, executable staging and results collection, communication and monitoring [5].
Grid-enabled RPC is a Remote Procedure Call (RPC) model and an API for Grids [5]. Besides providing standard RPC semantics, it offers a convenient, high-level abstraction whereby many interactions with a Grid environment can be hidden. GridRPC seeks to combine the standard RPC programming model with asynchronous coarse-grained parallel tasking.
Gridbus Broker is a software resource that transparently permits users to access heterogeneous Grid resources [5]. The Gridbus Broker Application Program Interface (API) provides users a straightforward means to Grid-enable their applications with minimal extra programming. Implemented in Java, the Gridbus Broker provides a variety of services including resource discovery, transparent access to computational resources, job scheduling and job monitoring. The Gridbus Broker transforms user requirements into a set of jobs that are scheduled on the appropriate resources; it manages them and collects the results.
ProActive is a Java-based library that provides an API for the creation, execution and management of distributed active objects [7]. ProActive is composed of only standard Java classes and requires no changes to the Java Virtual Machine (JVM).
This allows Grid applications to be developed using standard Java code. In addition, ProActive features group communication, object-oriented Single Program Multiple Data (OO SPMD), distributed and hierarchical components, security, fault tolerance, a peer-to-peer infrastructure, a graphical user interface and a powerful XML-based deployment model.
Alchemi is a Microsoft .NET Grid computing framework consisting of service-oriented middleware and an application program interface (API) [8, 9]. Alchemi features a simple and familiar multithreaded programming model; it is based on the master-worker parallel programming paradigm and implements the concept of Grid threads.
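The master-worker paradigm underlying Alchemi's Grid threads can be sketched with ordinary Java threads. This is only an illustration of the paradigm, not Alchemi's actual API (Alchemi is .NET-based): a master submits independent work units to a pool of workers and then gathers the results.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Master-worker sketch: the master submits independent work units
 *  (analogous to "grid threads") and collects their results in order. */
public class MasterWorker {

    /** Squares each input on a pool of workers and returns the results in submission order. */
    public static List<Integer> runJobs(List<Integer> inputs, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (int value : inputs) {
                // each work unit is independent, so the pool may run them in parallel
                futures.add(pool.submit((Callable<Integer>) () -> value * value));
            }
            List<Integer> results = new ArrayList<>();
            for (Future<Integer> f : futures) {
                try {
                    results.add(f.get());   // master blocks until each result arrives
                } catch (InterruptedException | ExecutionException e) {
                    throw new RuntimeException(e);
                }
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

In a real grid framework, the "pool" would be a set of remote executors and the work units would be serialized and shipped across the network, but the submit-then-gather control flow is the same.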
Grid Thread Programming Environment (GTPE) is a programming environment, implemented in Java, utilizing the Gridbus Broker API [1]. GTPE further abstracts the task of Grid application development and automates Grid management while providing a finer level of logical program control through the use of distributed threads. GTPE is architected with the following primary design objectives: usability and portability, flexibility, performance, fault tolerance and security. It provides additional functionality to minimize the effort necessary to work with grid threads [1].
Open Grid Services Architecture (OGSA) is an ongoing project that aims to enable interoperability among heterogeneous resources by aligning Grid technologies with established Web services technology [5]. The concept of a Grid service is likened to a Web service that provides a set of well-defined interfaces following specific conventions. These Grid services can be composed into more sophisticated services to meet the needs of users. The OGSA is an architecture specification defining the semantics and mechanisms governing the creation, access, use, maintenance and destruction of Grid services. The following specifications are provided: Grid service instances, upgradability and communication, service discovery, notification, service lifetime management and higher-level capabilities.
Dynamic service discovery is not a new issue; several solutions have been proposed for fixed networks, all with different levels of acceptance [10]. We now briefly review some of them: SLP, Jini, Salutation and UPnP's SSDP.
The Service Location Protocol (SLP) is an Internet Engineering Task Force standard for enabling IP network-based applications to automatically discover the location of a required service [11].
The SLP defines three "agents": User Agents (UA), which perform service discovery on behalf of client software; Service Agents (SA), which advertise the location and attributes of services on their behalf; and Directory Agents (DA), which store information about the services announced in the network. SLP has two modes of operation. When a DA is present, it collects all service information advertised by SAs, and the UAs unicast their requests to the DA. When there is no DA, the UAs repeatedly multicast their requests; SAs listen for these multicast requests and unicast their responses to the UAs [10].

Jini is a technology developed by Sun Microsystems [12]. Its goal is to enable truly distributed computing by representing hardware and software as Java objects that can adapt themselves to communities and allow objects to access services on a network in a flexible way. Similar to the Directory Agent in SLP, service discovery in Jini is based on a directory service, named the Jini Lookup Service (JLS). Since Jini relies on clients always being able to discover services, the JLS is essential to its operation.
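The two SLP modes of operation described above can be sketched as a toy model. This is illustrative only; the class and method names are our own and not part of any SLP library, and real SLP messages carry far more detail.

```java
import java.util.*;

// Toy model of SLP's two modes: unicast to a Directory Agent when one
// is present, otherwise multicast to Service Agents (names are ours).
class SlpModel {
    // Directory Agent: stores service advertisements from SAs.
    static class DirectoryAgent {
        final Map<String, String> services = new HashMap<>(); // type -> location
        void register(String type, String location) { services.put(type, location); }
        String lookup(String type) { return services.get(type); }
    }

    // Service Agent: advertises one service on behalf of a provider.
    static class ServiceAgent {
        final String type, location;
        ServiceAgent(String type, String location) { this.type = type; this.location = location; }
    }

    // User Agent behavior: unicast the request to the DA when present;
    // otherwise multicast, and the matching SA unicasts its reply.
    static String discover(String type, DirectoryAgent da, List<ServiceAgent> sas) {
        if (da != null) return da.lookup(type);           // unicast request to DA
        for (ServiceAgent sa : sas)                       // multicast request
            if (sa.type.equals(type)) return sa.location; // SA unicasts its reply
        return null;
    }

    public static void main(String[] args) {
        List<ServiceAgent> sas = List.of(new ServiceAgent("printer", "host-a"));
        DirectoryAgent da = new DirectoryAgent();
        da.register("printer", "host-a");
        System.out.println(discover("printer", da, sas));   // DA present: unicast path
        System.out.println(discover("printer", null, sas)); // no DA: multicast path
    }
}
```

Both paths resolve the same service; the difference, as in SLP itself, is how many nodes must see each request.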
Salutation is an architecture for searching, discovering, and accessing services and information [13]. Its goal is to solve the problems of service discovery and utilization among a broad set of applications and equipment, in an environment of widespread connectivity and mobility. The Salutation architecture defines an entity called the Salutation Manager (SLM), which functions as a directory of applications, services and devices, generically called Networked Entities. The SLM allows networked entities to discover and use the capabilities of other networked entities [10].

Simple Service Discovery Protocol (SSDP) was created as a lightweight discovery protocol for the Universal Plug-and-Play (UPnP) initiative [14]. It defines a minimal protocol for multicast-based discovery. SSDP can work with or without a central directory service, called the Service Directory. When a service intends to join the network, it first sends an announcement message to notify the other devices of its presence. This announcement may be sent by multicast, so that all other devices see it and the Service Directory, if present, records it; alternatively, it may be sent by unicast directly to the Service Directory. When a client wishes to discover a service, it may either ask the Service Directory or send a multicast request [10].

In this paper, according to the models presented, we have integrated the benefits of message passing through explicit communication. In addition, we have used a Java-based model, because its portability is similar to that of the distributed-object method. Agents that act like distributed threads were also utilized, and the service discovery algorithm was implemented as the third layer of this model.

3 The three-layer agent-based parallel programming model

In this section, a three-layer agent-based parallel programming model for Grid is presented, with the communications among agents classified into three layers:

• The lower level is the transfer layer, which defines the way messages are passed among agents. In this layer, the sending and receiving of messages is based on the UDP protocol; a port number is assigned to each agent, which in turn sends and receives messages.

• The middle level is the communication layer, which defines the manner of communicating among agents. There are different communication methods. Direct communication, such as contract-net and specification sharing, has some disadvantages [4]. One problem with this form of communication is its cost: if the agent community is as large as a Grid, the overhead of broadcasting messages among agents is quite high. Another problem is the complexity of implementation. Therefore, an indirect communication method using Agent Directors (ADs) has been considered. An AD is a manager agent which directs communication among its own agents, as well as communication between these agents and other agents through other ADs. Figure 1 shows the scheme of this communication; message passing and the transfer rate are reduced in this model. Depending on the agent's behavior (requester/replier), two AMS (Agent Management System) instances are considered in each AD. Each AMS holds information about other agents and their ADs, which directs a request to the desired AD as necessary. With the two AMS instances, the time spent searching in the agent platform decreases.

Based on the AD's needs and the type of request, the AD's table of information on agents in other ADs is updated in order to produce a proper decision. This update reduces the number of transactions among ADs.
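The AD-based indirect communication described above can be sketched as follows. This is a minimal illustration under our own assumptions (class names and the message counter are ours, not from the paper's implementation): a requesting agent contacts only its own AD, which serves the request locally when it can, and otherwise forwards it to the one AD known to own the service, instead of broadcasting to every agent.

```java
import java.util.*;

// Sketch of indirect communication through Agent Directors (ADs).
// Names are illustrative; messages are reduced to a simple counter.
class AdRouting {
    static int messages = 0; // number of messages passed so far

    static class AgentDirector {
        final String name;
        final Map<String, String> repliers = new HashMap<>();          // service -> local replier agent
        final Map<String, AgentDirector> directory = new HashMap<>();  // service -> remote AD that owns it
        AgentDirector(String name) { this.name = name; }

        // An agent asks its own AD for a service.
        String request(String service) {
            messages++;                              // agent -> its AD
            String local = repliers.get(service);
            if (local != null) return local;         // served within this platform
            AgentDirector remote = directory.get(service);
            if (remote == null) return null;         // unknown service
            messages++;                              // AD -> remote AD
            return remote.repliers.get(service);
        }
    }

    public static void main(String[] args) {
        AgentDirector ad1 = new AgentDirector("AD1");
        AgentDirector ad2 = new AgentDirector("AD2");
        ad2.repliers.put("storage", "agent-7");
        ad1.directory.put("storage", ad2);           // AD1's table points at AD2
        System.out.println(ad1.request("storage"));  // resolved via one AD-to-AD hop
        System.out.println(messages);                // 2 messages, vs. a broadcast to all agents
    }
}
```

The point of the sketch is the scaling behavior: resolving a request costs a constant number of messages here, whereas broadcasting grows with the size of the agent community.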
Figure 1. The scheme of communication among agents in the presented model
The following parameters are stored in the AMS table for each agent: • AID: a unique name for each agent, composed of an ID and the name of the machine where the agent was created.
• State: shows the state of the agent. Three states are possible in our model:
I. Active State: the agent is fully initiated and ready to use.
II. Waiting State: the agent is blocked, waiting for a reply from other agents.
III. Dead State: the agent has completed its job. In this state, information about the agent is removed from the AMS.
• Address: shows the location of an agent. If an agent has migrated to another machine, this is announced to the AD.
• AD name: shows the name of the AD associated with the agent. This parameter is essential, as communication among agents is performed through ADs.
• Message: shows the message sent from an agent to the AD. Depending on the agent's behavior, different kinds of messages are generated; the content of the message reflects the interaction between the AD and the agent. Based on FIPA [15], we have adopted ACL (Agent Communication Language) with some revisions. The message types are as follows. If the agent acts as a requester, the message content, based on the program running on the Grid, describes the request, e.g. a required service in the service discovery algorithm. When the agent is created, the message content includes the agent's specifications for registration in the AMS. If the agent acts as a replier, the message content, based on the program running on the Grid, describes the reply to that request. When the job related to the agent is completed, the message content tells the AD to remove this record from the AMS. If an error occurs while a program is running, the message content informs the AD.

The service discovery algorithm implemented is based on SLP and Jini. We have considered three kinds of agents, similar to those of SLP:
1- User Agents (UA), which run service discovery on behalf of client software.
2- Service Agents (SA), which announce location and attributes on behalf of services. In the implemented algorithm, each SA may own different kinds of services; the creation and termination of services is dynamic, and the list of services each SA owns changes dynamically. The role of SAs in our model is similar to that of process groups in MPI.
3- Directory Agents (DA), which store information about the services announced in the network, similar to the JLS in Jini. In this model, each AD contains one DA. Because the structure of ADs includes histories of the processes that each agent performs, this model is effective for algorithms such as service discovery. With two different histories implemented for each AD, searching for services in the Grid is facilitated.
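The AMS table parameters listed above (AID, State, Address, AD name, Message) can be rendered as a small data structure. This is a sketch under our own naming, not the paper's actual implementation; the only behavior taken from the text is that a record in the Dead state is removed from the AMS.

```java
import java.util.*;

// Sketch of an AMS (Agent Management System) table entry with the
// parameters described above; field and class names are our own.
class AmsTable {
    enum State { ACTIVE, WAITING, DEAD }

    static class Entry {
        String aid;      // unique name: an ID plus the creating machine's name
        State state;     // Active, Waiting, or Dead
        String address;  // current location, updated when the agent migrates
        String adName;   // the AD that mediates this agent's communication
        String message;  // last ACL-style message content
        Entry(String aid, String adName, String address) {
            this.aid = aid; this.adName = adName; this.address = address;
            this.state = State.ACTIVE; // agents register as Active on creation
        }
    }

    final Map<String, Entry> table = new HashMap<>();

    void register(Entry e) { table.put(e.aid, e); }

    // Per the model, a Dead agent's record is removed from the AMS.
    void setState(String aid, State s) {
        Entry e = table.get(aid);
        if (e == null) return;
        if (s == State.DEAD) table.remove(aid); else e.state = s;
    }

    public static void main(String[] args) {
        AmsTable ams = new AmsTable();
        ams.register(new Entry("node1#3", "AD1", "node1"));
        ams.setState("node1#3", State.WAITING);
        System.out.println(ams.table.get("node1#3").state);   // WAITING
        ams.setState("node1#3", State.DEAD);
        System.out.println(ams.table.containsKey("node1#3")); // false: record removed
    }
}
```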
• The higher level is the user application layer, which defines the application running through the software agents. The application used to test this model is a service discovery algorithm.

In this model, the AD decides whether to send a request to its own agents or to agents on other platforms. This decision is made according to the information stored in the replier AMS and the condition of the agents. We have also considered a history table which keeps information about agents that have completed their jobs; the jobs done through an AD are reported to the user.

4 Simulation Results

This model has been implemented in Java using the GridSim toolkit [16]. We extended the GridSim toolkit to support agent-based programs, adding a gridsim.agent package, and implemented three different models, against which the three-layer agent-based parallel programming model presented in this paper has been compared through a service discovery algorithm. The first model involves message passing among the nodes that run the service discovery algorithm. The second, called Blackboard, is a simple model for communication among agents [4]: information is made available to all agents in the system through an agent named "Blackboard", which acts like a central server. The third model is our own and was fully explained in Section 3.

In order to evaluate our model, we measured the number of messages sent by agents until all user agents had obtained their services. Three different methods were considered for this algorithm:
1- Message passing: there is no difference between the types of agents; all agents act like nodes in the message passing method.
2- Blackboard: the Blackboard agent acts like a Directory Agent.
3- Agent Director: each AD acts like a DA.
Figure 2. The comparison of the average number of messages received by agents

In Figure 2, the vertical axis represents the average number of messages and the horizontal axis the number of agents. In the first method, the number of agents is the number of nodes; in the other methods, it is the sum of the three types of agents explained earlier. The results show that the average number of messages increases as the number of agents increases. This is to be expected: as the number of agents grows, so does the communication among them.

We estimate the cost of each method as follows:
- In message passing, the cost is estimated to be between 10 and 100.
- In Blackboard, the cost between agents and the Blackboard agent is estimated to be between 10 and 50.
- In AD, because there are two types of communication, two different costs were calculated: one for communication between ADs, estimated to be between 50 and 100, and one between an AD and its agents, estimated to be between 10 and 20.
These estimates are derived from each method's actions.

Figure 3. Cost vs. number of agents

It is observed from Figure 3 that if the number of agents (nodes, in message passing) exceeds 40, the cost of communication in the message passing implementation rises suddenly. As the number of nodes increases, the number of messages performing service discovery grows, along with the cost of communication; in other words, for lack of a database storing the specifications of the nodes that own services, the cost of message passing rises suddenly once the number of agents exceeds 40. AD, moreover, performs better than Blackboard.

Figure 4. Time for different service discovery algorithms vs. number of agents

It is obvious from Figure 4 that execution time increases with the number of agents. Due to the agent-based nature of Blackboard and AD, the time in which all user agents reach their services is shorter in those methods.
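The per-message cost ranges quoted above can be summarized numerically. This sketch only computes range midpoints as a rough point of comparison; the midpoint summary is our own simplification, and the paper's actual cost figures come from its simulation runs.

```java
// Midpoints of the per-message cost ranges stated in the text
// (an illustrative summary, not the paper's simulated costs).
class CostModel {
    static double mid(double lo, double hi) { return (lo + hi) / 2.0; }

    public static void main(String[] args) {
        double messagePassing = mid(10, 100); // message passing: 10..100
        double blackboard     = mid(10, 50);  // agent <-> Blackboard: 10..50
        double adToAd         = mid(50, 100); // AD <-> AD: 50..100
        double adToAgent      = mid(10, 20);  // AD <-> its agents: 10..20
        System.out.println(messagePassing + " " + blackboard + " " + adToAd + " " + adToAgent);
        // 55.0 30.0 75.0 15.0
    }
}
```

The midpoints make the qualitative claim visible: most AD traffic uses the cheap AD-to-agent path, and only occasional requests pay the more expensive AD-to-AD cost.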
5 Conclusion and Future Work
In this paper, we have studied different programming models previously presented for the Grid. Java was chosen for this research because of its advantages for Grid computing: among others, portability, easy deployment of bytecode, a component architecture, and a wide variety of class libraries. Agent technology is very effective in pervasive computing, and although the use of multi-agent systems in pervasive environments poses great challenges, it makes agents a natural choice. In agent-based software engineering, programs are written as software agents that communicate by exchanging messages through a communication language; how the agents communicate differs in each method. In this paper, a three-layer agent-based programming model based on interactions among agents has been presented. We have integrated the benefits of message passing through explicit communication and used a Java-based model, because its portability is similar to that of the distributed-object method. In addition, agents acting like distributed threads were utilized, forming a three-layer programming model for the Grid based on agents. We have extended the GridSim toolkit simulator, adding the gridsim.agent package for agent-based parallel programming, and a service discovery algorithm has been implemented as the third layer of the model. In our measurements, parameters such as the number of messages, cost and execution time show better results for our model than for the other methods.
6 References
[1] H. Soh, S. Haque, W. Liao, K. Nadiminti, and R. Buyya, "GTPE: A Thread Programming Environment for the Grid", Proceedings of the 13th International Conference on Advanced Computing and Communications, Coimbatore, India, 2005.
[2] I. Foster, C. Kesselman, and S. Tuecke, "The Anatomy of the Grid: Enabling Scalable Virtual Organizations", Intl. J. Supercomputer Applications, 2001.
[3] D. Talia and C. Lee, "Grid Programming Models: Current Tools, Issues and Directions", in Grid Computing, F. Berman, G. Fox and T. Hey, Eds., pp. 555–578, Wiley Press, USA, 2003.
[4] C. F. Ngolah, "A Tutorial on Agent Communication and Knowledge Sharing", University of Calgary, SENG609.22 Agent-based Software Engineering, 2003.
[5] H. Soh, S. Haque, W. Liao and R. Buyya, "Grid Programming Models and Environments", in Advanced Parallel and Distributed Computing, ISBN 1-60021-202-6.
[6] N. T. Karonis, B. Toonen, and I. Foster, "MPICH-G2: A Grid-Enabled Implementation of the Message Passing Interface", Journal of Parallel and Distributed Computing (JPDC), vol. 63, pp. 551–563, 2002.
[7] ProActive Team, "ProActive Manual REVISED 2.2", ProActive, INRIA, April 2005.
[8] A. Luther, R. Buyya, R. Ranjan, and S. Venugopal, "Alchemi: A .NET-Based Enterprise Grid Computing System", Proceedings of the 6th International Conference on Internet Computing (ICOMP'05), June 27–30, 2005, Las Vegas, USA.
[9] A. Luther, R. Buyya, R. Ranjan, and S. Venugopal, "Peer-to-Peer Grid Computing and a .NET-based Alchemi Framework", in High Performance Computing: Paradigm and Infrastructure, L. Yang and M. Guo, Eds., Wiley Press, 2005.
[10] C. Campo, "Service Discovery in Pervasive Multi-Agent Systems", Workshop on Ubiquitous Agents on Embedded, Wearable, and Mobile Devices (AAMAS 2002), Bologna, Italy, 2002.
[11] IETF Network Working Group, "Service Location Protocol", 1997.
[12] Sun Microsystems, "Jini Architectural Overview", White paper, Technical report, 1999.
[13] Salutation Consortium, "Salutation Architecture Overview", Technical report, 1998.
[14] Y. Y. Goland, T. Cai, P. Leach, and Y. Gu, "Simple Service Discovery Protocol/1.0", Technical report, 1999.
[15] FIPA, http://www.fipa.org/
[16] R. Buyya and M. Murshed, "GridSim: A Toolkit for the Modeling and Simulation of Distributed Resource Management and Scheduling for Grid Computing", Concurrency and Computation: Practice and Experience (CCPE), vol. 14, no. 13–15, pp. 1175–1220, Wiley Press, USA, November–December 2002.
Optimization of Job Super Scheduler Architecture in Computational Grid Environments

M. Shiraz+ and M. A. Ansari*
+ Allama Iqbal Open University, Islamabad
* Federal Urdu University of Arts, Sciences & Technology, Islamabad
+ [email protected], +920339016430
* [email protected], +9203215285504
Abstract - Distributed applications running over a distributed system communicate through inter process communication (IPC) mechanisms, either within one system or between two different systems. The complexities of IPC adversely affect the performance of the system. Load balancing is an important feature of distributed systems. This research work focuses on the optimization of the Superscheduler architecture, a load balancing algorithm designed for sharing workload on a computational grid. It has two perspectives: local scheduling and grid scheduling. Some unnecessary inter process communication has been identified in the local scheduling mechanism of the job Superscheduler architecture; the critical part of this research work is the interaction between the grid scheduler and the autonomous local scheduler. In this paper an optimized Superscheduler architecture with an optimal local scheduling mechanism is proposed. Performance comparisons with the earlier architecture are conducted on workloads in a simulation environment. Several key metrics demonstrate that substantial performance gains in local scheduling can be achieved via the proposed Superscheduling architecture in a distributed computing environment.
Keywords: Inter Process Communication (IPC), Distributed Computing, Grid Scheduler, Local Scheduler, Grid Middleware, Grid Queue, Local Queue, Superscheduler (SSCH).
I. INTRODUCTION
Distributed computing has been defined in a number of different ways [1]. Different types of distributed systems are deployed worldwide, e.g. the Internet, intranets, and mobile computing. Distributed applications run over a distributed system and are the main communicating entities at its application layer, e.g. video conferencing, web applications, e-mail, and chat software. Each application has its own architecture and requires a specific protocol for its implementation. All distributed applications run over middleware and use its services for IPC. Cluster computing [1] and Grid computing [8][10] are two different forms of distributed system. Load balancing is a challenging feature in a distributed computing environment: the objective is to find under-loaded systems and share the processing workload dynamically, so as to efficiently utilize network resources and increase throughput. A job Superscheduler architecture for load balancing in a grid environment has been proposed earlier [9]. This architecture has two schedulers, an autonomous local scheduler and a grid scheduler, and job scheduling accordingly has two perspectives: local scheduling, used to schedule jobs on local hosts, and grid scheduling, used to schedule jobs on remote hosts for sharing workload. This research work optimizes a specific aspect of the Superscheduler architecture: in local scheduling, several components of the Superscheduler are involved, i.e. the Grid Scheduler, Grid Middleware, Grid Queue, Local Scheduler, and Local Queue. Interaction between these components involves inter process communication (IPC), which incurs the complexities of context switching and domain transition [7][3]; a large number of IPC operations therefore adversely affects the performance of the system [2]. Some unnecessary IPC has been identified in the local scheduling of the job Superscheduler architecture, and this research work focuses on that specific context. An optimized architecture with the minimum possible IPC is proposed, and processing a workload in a simulation environment evaluates the performance of both architectures.
Several key metrics demonstrate that substantial performance gains in local scheduling can be achieved via the proposed Superscheduling in a distributed computing environment.

II. RELATED WORK

Different policies are available for job scheduling in a distributed grid environment [4][5][6][11]. The Superscheduler architecture [9] is a load balancing technique for sharing workload in a distributed grid environment. There are three major processes and two data structures in this architecture: the processes are the Grid Middleware (GM), Grid Scheduler (GS) and Local Scheduler (LS); the data structures are the Grid Queue (GQ) and Local Queue (LQ). The architecture is illustrated below.
Fig. I. Distributed Architecture of the Grid Superscheduler [9]

During grid Superscheduling, the interaction between the components of the architecture occurs as follows. A newly arriving job enters the Grid Queue; the Grid Scheduler computes its resource utilization requirements and queries the Local Scheduler, through the Grid Middleware, for the Approximate Waiting Time (AWT) of that job on the local system. The job waits in the Grid Queue before beginning execution on the local system. The Local Scheduler computes the AWT based on the local scheduling policy and the Local Queue status; if the local resources cannot fulfill the requirements of the job, an AWT of infinity is returned. If the AWT is below a threshold value, the job is moved directly from the Grid Queue to the Local Queue without any external network communication. If the AWT is greater than the threshold, or infinite, then one of the three migration policies [9] is invoked to select an appropriate under-loaded remote host and migrate the job. The processor always processes jobs from the local queue (whether on the local system or on a remote system, i.e. a grid partner). Once a job enters the Local Queue, the Local Scheduler (independent of the Grid Scheduler) monitors its execution; the Grid Scheduler has no control over the Local Scheduler.

Analysis of the flow among the different components of the architecture shows some unnecessary inter process communication, and it is expected that minimizing the communication among components may improve performance. This work focuses on performance optimization of the local scheduling policy: in the earlier scheduling process, the bi-directional communication among Grid Scheduler, Grid Middleware, and Local Scheduler is identified as unnecessary for the case of local job processing. An optimized Superscheduler architecture with minimum IPC is proposed in this paper, in which inter process communication is minimized as much as possible so that the Superscheduler algorithm is optimized.

III. PROPOSED ARCHITECTURE
The proposed architecture contains the same components in the same positions; the main change is the sequence of flow in the initial process of job scheduling. In the proposed flow, a newly arrived job enters the Local Queue (instead of the Grid Queue), and the Local Scheduler (instead of the Grid Scheduler) computes the processing requirements of the newly entered job. If the Approximate Waiting Time (AWT) on the local system is less than a threshold φ, the Local Scheduler schedules the job using the local scheduling policy, without involving the Grid Queue, Grid Scheduler, or Grid Middleware at all. None of these components is needed in local scheduling; they are needed only when the local system is overloaded and cannot execute the task as efficiently as another system in the grid environment. In that situation, the Local Scheduler communicates with the Grid Scheduler through the Grid Middleware, sending the processing requirements of the newly arrived job, and the job is moved from the Local Queue to the Grid Queue. The Grid Scheduler then initiates one of the job migration policies [9], as in the earlier architecture, and migrates the job to the best available host in the grid environment. The proposed architecture is shown in the following figure.
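The two flows described above can be contrasted in a small sketch. This is strictly illustrative: the threshold value and the per-hop IPC counts are our own assumptions (the paper does not quantify per-hop IPC), and the components are reduced to message counts.

```java
// Sketch contrasting the two local-scheduling flows; the threshold
// and the IPC counts per interaction are illustrative assumptions.
class SchedulerFlows {
    static final double THRESHOLD = 10.0; // hypothetical AWT threshold (the paper's φ)

    // Earlier architecture: the job enters the Grid Queue; the Grid
    // Scheduler queries the Local Scheduler through the Grid Middleware
    // for the AWT, then moves the job to the Local Queue if AWT < φ.
    static int earlierFlowIpcCount(double awt) {
        int ipc = 0;
        ipc += 2; // GS -> GM -> LS: AWT query
        ipc += 2; // LS -> GM -> GS: AWT reply
        if (awt < THRESHOLD) ipc += 1; // move job GQ -> LQ
        // otherwise a migration policy is invoked (not modeled here)
        return ipc;
    }

    // Optimized architecture: the job enters the Local Queue directly;
    // GS/GM/GQ are involved only when the local system is overloaded.
    static int optimizedFlowIpcCount(double awt) {
        if (awt < THRESHOLD) return 0; // purely local scheduling, no extra IPC
        return 2;                      // LS -> GM -> GS: escalate to the Grid Queue
    }

    public static void main(String[] args) {
        System.out.println(earlierFlowIpcCount(6.25));   // 5
        System.out.println(optimizedFlowIpcCount(6.25)); // 0
    }
}
```

Under these assumptions the locally-schedulable case pays no IPC at all in the optimized flow, which is the effect the simulation results below measure as IPC delay.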
Fig. II. Proposed Architecture.

IV. RESULTS AND DISCUSSION

The workload has been processed in a simulation environment. Table I shows the workload processed through the simulator. It is composed of the following attributes:
1. Job Number: a counter field, starting from 1.
2. Input Time: in seconds; the time at which the gridlet (a single job) is submitted for processing. The earliest time the log refers to is zero, which is the submittal time of the first job. The lines in the log are sorted by ascending submittal times.
3. Run Time: in seconds; the time for which the gridlet will use a single processing element (PE) of the CPU.
4. Number of Allocated Processors: an integer. In most cases this is also the number of processors the job uses; a job may require more than one PE.

TABLE I
Workload Processed through Simulator

Job ID | Input Time | Run Time | Number of PE Required
1 | 5 | 20 | 2
2 | 8 | 30 | 2
3 | 10 | 35 | 2
4 | 12 | 40 | 2
5 | 15 | 45 | 2
6 | 18 | 50 | 2
7 | 20 | 55 | 2
8 | 23 | 60 | 2
9 | 27 | 65 | 2
10 | 29 | 70 | 2
11 | 33 | 75 | 2
12 | 34 | 80 | 2
13 | 39 | 85 | 2
14 | 41 | 90 | 2
15 | 42 | 95 | 2
16 | 43 | 100 | 2
17 | 44 | 105 | 2
18 | 48 | 110 | 2
19 | 49 | 115 | 2
20 | 50 | 120 | 2
21 | 51 | 125 | 2
22 | 52 | 130 | 2
23 | 53 | 135 | 2
24 | 54 | 140 | 2
25 | 55 | 145 | 2

TABLE II
Comparison of simulation output of Superscheduler Architecture vs. Proposed Optimized Superscheduler Architecture

GridletId | Input Time | Local Queue Arrival Time in SSCH | Local Queue Arrival Time in OSCH | IPC Delay in SSCH | IPC Delay in OSCH | Total Cost in SSCH | Total Cost in OSCH
1 | 5 | 7.25 | 6.25 | 1 | 0 | 124 | 123
2 | 8 | 11.25 | 9.25 | 2 | 0 | 185 | 183
3 | 10 | 14.25 | 11.25 | 3 | 0 | 216 | 213
4 | 12 | 17.25 | 13.25 | 4 | 0 | 247 | 243
5 | 15 | 21.25 | 16.25 | 5 | 0 | 278 | 273
6 | 18 | 25.25 | 19.25 | 6 | 0 | 307 | 303
7 | 20 | 28.25 | 21.25 | 7 | 0 | 337 | 333
8 | 23 | 32.25 | 24.25 | 8 | 0 | 367 | 363
9 | 27 | 37.25 | 28.25 | 9 | 0 | 400 | 393
10 | 29 | 40.25 | 30.25 | 10 | 0 | 429 | 423
11 | 33 | 45.25 | 34.25 | 11 | 0 | 462.75 | 453
12 | 34 | 47.25 | 35.5 | 11.75 | 0 | 490 | 483
13 | 39 | 48.5 | 40.25 | 8.25 | 0 | 519.25 | 513
14 | 41 | 53.25 | 42.25 | 11 | 0 | 552.75 | 543
15 | 42 | 56.25 | 43.5 | 12.75 | 0 | 584.5 | 573
16 | 43 | 58.25 | 44.75 | 13.5 | 0 | 615.25 | 603
17 | 44 | 60.25 | 46 | 14.25 | 0 | 644 | 633
18 | 48 | 62.25 | 49.25 | 13 | 0 | 674.75 | 663
19 | 49 | 69.25 | 50.5 | 18.75 | 0 | 710.5 | 693
20 | 50 | 71.25 | 51.75 | 19.5 | 0 | 741.25 | 723
21 | 51 | 73.25 | 53 | 20.25 | 0 | 772 | 753
22 | 52 | 75.25 | 54.25 | 21 | 0 | 802.75 | 783
23 | 53 | 77.25 | 55.5 | 21.75 | 0 | 833.5 | 813
24 | 54 | 79.25 | 56.75 | 22.5 | 0 | 864.25 | 843
25 | 55 | 81.25 | 58 | 23.25 | 0 | 896.25 | 873
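A small consistency check can be run over Table II's values: the reported IPC delay in SSCH should equal the difference between the two local-queue arrival times (a sketch; only a few sample rows from the table are checked).

```java
// Consistency check on sample rows of Table II: the SSCH IPC delay
// equals (arrival time in SSCH) - (arrival time in OSCH).
class TableIICheck {
    public static void main(String[] args) {
        // {gridlet id, arrival in SSCH, arrival in OSCH, reported IPC delay}
        double[][] rows = {
            {1, 7.25, 6.25, 1},
            {12, 47.25, 35.5, 11.75},
            {25, 81.25, 58, 23.25},
        };
        for (double[] r : rows) {
            double delay = r[1] - r[2]; // derived IPC delay
            System.out.println((int) r[0] + ": " + (delay == r[3]));
        }
        // prints "1: true", "12: true", "25: true"
    }
}
```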
Simulation output of the workload processed through both job Superscheduling architectures is compared in Table II, which has the following attributes:
1. Gridlet Id: the job id.
2. Status: the status of the processed job; the value is "successful" if the job has been processed successfully, and "unsuccessful" otherwise. All jobs in this run were processed successfully.
3. Input Time: the time at which a job initially enters the system.
4. Local Queue Arrival Time: the time at which the job enters the local queue for local processing.
5. Inter Process Communication Delay (IPCD): the IPC delay experienced by each job before entering the local queue. The value is derived by subtracting the local queue arrival time of each job in the optimized architecture from its local queue arrival time in the earlier architecture. In the earlier architecture each job experiences IPC delay before entering the local queue, while in the optimized architecture each job is submitted directly to the local queue and therefore experiences no IPC delay.
6. Execution Start Time: the time at which a job is picked up by the processor from the local queue.
7. Finish Time: the time at which a job leaves the processor.
8. Processing Cost: the difference between finish time and execution start time.
9. Total Cost: the total cost of each processed job, computed from its processing cost and its IPC delay cost.

The comparison of workload processing through both architectures shows a difference in the local queue arrival time of each job. For example, job 1 arrives at the local queue 1 second later in the earlier scheduling technique (7.25) than in the proposed technique (6.25), while the input time (5) is the same for both. Similarly, job 2 is submitted at time 8 in both techniques but arrives at the local queue 2 seconds later in the earlier technique (11.25) than in the proposed optimized technique (9.25), and job 25, submitted at time 55, arrives 23.25 seconds later in the earlier technique (81.25) than in the proposed one (58). This is because of the extra communication among the different components of the Superscheduler architecture. The total cost comparison in Table II indicates that the total cost of a gridlet depends on two parameters: processing cost and IPC delay cost. Processing cost depends on the run time and the number of CPUs needed by each gridlet; the simulation results show that it remains the same in both scheduling techniques, while the earlier technique incurs an extra IPC delay that increases total cost. If a gridlet experiences n seconds of IPC delay, its total cost increases by n. For example, in Table II, gridlet 1 has total cost 123 in the proposed optimized technique, while in the earlier technique it experiences an IPC delay of one second and therefore has total cost 124. Similarly, the total cost of each gridlet increases with its IPC delay. In the proposed architecture, IPC has been minimized in local scheduling, and the performance of the local processing scenario has therefore been improved. The results of the comparison in Table II are elaborated in the following charts.

Fig. III. IPC delay comparison for each job in both scheduling techniques
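The total-cost relation discussed above (total cost = processing cost + IPC delay cost) can be checked against the gridlet 1 values stated in the text; the helper below is a sketch of that relation only.

```java
// Sketch of the stated relation: total cost = processing cost + IPC
// delay cost, checked with the gridlet 1 values quoted in the text.
class TotalCost {
    static double totalCost(double processingCost, double ipcDelayCost) {
        return processingCost + ipcDelayCost;
    }

    public static void main(String[] args) {
        double osch = totalCost(123, 0); // optimized: no IPC delay cost
        double ssch = totalCost(123, 1); // earlier: 1 s IPC delay cost
        System.out.println(osch + " " + ssch); // 123.0 124.0
    }
}
```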
As stated earlier, in the optimized architecture each job enters the local queue instead of the grid queue and therefore experiences no IPC delay. Fig. III compares the IPC delay experienced by each job in both architectures: the blue line shows the trend of increasing IPC delay in the earlier architecture, while the pink line shows zero unnecessary IPC delay for each job. The graph is based on the output of the simulation.
Local Queue Arrival Time
Comparison of SuperScheduler Gridlet Local Queue Local Queue Arrival Time with Optimized Scheduler Arrival Time 90 80 70 60 50 40 30 20 10 0
V. CONCLUSION
1
3
5
7
9 11 13 15 17 19 21 23 25 Gridlet
Local Queue Arrival Time in SSCH
Local Queue Arrival Time In OSCH
Fig. IV: Comparison of local queue arrival time in both architectures.
Fig. IV compares the local queue arrival time of each job in both architectures. The purple lines indicate the local queue arrival time in the earlier architecture, and the red lines the local queue arrival time in the optimized architecture. Jobs clearly take longer to enter the local queue in the earlier architecture than in the optimized one. This is because in the earlier architecture a job first enters the grid queue and then experiences an IPC delay, caused by inter-process communication among the different components of the architecture, before it enters the local queue, whereas in the optimized architecture each job enters the local queue directly for processing. There is thus no unnecessary inter-process communication, and therefore no IPC delay.

Fig. V. Total cost comparison (TotalCost = PC + IPCDC) of the Superscheduler architecture with the optimized scheduler architecture (x-axis: Gridlet 1-25; series: TotalCost in SSCH, TotalCost in OSCH).

Fig. V compares the total cost of each job processed through both architectures. The blue lines indicate the total cost of each job processed through the earlier architecture, and the red lines the total cost of each job in the optimized architecture. The total cost of each job is computed from its processing cost and its IPC delay cost. Each job has the same processing cost in both architectures, but in the earlier architecture there is an additional IPC delay cost for each job, so the total costs differ: the earlier architecture shows a trend of increasing total cost per job because of the unnecessary IPC delay.
V. CONCLUSION

The simulation results support the conclusions of this work. The extra inter-process communication involved in local job processing in the earlier Superscheduler architecture adversely affects the performance of the local system: a large number of IPCs means a large number of domain transitions and context switches, so each job experiences an unnecessary IPC delay before reaching the local queue. This unnecessary communication has been eliminated in the proposed architecture, and the results show a substantial performance improvement.

VI. FUTURE WORK

This work does not consider grid-level scheduling, which is the other perspective of the Superscheduler architecture. The work may be extended to external scheduling as well, to further improve total system performance. Future work may also consider job migration policies for external job migration: the current policies do not account for many of the complexities involved in network communication, so a further optimized architecture is expected.
SESSION GRID COMPUTING APPLICATIONS + ALGORITHMS Chair(s) TBA
Rapid Prototyping Capabilities for Conducting Research of Sun-Earth System T. Haupt, A. Kalyanasundaram, I. Zhuk High Performance Computing Collaboratory, Mississippi State University, USA
Abstract - This paper describes the requirements, design and implementation progress of an e-Science environment to enable rapid evaluation of potential uses of NASA research products and technologies to improve future operational systems for societal benefits. This project is intended to be a low-cost effort focused on integrating existing open source, public domain, and/or community-developed software components and tools. Critical for success is a carefully designed implementation plan allowing for incremental enhancement of the scale and functionality of the system while maintaining an operational system and hardening its implementation. This has been achieved by rigorously following the principles of separation of concerns, loose coupling, and service-oriented architectures employing Portlet (GridSphere), Service Bus (ServiceMix), and Grid (Globus) technologies, as well as introducing a new layer on top of the THREDDS data server. At the current phase, the system provides data access through a data explorer that allows the user to view the metadata and provenance of the datasets, invoke data transformations such as subsampling, reprojection, format translation, and de-clouding of selected data sets or collections, and generate simulated data sets approximating the data feed from future NASA missions. Keywords: Science Portals, Grid Computing, Rich Interfaces, Data Repository, Online Tools
1 Introduction

1.1 Objectives of Rapid Prototyping Capability
The overall goal of the National Aeronautics and Space Administration's (NASA) initiative to create a Rapid Prototyping Capability (RPC) is to speed the evaluation of potential uses of NASA research products and technologies to improve future operational systems, by reducing the time to access, configure, and assess the effectiveness of NASA products and technologies. The developed RPC infrastructure will accomplish this goal and contribute to NASA's Strategic Objective to advance scientific knowledge of the Earth system through space-based observation, assimilation of new observations, and development and deployment of enabling technologies, systems and capabilities, including those with the potential to improve future operational systems.
Figure 1: The RPC concept as an integration platform for composing, executing, and analyzing numerical experiments for Earth-Sun System Science, supporting the location transparency of resources.

The infrastructure to support Rapid Prototyping Capabilities (RPC) is thus expected to provide the capability to rapidly evaluate innovative methods of linking science observations. To this end, the RPC should provide the capability to integrate the tools needed to evaluate the use of a wide variety of current and future NASA sensors and research results, model outputs, and knowledge, collectively referred to as "resources". It is assumed that the resources are geographically distributed, and thus the RPC will provide support for the location transparency of the resources. This paper describes a particular implementation of an RPC system under development by the Mississippi Research Consortium, in particular Mississippi State University, under a NASA/SSC contract as part of the NASA Applied Sciences Program. This is a work in progress, about one year from the inception of the project.

1.2 RPC experiments
Results of NASA research (including NASA partners) provide the basis for candidate solutions that demonstrate the capacity to improve future operational systems through activities administered by NASA’s Applied Sciences Program. Successfully extending NASA research results to operational organizations requires science rigor and capacity throughout the pathways from research to operations. A framework for the extension of applied sciences activities involves a Rapid Prototyping Capability (RPC) to accelerate
the evaluation of research results in an effort to identify candidate configurations for future benchmarking efforts. The results of the evaluation activity are verified and validated in candidate operational configurations through RPC experiments. The products of RPC studies will be archived and made accessible to all customers, users and stakeholders via the Internet, with the purpose of being used in competitively selected experiments proposed by the applied sciences community through NASA's "Decisions" solicitation process [1]. Examples of currently funded RPC experiments (through the NASA grant to the Mississippi Research Consortium (MRC)) include:
• rapid prototyping of new NASA sensor data into the SEVIR system,
• rapid prototyping of hyperspectral image analysis algorithms for improved invasive species decision support tools,
• an RPC evaluation of the watershed modeling program HSPF against existing NASA data, simulated future data streams, and model (LIS) data products, and
• evaluation of the NASA Land Information System (LIS) using rapid prototyping capabilities.
2 System Requirements

The requirements for the infrastructure to support RPC experiments fall into two categories: (1) a computational platform seamlessly integrating geographically distributed resources into a single system to perform RPC experiments, and (2) a collaborative environment for the dissemination of research results enabling a peer-review process.

2.1 Enabling RPC Experiments
The RPC is expected to support at least two major categories of experiments (and subsequent analysis): comparing the results of a particular model fed with data coming from different sources, and comparing different models using data coming from the same source, as depicted in Fig. 2. Despite being conceptually simple, the two use cases defined in Fig. 2 in fact entail a significant technical challenge. The barriers currently faced by researchers include inadequate data access mechanisms, the lack of simulated data approximating feeds from sensors to be deployed by future NASA missions, a plethora of data formats and metadata systems, complex multi-step data pre-processing, and the rigorous statistical analysis of results (comparisons between results obtained using different models and/or data). The data from NASA and other satellite missions are distributed by Distributed Active Archive Centers (DAAC) operated by NASA and its partners. The primary focus of the DAACs is to feed post-processed data (e.g., calibrated, corrected for atmospheric effects, etc.), referred to as data products, for operational use by the US government and organizational users around the world. Access to the data by an individual researcher is currently cumbersome (requests are processed asynchronously, as the data in most cases are
not readily available online), and the pre-processing performed by the DAACs usually does not meet the researcher's needs. In particular, the purpose of many RPC experiments is to define new pre-processing procedures that, if successful, can later be employed by DAACs to generate new data products.

Figure 2: Two major categories of experiments, and subsequent analysis, to be supported by RPC.

The pre-processing of the data takes many steps, and the steps to be performed depend on the technical details of the sensor and the nature of the research. For the sake of brevity, only Moderate Resolution Imaging Spectroradiometer (MODIS) [2] data are discussed here as a representative example. MODIS sensors are deployed on two platforms, Aqua and Terra, which view the entire Earth's surface every 1 to 2 days, acquiring data in 36 spectral bands. The data (the planetary reflectance) are captured in swaths 2330 km wide (cross track) by 10 km (along track at nadir). The post-processing of MODIS data may involve the selection of the region of interest (which may require combining several swaths taken at different times, and possibly merging data from Terra and Aqua), sub-sampling, re-projection, band selection, computation of vegetation or moisture indices by combining data from different spectral bands, noise removal and de-clouding, feature detection, correlation with GIS data, and other steps. The post-processed data are then fed into computational models and/or compared with in situ observations (changes in vegetation, changes in soil moisture, fires, etc.). Of particular interest for the RPC experiments currently performed by the MRC is the time evolution of the Normalized Difference Vegetation Index (NDVI), defined as (NIR-RED)/(NIR+RED), where RED and NIR stand for the spectral reflectance measurements acquired in the red and near-infrared regions, respectively.
Different algorithms are being tested for eliminating gaps in the data caused by cloud cover, by fusing data collected by Aqua and Terra and by weighted spatial and temporal interpolation. Finally, comparing data coming from different sources (and the corresponding model predictions) requires handling differences in spatial and temporal resolution, satellite orbits, spectral bands, and other sensor characteristics.
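The NDVI definition and gap filling discussed above can be sketched as follows. This is an illustrative sketch, not the MRC algorithms: the reflectance values are made up, and the simple linear gap filler assumes every cloud gap has valid neighbours on both sides.

```python
# NDVI from red and near-infrared reflectance, plus a simple linear
# temporal interpolation to fill cloud gaps in an NDVI time series.

def ndvi(red, nir):
    # Normalized Difference Vegetation Index: (NIR - RED) / (NIR + RED).
    return (nir - red) / (nir + red)

def fill_gaps(series):
    # Replace None entries (cloud-masked scenes) by linear interpolation
    # between the nearest valid observations on either side.
    out = list(series)
    for i, v in enumerate(out):
        if v is None:
            lo = next(j for j in range(i - 1, -1, -1) if out[j] is not None)
            hi = next(j for j in range(i + 1, len(out)) if out[j] is not None)
            out[i] = out[lo] + (out[hi] - out[lo]) * (i - lo) / (hi - lo)
    return out

series = [ndvi(0.05, 0.45), None, ndvi(0.06, 0.42)]  # middle scene clouded
print(fill_gaps(series))
```

A production de-clouding scheme would additionally fuse Aqua and Terra observations and weight the spatial and temporal neighbours, as the text notes.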
Enabling RPC experiments, in this context, thus means a radical simplification of access to both actual and simulated data, as well as to tools for data pre- and post-processing. The tools must be interoperable, allowing the user to create computational workflows with the data seamlessly transferred as needed, including third-party transfers to high-performance computing platforms. In addition, the provenance of the data must be preserved, in order to document the results of different what-if scenarios and to enable collaboration and data sharing between users. The development of the RPC system does not involve developing the tools for data processing. These tools are expected to be provided by the researchers performing experiments, by projects focused on tool development, and by the community at large. Indeed, many tools for handling Earth science data are available from different sources, including NASA, USGS, NOAA, UCAR/Unidata, and numerous universities. Instead, the RPC system is expected to be an integration platform supporting adding ("plugging in") tools as needed.
2.2 Enabling a Community-Wide Peer-Review Process

The essence of the RPC process is to provide an evaluation of the feasibility of transferring research capabilities into routine operations for societal benefits. The evaluation should result in a recommendation for the NASA administrators to pursue or abandon the topic under investigation. Since making an evaluation requires narrow expertise in a given field (e.g., invasive species, crop predictions, fire protection, etc.), the results presented by a particular researcher need to be peer-reviewed. One way of doing that is publishing papers in professional journals and conferences. However, this introduces latency into the process, and the information given in a paper is not always sufficient for a conclusive evaluation of the research results. The proposed remedy is to provide means for publishing the results electronically, that is, giving the community access not only to the final reports and publications but also to the data used and/or produced during the analysis, as well as to the tools used to derive the conclusions of the evaluation. The intention is not to let peer scientists repeat a complete experiment, which may involve processing voluminous data on high-performance systems, but rather to provide means for testing the new procedures, tools, and final analysis developed in the course of performing the experiment.
3 Design Considerations

The development of an RPC system satisfying all the requirements described above is an immense task. Consequently, one of the most important design decisions was to prioritize the system features and select a sequence of actions that would lead towards the implementation of the full functionality. Taking into account the particular needs of the experiments carried out by the MRC, the following implementation roadmap has been agreed upon [3].

Phase I: An interactive web site for describing the experiments and gathering feedback from the community. All experiments are performed outside the RPC infrastructure.

Phase II: An RPC data server acting as a cache for experimental data (Unidata's THREDDS server [4]). In the prototype deployment, a small amount (~6 TBytes) of disk space is made available to the experimenters, with support for transfers of data between the RPC data server and a hierarchical storage facility at the High Performance Computing Collaboratory (HPCC) at Mississippi State University via a 2 MBytes/s link. The experimenters obtain the data from the DAACs "the old way" (through asynchronous requests) and store them at HPCC, or generate them using computational models run on HPCC Linux clusters. Once transferred to the RPC data server, the data sets are available online. This is a transitional step, and the experiments are still executed outside the RPC infrastructure. However, since the data are online, they can be accessed by various standalone tools, such as Unidata's Integrated Data Viewer (IDV) [5].

Phase III: Online tools for data processing ("transformations"). The tools are deployed as web services and integrated with the RPC data server. Through a web interface, the user sets the transformation parameters and selects the input data sets by browsing or searching the RPC data server. The results of the transformations (together with the available provenance information) are stored back on the RPC data server at the location specified by the user. The provenance information depends on the tool: in some cases it is just the input parameter files and the standard output; other tools generate additional log files and metadata.
Since the THREDDS server "natively" handles data in the netCDF [6] format, the primary focus is on tools for transforming NASA's HDF-EOS [7] format (for example, MODIS data are distributed in this format), including the HDF-EOS to GeoTIFF Conversion Tool (HEG) [8], supporting reformatting, band selection, subsetting and stitching, and the MODIS re-projection tools (MRT [9] and MRTSwath [10]). The second class of tools integrated with the RPC system comprises the Applications Research Toolbox (ART) and the Time Series Product Tool (TSPT), developed specifically for the RPC system by the Institute for Technology Development (ITD) [11], located at the NASA Stennis Space Center. The ART tool is used for generating simulated Visible/Infrared Imager/Radiometer Suite (VIIRS) data. VIIRS is part of the National Polar-orbiting Operational Environmental Satellite System (NPOESS) program and is expected to be deployed in 2009; VIIRS data will replace MODIS data. The TSPT tool generates layer stacks of various data products to assist in time series analysis (including de-clouding); in particular, TSPT operates on MODIS and simulated VIIRS data. Finally, the RPC system integrates the Performance Metrics Workbench tools for data visualization and statistical analysis. These tools have been developed at the GeoResources Institute at
Mississippi State University and the Geoinformatics Center at the University of Mississippi. At this phase, the experimenters can use the RPC portal for rapid prototyping of experimental procedures using online, interactive tools on data uploaded to the RPC data server. Furthermore, peer researchers can test the proposed methods using the same data sets and the same tools.

Phase IV: Support for batch processing. The actual data analysis needed to complete an experiment usually requires processing huge volumes of data (e.g., a year's worth of data). This is impractical to perform interactively using web interfaces. Instead, support for asynchronous ("batch") processing is provided. The tools are still deployed as web services; however, they delegate the execution to remote high-performance computational resources. The user selects a range of files (or a folder), sets the transformation parameters, and submits the processing of all selected files by clicking a single submit button. Since the data pre-processing is usually embarrassingly parallel (the same operation is repeated for each file, or for each pixel across a group of files in TSPT), the user automatically gains by using the Portal, as the system seamlessly makes all necessary data transfers and parallelizes the execution. Since batch execution is asynchronous, the Portal provides tools for monitoring the progress of the task. Furthermore, even very complex computational models (as opposed to relatively simple data transformation tools) can easily be converted to a Web service, and thus all of the computational needs of the user can be satisfied through the Portal. At this phase the user may actually perform the experiment using the RPC infrastructure, assuming that the input data sets are "prefetched" to the RPC data server or the HPCC storage facility, all computational models are installed on HPCC systems, and all tools are wrapped as Web services.
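The "embarrassingly parallel" batch step described above can be sketched as a fan-out over the selected files. This is a minimal sketch, not RPC code: `process_file` is a placeholder for one transformation invocation (e.g. a re-projection), and the file names are invented.

```python
# The same transformation is applied independently to each selected file,
# so the work can be fanned out across workers.

from concurrent.futures import ThreadPoolExecutor

def process_file(path):
    # Stand-in for invoking one transformation on one input file.
    return path + ".out"

def run_batch(paths, workers=4):
    # Results come back in input order, like a serial loop, but the
    # transformations run concurrently.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_file, paths))

print(run_batch(["swath1.hdf", "swath2.hdf"]))
```

In the RPC system this fan-out happens behind the Portal, with the added steps of staging files to the compute resource and monitoring the asynchronous jobs.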
Phase V: The RPC system is deployed at the NASA Stennis Space Center, and it becomes a seed for a Virtual Organization (VO). Each deployment comes with its own portal, creating a network of RPC points of access. Each portal deploys a different set of tools that are accessible through a distributed Service Bus. Each site contributes storage and computational resources that are shared across the VO. In collaboration with the DAACs, support for online access is developed.
4 Implementation

4.1 Grid Portal

The functionality of the RPC Portal naturally splits into several independent modules, such as the interactive Web site, the data server, the tools' interfaces, and the monitoring service. Each such module is implemented as an independent portlet [12]. The RPC Portal aggregates the different contents provided by the portlets into a single interface employing the popular GridSphere [13] open source portlet container. GridSphere,
while fully JSR-168 compliant, also provides out-of-the-box support for user authentication and for maintaining user credentials (X.509 certificates, MyProxy [14]), which is vital when providing access to remote resources. Access to the full functionality of the Portal, which includes the right to modify the contents served by the Portal, is granted only to registered users, who must explicitly log in to start a Portal session. In addition, to access remote resources, the user must upload his or her certificate to the MyProxy server associated with the Portal, using a portlet developed by the GridSphere team. In phases II-IV of the RPC system deployment, the only remote resources available to RPC users are those provided by HPCC. Remote access to the HPCC resources is granted to registered users with certificates signed by the HPCC Certificate Authority (CA). Phase V of the deployment calls for establishing a virtual organization allowing the users to access the resources made available by the VO participants, perhaps including the NASA Columbia project and TeraGrid. To simplify the user's task of obtaining and managing certificates, the Grid Account Management Architecture (GAMA) [15] will be adopted. It remains to be determined which CA(s) will be recognized, though.
4.2 Interactive Web Site

It is imperative for the RPC system to provide an easy way for the experimenters to update the contents of the web pages describing their experiments; in particular, to avoid intermediaries such as a webmaster. A ready-to-use solution for this problem is a wiki: a collaborative website which can be directly edited by anyone with access to it [16]. From the several available open source implementations, MediaWiki [17] has been chosen for the RPC portal, as the RPC developers are impressed by the robustness of the implementation, proven by the multi-million-page Wikipedia [18]. MediaWiki is deployed as a portlet managed by GridSphere. The only (small) modification introduced to MediaWiki in order to integrate it with the RPC Portal is replacing the direct MediaWiki login with an automatic login for users who have successfully logged in to GridSphere. With this modification, by a single login to the RPC Portal the user not only gets access to the RPC portlets and the remote resources accessible to RPC users (through automatic retrieval of the user certificate from the MyProxy server) but also acquires the rights to modify the wiki contents. The rights to modify the wiki contents are group-based. Each group is associated with a namespace, and only members of the group can make modifications to the pages in the associated namespace. For example, only participants of an RPC experiment can create and update pages describing that experiment, while anyone can contribute to the blog area and participate in the discussion of the experiment pages. In
addition, each group is associated with a private namespace – not accessible to nonmembers at all – which enables collaborative development of confidential contents.
4.3 Data Server

The science of the Sun-Earth system is notorious for collecting an incredible amount of observational data that come from different sources, in a variety of formats, and with inconsistent and/or incomplete metadata. The solution of the general problem of managing such data collections is the subject of numerous research efforts, and it goes far beyond the scope of this project. Instead, for the purpose of the RPC system it is desirable to adopt an existing solution representing common community practice. Even though such a solution is necessarily incomplete, by virtue of actually being used by researchers it is useful enough and robust enough to be incorporated into the RPC infrastructure. From the available open source candidates, Unidata's THREDDS Data Server (TDS) has been selected and deployed as a portlet. In order to better integrate it with the rest of the RPC Portal functionality, and in particular to provide user-friendly interfaces to the data transformations, a thin layer of software on top of TDS, referred to as the TDS Explorer, has been developed.
4.4 THREDDS Data Server

THREDDS (Thematic Real-time Environmental Distributed Data Services) [4] is middleware developed to simplify the discovery and use of scientific data and to allow scientific publications and educational materials to reference scientific data. Catalogs are the heart of the THREDDS concept. They are XML documents that describe on-line datasets and can contain arbitrary metadata. The THREDDS Catalog Generator produces THREDDS catalogs by scanning or crawling one or more local or remote dataset collections. Catalogs can be generated periodically or on demand, using configuration files that control which directories get scanned and how the catalogs are created. The THREDDS Data Server (TDS) actually serves the contents of the datasets, in addition to providing catalogs and metadata for them. The TDS uses the Common Data Model to read datasets in various formats, and serves them through OPeNDAP, the OGC Web Coverage Service, NetCDF subset, and bulk HTTP file transfer services. The first three allow the user to obtain subsets of the data, which is crucial for large datasets. Unidata's Common Data Model (CDM) is a common API for many types of data, including OPeNDAP, netCDF, HDF5, GRIB 1 and 2, BUFR, NEXRAD, and GINI. A pluggable framework allows other developers to add readers for their own specialized formats. Out of the box, TDS provides most of the functionality needed to support data sets commonly used in climatology applications
(e.g., weather forecasting, climate change) and in GIS applications, because of the supported file formats. It is possible to create CDM-based modules to handle other data formats, in particular HDF4-EOS, which is critical for many RPC experiments; however, that would possibly lead to a loss of the metadata information embedded in the HDF headers. Furthermore, while TDS provides support for subsetting CDM-based data sets, it does not allow for other operations often performed on HDF4-EOS data, such as re-projections. To minimize the modifications and extensions to TDS needed to integrate it with the RPC infrastructure, the new functionality needed for RPC is developed as a separate package (a web application) that acts as an intermediary between the user interface and TDS. Requests for services that can be rendered by TDS are forwarded to TDS, while the others are handled by the intermediary: the TDS Explorer.
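The catalog idea described above can be illustrated with a toy example. The XML below is a simplified, hypothetical THREDDS-style catalog; real TDS catalogs use the THREDDS XML namespace and carry much richer metadata and service definitions.

```python
# A minimal catalog of on-line datasets, and how a client might list
# the datasets it describes.

import xml.etree.ElementTree as ET

CATALOG = """<catalog name="RPC demo catalog">
  <dataset name="MODIS NDVI 2007" urlPath="rpc/modis_ndvi_2007.nc"/>
  <dataset name="Simulated VIIRS" urlPath="rpc/viirs_sim.nc"/>
</catalog>"""

def list_datasets(xml_text):
    # Return (name, urlPath) for every dataset entry in the catalog.
    root = ET.fromstring(xml_text)
    return [(d.get("name"), d.get("urlPath")) for d in root.iter("dataset")]

print(list_datasets(CATALOG))
```

Because such catalogs are plain XML, the TDS Explorer layer described next can build its navigation tree by walking them, without touching the underlying data files.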
4.5 TDS Explorer

The TDS native interface allows browsing the TDS catalog one page at a time, which makes the navigation of a hierarchical catalog a tedious process. To remedy that, a new interface inspired by the familiar Microsoft File Explorer (MSFE) has been developed. The structure of the catalog is now represented as an expandable tree in a narrow pane (iframe) on the left-hand side of the screen. The selection of a tree node or leaf results in displaying the catalog page of the corresponding data collection or data set, respectively, in an iframe occupying the rest of the screen. The TDS Explorer not only makes the navigation of the data repository more efficient, it also simplifies the development of interfaces to other services not natively supported by TDS. Among these services are:
• Creating containers for new data collections and uploading new data sets. The creation of a new container is analogous to creating a new folder in MSFE: select a parent folder, from the menu select the option "new collection", in a pop-up iframe type in the name of the new collection, and click OK (or cancel). There are two modes of uploading files: from the user workstation using HTTP, and from a remote location using GridFTP (until phase V of the deployment, only the HPCC storage facility). In either case, select the destination collection, from the menu select the option "uploadHTTP" or "uploadGridFTP", select the file(s) in the file chooser pop-up, and click OK (or cancel).
• Renaming and deleting datasets and collections: select the dataset or data collection and use the corresponding option in the menu.
• Downloading the data, either to the user's desktop using HTTP or by transferring it to a remote location using GridFTP.
• Displaying the provenance of a dataset. When this option is chosen, the list of files that were generated when creating the dataset, if any, is displayed (instead of the TDS catalog page). Typically, the provenance files are generated when an RPC tool is used to create a dataset, and the list may include the standard output, the input parameter file, a log file, a metadata record, or others, depending on the tool.
• Invoking tools for the selected fileset(s) or collection. Some tools operate on a single dataset (e.g., the multispectral viewer), others may be invoked for several datasets (e.g., the HEG tool), and yet others operate on data collections (e.g., TSPT). The tool GUIs pop up as new iframes. The tools are described in Section 4 below.
The user interface of the TDS Explorer is implemented using JavaScript, including AJAX. The server-side functionality is a web application using JSP technology. The file operations (upload, delete, rename) are performed directly on the file system. Changes in the file system are propagated to the TDS Explorer tree by forcing TDS to recreate the catalog by rescanning the file system (with optimizations prohibiting the rescanning of folders that have not changed). Finally, the TDS Explorer web application invokes the TDS API to translate dataset logical names into physical URLs, as needed.
4.6 RPC Tools

From the perspective of integration, there are three types of tools. The first type is standalone tools capable of connecting to the RPC data server to browse and select datasets but otherwise performing all operations independently of the RPC infrastructure. The Unidata IDV is an example of such a tool. Obviously, such tools require no support from the RPC infrastructure except for making the RPC data server conform to accepted standards (such as DODS). One of the advantages of TDS is that it supports many of the commonly used data protocols, and consequently the data served by the RPC data server may be accessed by many existing community-developed tools, immediately enhancing the functionality of the RPC system. The second type of tools is transformations that take a dataset or a collection as input, and output the transformed files. Examples of such transformations are HEG, MRT, ART, and TSPT. They come with a command-line interface (a MatLab executable in the case of ART and TSPT) and are controlled by a parameter input file. The integration of such tools with the RPC infrastructure is made in two steps. First, a Web-based GUI is developed (using JavaScript and AJAX as needed to create lightweight but rich interfaces) to produce the parameter file. The GUI is integrated with the TDS Explorer to simplify the user's task of selecting the input files and defining the destination of the output files. The other step is to install the executable on the system of choice and convert it to a service. To this end, the open source ServiceMix [19], which implements the JSR-208 Java Business Integration (JBI) Specification [20], is used; implementations of JBI are usually referred to as a "Service Bus". Depending on the user-chosen
target machine, the service either forks a new process on one of the servers associated with the RPC system or submits the job to the remote machine using Globus GRAM [21]. In the case of remote submission, the service registers itself as the listener for GRAM job status change notifications. The notifications are forwarded to a Job Monitoring Service (JMS). JMS stores the information on the jobs in a database (MySQL). A separate RPC portlet provides the user interface for querying the status of all jobs submitted by the user. The request for a job submission (local or remote) contains an XML job descriptor that specifies all information needed to submit the job: the location of the executable, values of environment variables, files to be staged in and out, etc. Consequently, the same ServiceMix service is used to submit any job, with the job descriptor generated by the transformation GUI (or a supporting JSP page). Furthermore, a new working directory is created for each instance of a job. Once the job completes, the result of the transformation is transferred to the TDS server at the location specified by the user, while "byproducts" such as standard output and log files, if created, are transparently moved to a location specified by the RPC server: a folder whose name is created automatically by hashing the physical URL of the transformation result. This approach eliminates unnecessary clutter in the TDS catalog: using the TDS Explorer, the user navigates only the actual datasets. If provenance information is needed, the TDS Explorer recreates the hash from the dataset URL and shows the contents of that directory, providing the user with access to all files there. Finally, the data viewers and statistical analysis tools do not produce new datasets. In this regard, they are similar to standalone tools. The advantage of integrating them with the RPC infrastructure is that the data can be preprocessed on the server side, reducing the volume of the necessary data transfers.
Because of their interactivity and rich functionality (visualizations), these tools are implemented as Java applets.
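The byproduct-folder naming scheme described above, hashing the physical URL of the transformation result, can be sketched as follows. This is an illustrative sketch: the hash function, the URL, and the function name are assumptions, not taken from the RPC implementation.

```python
import hashlib

def byproduct_folder(result_url: str) -> str:
    """Derive a byproduct folder name by hashing the physical URL of a
    transformation result (the actual hash used by the RPC server is
    not specified in the paper; SHA-1 is assumed here)."""
    return hashlib.sha1(result_url.encode("utf-8")).hexdigest()

# The same URL always maps to the same folder, so byproducts of a
# given result can be found again from the dataset URL alone.
folder = byproduct_folder("https://tds.example.org/data/result.nc")
```

Because the mapping is deterministic, the TDS Explorer can recreate the folder name from the dataset URL on demand, exactly as the provenance lookup above requires.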
Summary
This paper describes the requirements, design, and implementation progress of an e-Science environment to enable rapid evaluation of innovative methods of processing science observations, in particular data gathered by sensors deployed on NASA-launched satellites. This project is intended to be a low-cost effort focused on integrating existing open-source, public-domain, and/or community-developed software components and tools. Critical for success is a carefully designed implementation plan allowing for incremental enhancement of the functionality of the system, including incorporating new tools per user requests, while maintaining an operational system and hardening its implementation. This has been achieved by rigorously following the principles of separation of concerns, loose coupling, and service-oriented architecture, employing Portlet (GridSphere), Service Bus (ServiceMix), and Grid (Globus) technologies, as well as introducing a new layer on top of the
THREDDS data server (TDS Explorer). At the time of writing this paper, the implementation is well into phase IV, while continuing to add new tools. The already deployed tools allow for subsampling, reprojections, format translations, and de-clouding of selected data sets and collections, as well as for generating simulated VIIRS data approximating data feeds from future NASA missions.
References
[1] NASA Science Mission Directorate, Applied Sciences Program. Rapid Prototyping Capability (RPC) Guidelines and Implementation Plan, http://aiwg.gsfc.nasa.gov/esappdocs/RPC/RPC_guidelines_01_07.doc
[2] http://modis.gsfc.nasa.gov
[3] T. Haupt and R. Moorhead, "The Requirements and Design of the Rapid Prototyping Capabilities System", 2006 Fall Meeting of the American Geophysical Union, San Francisco, USA, December 2006.
[4] http://www.unidata.ucar.edu/projects/THREDDS/
[5] http://www.unidata.ucar.edu/software/idv/
[6] http://en.wikipedia.org/wiki/NetCDF
[7] http://hdf.ncsa.uiuc.edu/hdfeos.html
[8] http://newsroom.gsfc.nasa.gov/sdptoolkit/HEG/HEGHome.html
[9] http://edcdaac.usgs.gov/landdaac/tools/modis/index.asp
[10] http://edcdaac.usgs.gov/news/mrtswath_update_020106.asp
[11] http://www.iftd.org/rapid_prototyping.php
[12] JSR-168 Portlet Specification, http://jcp.org/aboutJava/communityprocess/final/jsr168/
[13] http://www.gridsphere.org
[14] MyProxy Credential Management Service, http://grid.ncsa.uiuc.edu/myproxy/
[15] K. Bhatia, S. Chandra, K. Mueller, "GAMA: Grid Account Management Architecture," First International Conference on e-Science and Grid Computing (e-Science'05), pp. 413-420, 2005.
[16] Howard G. "Ward" Cunningham, http://www.wiki.org/wiki.cgi?WhatIsWiki
[17] http://en.wikipedia.org/wiki/MediaWiki
[18] http://en.wikipedia.org/wiki/Wikipedia:About
[19] http://incubator.apache.org/servicemix/home.html
[20] http://jcp.org/en/jsr/detail?id=208
[21] http://www.globus.org
The PS3® Grid-resource model
Martin Rehr and Brian Vinter, eScience Center, University of Copenhagen, Copenhagen, Denmark
Abstract—This paper introduces the PS3® Grid-resource model, which allows any Internet-connected Playstation 3 to become a Grid node without any software installation. The PS3® is an interesting Grid resource, as each of the more than 5 million units sold worldwide contains a powerful heterogeneous multi-core vector processor well suited for scientific computing. The PS3® Grid node provides a native Linux execution environment for scientific applications. Performance numbers show that the model is usable when the input and output data sets are small. The resulting system is in use today, and freely available to any research project.
Keywords: Grid, Playstation 3, MiG
1. Introduction
The need for computational power grows daily, as an increasing number of scientific areas use computer modeling as a basis for their research. This evolution has led to a whole new research area called eScience. The increasing need for scientific computational power has been known for years, and several attempts have been made to satisfy the growing demand. In the 1990s, systems evolved from vector-based supercomputers to cluster computers built from commodity hardware, leading to a significant price reduction. In the late 1990s a concept called Grid computing [7] was developed, which describes the idea of combining the different cluster installations into one powerful computation unit. A huge computational potential beyond the scope of cluster computers is represented by machines located outside the academic perimeter. While traditional commodity machines are usually PCs based on the x86 architecture, a whole new target has turned up with the development and release of the Sony Playstation 3 (PS3®). The heart of the PS3® is the Cell processor. The Cell Broadband Engine Architecture (Cell BE) [4] is a new microprocessor architecture developed in a joint venture between Sony, Toshiba and IBM, known as STI. Each company has its own purpose for the Cell processor: Toshiba uses it as a controller for its flat-panel televisions, Sony uses it for the PS3®, and IBM uses it for its High Performance Computing (HPC) blades. The development of the Cell started in the year 2000, involved around 400 engineers for more than four years, and consumed close to half a billion dollars. The result is a powerful heterogeneous multi-core vector processor well suited for gaming and High Performance Computing (HPC) [8].
1.1. Motivation
The theoretical peak performance of the Cell processor in the PS3® is 153.6 GFLOPS in single precision and 10.98 GFLOPS in double precision [4]¹. According to the press, more than 5 million PS3's had been sold worldwide as of October 2007. This gives a theoretical peak performance of more than 768.0 peta-FLOPS in single precision and 54.9 peta-FLOPS in double precision, if one could combine them all in a Grid infrastructure. This paper describes two scenarios for transforming the PS3® into a Grid resource: firstly the Native Grid Node (NGN), where full control of the PS3® is obtained; secondly the Sandboxed Grid Node (SGN), where several issues have to be considered to protect the PS3® from faulty code, as the machine is used for purposes other than Grid computing. Folding@Home [6] is a scientific distributed application for folding proteins. The application has been embedded into the Sony GameOS of the PS3®, and is limited to protein folding. This makes it Public Resource Computing, as opposed to our model, which aims at Grid computing, providing a complete Linux execution environment aimed at all types of scientific applications.
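The aggregate peak figures quoted above follow directly from the per-console numbers; a quick back-of-envelope check:

```python
# Back-of-envelope check of the aggregate peak performance figures.
ps3_units = 5_000_000       # consoles sold worldwide (as of October 2007)
single_gflops = 153.6       # per-console single-precision peak
double_gflops = 10.98       # per-console double-precision peak

# Convert GFLOPS to peta-FLOPS: 1 PFLOPS = 10^6 GFLOPS.
total_single_pflops = ps3_units * single_gflops / 1e6
total_double_pflops = ps3_units * double_gflops / 1e6

print(round(total_single_pflops, 1))  # 768.0
print(round(total_double_pflops, 1))  # 54.9
```

This is, of course, a purely theoretical ceiling: it assumes every console could be combined in one Grid with no overhead.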
2. The Playstation 3
The PS3® is interesting in a Grid context due to the powerful Cell BE processor and the fact that the game console officially supports operating systems other than the default Sony GameOS.
2.1. The Cell BE
The Cell processor is a heterogeneous multi-core processor consisting of 9 cores. The primary core is an IBM 64-bit Power processor (PPC64) with 2 hardware threads. This core is the link between the operating system and the 8 powerful working cores, called SPE's for Synergistic Processing Element. The Power processor is called the PPE for Power Processing Element; figure 1 shows an overview of the Cell architecture. The cores are connected by an Element Interconnect Bus (EIB) capable of transferring up to 204 GB/s at 3.2 GHz.

1. The PS3® Cell has 6 SPE's available for applications. Each SPE runs at 3.2 GHz and is capable of performing 25.6 GFLOPS in single precision and 1.83 GFLOPS in double precision.
[Figure 1. An overview of the Cell architecture]

[Figure 2. An overview of the SPE: instruction prefetch and issue unit, local store, even and odd pipelines, register file, and the Memory Flow Controller (MFC) connecting to the Element Interconnect Bus (EIB)]

[Figure 3. An overview of the PS3® Hypervisor structure for the Grid-resource model. Legend: BEI = Cell Broadband Engine Interface, PPE = Power Processor Element, IOIF = I/O interface, SPE = Synergistic Processor Element, MIC = Memory Interface Controller, XIO = Rambus XDR I/O. Hypervisor Linux drivers for several devices are provided by Sony; some Linux drivers are not included on the PS3-LIVECD.]
Each SPE is dual pipelined, has a 128x128-bit register file, and has 256 kB of on-chip memory called the local store. Data is transferred asynchronously between main memory and the local store through DMA calls handled by a dedicated Memory Flow Controller (MFC). An overview of the SPE is shown in figure 2. By using the PPE as the primary core, the Cell processor can be used out of the box, as many existing operating systems support the PPC64 architecture. It is thereby possible to boot a PPC64 operating system on the Cell processor and execute PPC64 applications; however, these will only use the PPE core. To use the SPE cores it is necessary to develop code specifically for the SPE's, which includes setting up a memory communication scheme using DMA through the MFC.
2.2. The game console
Contrary to other game consoles, the PS3® officially supports alternative operating systems besides the default Sony GameOS. Even though other game consoles can be modified to boot alternative operating systems, this requires either an exploit of the default system or a replacement of the BIOS. Replacing the BIOS is intrusion at the highest level, expensive at large volume, and not usable beyond the academic perimeter. Security exploits are likely to be patched in the next firmware update, which makes this solution unusable in any scenario. Besides the difficulties of modifying other game consoles for our purposes, the processors used by the game consoles currently on the market, except for the PS3®, are not of any interest for scientific computing.
The fact that the PS3® is low priced from an HPC point of view, equipped with a high-performance vector processor, and supports alternative operating systems makes it interesting both as an NGN node and as an SGN node. All sold PS3's can be transformed into a powerful Grid resource with little effort from the owner of the console. Third-party operating systems work on top of the Sony GameOS, which acts as a hypervisor for the guest operating system (see figure 3). The hypervisor controls which hardware components are accessible from the guest operating system. Unfortunately the GPU is not accessible by guest operating systems², which is a pity, as it is in itself a powerful vector computation unit with a theoretical peak performance of 1.8 tera-FLOPS in single precision. However, 252 MB of the 256 MB of GDDR3 RAM located on the graphics card can be accessed through the hypervisor. The hypervisor reserves 32 MB of main memory and 1 of the 7 SPE's available in the PS3® version of the Cell processor³. This leaves 6 SPE's and 224 MB of main memory for guest operating systems. Lastly, a hypervisor model always introduces a certain performance decrease, as the guest operating system does not have direct access to the hardware.

3. The PS3® Grid resource
The PS3® supports alternative operating systems, making the transformation into a Grid resource rather trivial, as a suitable Linux distribution and an appropriate Grid client are the only requirements. However, if a large number of PS3's is targeted, this becomes cumbersome. Furthermore, if the PS3's

2. It is not clear whether this is to prevent games from being played outside the Sony GameOS, due to DRM issues, or due to the exposure of the GPU's register-level information.
3. The Cell processor consists of 8 SPE's, but in the PS3® one is removed for yield purposes: if one is defective it is removed, and if none is defective a good one is removed, to ensure that all PS3's have exactly 6 SPE's available for applications and to preserve architectural consistency.
located beyond the academic perimeter are to be reached, minimal administrative work from the donor of the PS3® is a vital requirement. Our approach minimizes the workload required to transform a PS3® into a powerful Grid resource by using a LIVECD. Using this CD, the PS3® is booted directly into a Grid-enabled Linux system. The NGN version of the LIVECD is targeted at PS3's used as dedicated Grid nodes and uses all the available hardware of the PS3®, whereas the SGN version uses the machine without making any change⁴ to it, and is targeted at PS3's used as entertainment devices as well as Grid nodes.
3.1. The PS3-LIVECD
Several requirements must be met by the Grid middleware to support the described LIVECD. First of all, the Grid middleware must support resources that can only be accessed through a pull-based model, which means that all communication is initiated by the resource, i.e. the PS3-LIVECD. This is required because the PS3's targeted by the LIVECD are most likely located behind a NAT router. Secondly, the Grid middleware needs a scheduling model where resources are able to request specific types of jobs; e.g. a resource can specify that only jobs targeted at the PS3® hardware model may be executed. In this work the Minimum intrusion Grid [11], MiG, is used as the Grid middleware. The MiG system is presented next, before describing how the PS3-LIVECD and MiG work together.
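The pull-based model described above can be sketched as a simple client loop in which every interaction is initiated by the resource, so no inbound connection (and hence no NAT traversal) is ever needed. The function names and the in-memory stubs below are hypothetical, standing in for the outbound HTTPS calls of the real system:

```python
def pull_loop(request_job, execute, deliver, max_jobs):
    """Resource-initiated job loop: repeatedly ask the Grid server for
    a job, run it, and push the result back.  All three callables
    represent outbound connections, so this works behind a NAT router."""
    done = 0
    while done < max_jobs:
        job = request_job()          # outbound HTTPS in the real system
        if job is None:
            break                    # no matching job available
        deliver(job, execute(job))   # outbound HTTPS again
        done += 1
    return done

# Demonstration with in-memory stubs standing in for the server:
queue = ["job-1", "job-2"]
delivered = []
n = pull_loop(
    request_job=lambda: queue.pop(0) if queue else None,
    execute=lambda job: f"result of {job}",
    deliver=lambda job, res: delivered.append((job, res)),
    max_jobs=10,
)
```

The key design point is that the server never contacts the resource; the resource's next poll is the only channel through which new work arrives.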
3.2. Minimum intrusion Grid
MiG is a stand-alone Grid platform that does not inherit code from any earlier Grid middleware. The philosophy behind the MiG system is to provide a Grid infrastructure that imposes as few requirements on both users and resources as possible. The overall goal is to ensure that a user is only required to have an X.509 certificate signed by a source trusted by MiG, and a web browser that supports HTTP, HTTPS and X.509 certificates. A fully functional resource only needs to create a local MiG user on the system and to support inbound SSH; a sandboxed resource, using the pull-based model, only needs outbound HTTPS [1]. Because MiG keeps the Grid system disjoint from both users and resources, as shown in figure 4, the Grid system appears as a centralized black box [11] to both users and resources. This allows all middleware upgrades and troubleshooting to be executed locally within the Grid without any intervention from either users or resource administrators. Thus, all functionality is placed in a physical Grid system that, though it appears centralized, is in reality distributed. The basic functionality of MiG starts with a user submitting a job to MiG and a resource sending a request for a job to execute. The resource then receives an appropriate job from MiG, executes the job, and sends the result to MiG, which

4. One has to install a boot loader to be able to boot from CD's.
[Figure 4. The abstract MiG model]
can then inform the user of the job completion. Since the user and the resource are never in direct contact, MiG provides full anonymity for both users and resources; any complaints have to be made to the MiG system, which can then inspect the logs that show the relationship between users and resources.

3.2.1. Scheduling. The centralized black-box design of MiG makes it capable of strong scheduling, which implies full control of the jobs being executed and of the resources executing them. Each job has an upper execution time limit, and when the execution time exceeds this limit the job is rescheduled to another resource. This makes the MiG system very well suited to host SGN resources, as they are by nature very dynamic and frequently join and leave the Grid without notifying the Grid middleware.
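The scheduling behaviour described above, architecture-matched pulls plus rescheduling of jobs that exceed their time limit, can be sketched as follows. This is a minimal sketch with a hypothetical API, not MiG's actual implementation:

```python
class Job:
    def __init__(self, job_id, arch, time_limit):
        self.job_id = job_id
        self.arch = arch              # only matching resources get this job
        self.time_limit = time_limit  # upper execution time limit (seconds)

class Scheduler:
    """MiG-style strong scheduling sketch: resources pull jobs matching
    their architecture, and jobs past their deadline are requeued."""
    def __init__(self):
        self.queue = []    # pending jobs
        self.running = []  # (job, deadline) pairs

    def submit(self, job):
        self.queue.append(job)

    def request_job(self, arch, now):
        # Pull model: the resource initiates the request and only
        # receives a job targeted at its architecture.
        for job in self.queue:
            if job.arch == arch:
                self.queue.remove(job)
                self.running.append((job, now + job.time_limit))
                return job
        return None

    def reschedule_expired(self, now):
        # Jobs whose time limit has passed go back on the queue, to be
        # handed to another resource on its next request.
        for entry in list(self.running):
            job, deadline = entry
            if now > deadline:
                self.running.remove(entry)
                self.queue.append(job)

sched = Scheduler()
sched.submit(Job("fold-1", "cell-ps3", time_limit=60))
job = sched.request_job("cell-ps3", now=0)   # a PS3 pulls a matching job
sched.reschedule_expired(now=120)            # limit exceeded -> requeued
```

Note that the scheduler never contacts a resource; rescheduling simply returns the job to the queue until some resource's next pull picks it up.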
4. The MiG PS3-LIVECD
The idea behind the LIVECD is to boot the PS3® by inserting a CD containing the Linux operating system and the appropriate Grid clients. Upon boot, the PS3® connects to the Grid and requests Grid jobs without any human interference. Several issues must be dealt with. First of all, the PS3® must not be harmed by flaws in the Grid middleware, nor by exploits through the middleware. Secondly, the Grid jobs may not harm the PS3®, neither by intention nor through faulty jobs. This is especially true for SGN resources, where an exploit may cause exposure of personal data.
4.1. Security
To keep faulty Grid middleware and jobs from harming the PS3®, both the NGN and SGN models use the operating system as a security layer. The Grid client software and the executed Grid jobs are both run as a dedicated user, who does not have administrative rights to the operating system. The MiG system logs all relations between jobs and resources, thus providing the possibility to track down any job.
4.2. Sandboxing
The SGN version of the LIVECD operates in a sandboxed environment to protect the donated PS3® from faulty middleware and jobs. This is done by excluding the device driver for the PS3® HDD controller from the Linux kernel
used, and keeping the execution environment in memory instead. Furthermore, support for loadable kernel modules is excluded, which prevents Grid jobs from loading modules into the kernel even if the OS is compromised and root access is achieved.
4.3. File access
Enabling file access for the Grid client and jobs without having access to the PS3®'s hard drive is done by using the graphics card's VRAM as a block device. Main memory is a limited resource⁵; therefore, using the VRAM as a block device is a great advantage compared to the alternative of using a RAM disk, which would decrease the amount of main memory available for Grid jobs. However, the total amount of VRAM is 252 MB, and therefore Grid jobs requiring input/output files larger than 252 MB are forced to use a remote file access framework [2].
4.4. Memory management
The PS3® has 6 SPE cores and a PPE core, all capable of accessing main memory at the same time through their MFC controllers. This results in a potential bottleneck in the TLB, as it in the worst case ends up thrashing, which is a known problem in multi-core processor architectures. TLB thrashing can be eliminated by adjusting the page size to fit the TLB, so that all pages have an entry in the TLB. This is called huge pages, as the page size grows significantly. The use of huge pages has several drawbacks, one of them being swapping: swapping a huge page in or out results in a longer execution halt, as a larger amount of data has to be moved between main memory and the hard drive. The Linux operating system implements huge pages as a memory-mapped file; this results in a static memory division between traditional pages and huge pages, using different memory allocators. The operating system and standard shared libraries use the traditional pages, which means the memory footprint of the operating system and the shared libraries has to be estimated in order to allocate the right amount of memory for the huge pages. In contrast to a cluster setup, where the execution environment and applications are customized to the specific cluster, this cannot be achieved in a Grid context⁶; therefore a generic way of addressing the memory is needed. Furthermore, future SPE programming libraries will most likely use the default memory allocator. This, together with the fact that no performance measurements clarifying the actual gain of using huge pages could be found, led to the decision to skip huge pages on the PS3-LIVECD. Finally, the authors believe that the number of applications that could gain a performance increase from huge pages is rather insignificant, as the majority of applications will be able to hide TLB misses by using double or multi-buffering, since memory transfers through the MFC are asynchronous.

5. The PS3® only has 224 MB of main memory for the OS and applications.
6. Especially in MiG, where the users and resources are anonymous to each other.
5. The execution environment
The PS3-LIVECD is based on the Gentoo Linux [9] PPC64 distribution with a customized kernel [5] capable of communicating with the PS3® hypervisor. Gentoo Catalyst [3] was used as the build environment; this provides the possibility of configuring exactly which packages to include on the LIVECD, as well as the possibility to apply a custom-made kernel and initrd script. The kernel was modified in several ways. Firstly, loadable module support was disabled to prevent potentially malicious jobs which manage to compromise the OS security from modifying the kernel modules. Secondly, the frame-buffer driver was modified to make the VRAM appear as a memory technology device (MTD), which means that the VRAM can be used as a block device. The modification of the frame-buffer driver also included freeing 18 MB of main memory occupied by the frame-buffer used in the default kernel⁷. The modified kernel ended up consuming 7176 kB of the total 229376 kB of main memory for code and internal data structures, leaving 222200 kB for the Grid client and jobs. Upon boot the modified initrd script detects the block device to be used as the root file system⁸ and formats the detected device with the ext2 filesystem, reserving 2580 kB for the superuser and leaving 251355 kB for the Grid client and jobs⁹. When the block device has been formatted, the initrd script sets up the root file system by copying writable directories and files from the CD to the root file system. Read-only directories, files, and binaries are left on the CD and linked symbolically into the root filesystem, keeping as much of the root filesystem as possible free for Grid jobs. The result is that the root file system only consumes 1.6 MB of the total space provided by the block device used. When the Linux system is booted, the LIVECD initiates the communication with MiG through HTTPS.
This is done by sending a unique key identifying the PS3® to the MiG system; if this is the first time the resource connects to the Grid, a new profile is created dynamically. The response to the initial request is the set of Grid resource client scripts, which are generated dynamically upon the request. By using this method it is guaranteed that the resource always has the newest version of the Grid resource client scripts, removing the need to download a new CD after a Grid middleware update. When the Grid resource client script is executed, the request for Grid jobs is initiated through HTTPS. Within that request a unique resource identifier is provided, giving the MiG scheduler the necessary information about the resource, such as architecture, memory, disc space and an upper time limit. Based on these parameters the MiG scheduler finds a job suited for the PS3® and places it in a job folder on the MiG system. From this location the PS3® is able to retrieve the job, consisting of

7. As the hypervisor isolates the GPU from the operating system, the display is operated by having the frame-buffer write the data to be displayed to an array in main memory, which is then copied to the GPU by the hypervisor.
8. The SGN version uses the VRAM; the NGN version uses the real hard drive provided through the hypervisor.
9. This is true for the SGN version; the NGN version uses the total disc space available, which is specified through the Sony GameOS.
job description files, input files, and executables. The location of these files is returned within the result of the job request, and is an HTTPS URL including a 32-character random string generated upon the job request and deleted when the job terminates. At job completion the result is delivered to the MiG system, which verifies (by the unique resource key) that the correct resource is delivering the result of the job. If it is a false delivery¹⁰ the result is discarded; otherwise it is accepted, and the PS3® resource requests a new job once the result of the previous one has been delivered.
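The 32-character random string embedded in the per-job HTTPS URL can be generated along these lines; the exact alphabet and generator used by MiG are assumptions, not taken from its source:

```python
import secrets
import string

def job_session_token(length=32):
    """Generate an unguessable random string like the 32-character
    token MiG embeds in the per-job URL (alphabet is an assumption)."""
    alphabet = string.ascii_letters + string.digits
    return "".join(secrets.choice(alphabet) for _ in range(length))

token = job_session_token()
```

Using a cryptographically strong generator matters here: the token is effectively the only protection of the per-job file location, and it is discarded as soon as the job terminates.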
6. Experiments
Testing the PS3® Grid-resource model was done by establishing a controlled test scenario consisting of a MiG Grid server and 8 PS3's. The experiments performed included a model overhead check, a file system benchmark using the VRAM as a block device, and application performance tests using a protein folding and a ray tracing application.

[Figure 5. The speedup achieved using the PS3-LIVECD for protein folding with 4 and 8 nodes]
6.1. Job overhead and file performance
The total overhead of the model was tested by submitting 1000 empty jobs to the Grid with only one PS3® connected. The 1000 jobs completed in 12366 seconds, which translates to an overhead of approximately 13 seconds per job. The performance of the VRAM used as a block device was tested by writing a 96 MB file sequentially. This was achieved in 1.5 seconds, resulting in a bandwidth of 64 MB/s. Reading the written file back took 9.6 seconds, resulting in a bandwidth of 10 MB/s. This shows that writing to the VRAM is faster than reading from it by a factor of approximately 6.5, which was an expected result, as VRAM is designed for writes from main memory to VRAM, not the other way around.

[Figure 6. The speedup achieved using the PS3-LIVECD for ray tracing with 4 and 8 nodes]
6.2. Protein folding
Protein folding is a compute-intensive algorithm for folding proteins. It requires a small input, generates a small output, and is embarrassingly parallel, which makes it very suitable for Grid computing. In this experiment, a protein of length 27 was folded on one PS3®, resulting in a total execution time of 57 minutes and 16 seconds. The search space was then divided into 17 different subspaces using standard divide-and-conquer techniques. The 17 different search spaces were then submitted as jobs to the Grid, which adds up to 4 jobs for each of the 4 nodes used in the experiment, plus one extra job to ensure unbalanced execution. Equivalently, the 17 jobs were distributed among 8 nodes, yielding 2 jobs per node plus one extra job. The execution finished in 18 minutes and 50 seconds using 4 nodes, giving a speedup of 3.04. The 8-node setup finished the execution in 10 minutes and 56 seconds, giving a speedup of 5.23; this is shown in figure 5. These results are considered quite useful in a Grid setup, as opposed to a cluster setup, where they would be considered bad.

10. The resource keys do not match, the time limit has been violated, or another resource is executing the job due to a rescheduling.
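The quoted speedups follow directly from the measured execution times:

```python
# Speedup figures for the protein folding experiment (section 6.2).
def seconds(minutes, secs):
    return minutes * 60 + secs

t1 = seconds(57, 16)  # single PS3
t4 = seconds(18, 50)  # 4 nodes, 17 jobs
t8 = seconds(10, 56)  # 8 nodes, 17 jobs

print(round(t1 / t4, 2))  # 3.04
print(round(t1 / t8, 2))  # 5.24 (the paper reports 5.23)
```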
6.3. Ray tracing
Ray tracing is compute intensive, requires a small amount of input, and generates a large amount of output. This experiment uses a ray tracing code written by Eric Rollins [10], modified from a real-time ray tracer into one which writes the rendered frames to files at a resolution of 1920x1080 (Full HD). The final images are JPEG compressed to reduce the size of the output. A total of 5000 frames were rendered in 78 minutes and 6 seconds on a single PS3®; the search space was then divided into 25 equally large subspaces. These were submitted as jobs to the Grid, resulting in a total of 25 jobs, which adds up to 6 jobs per node plus one extra in the 4-node setup, and 3 jobs per node plus one extra in the 8-node setup. The execution time using 4 nodes was 32 minutes and 23 seconds, giving a speedup of 2.41, and the execution time using 8 nodes was 25 minutes and 12 seconds, giving a speedup of 3.09; this is sketched in figure 6. While the speedup achieved with 4 nodes is quite useful in a Grid context, the speedup gained using 8 nodes is quite disappointing. The authors
believe this is due to network congestion when the rendered frames are sent to the MiG storage upon job termination.
7. Conclusion
In this work we have demonstrated a way to use the Sony Playstation 3 as a Grid computing device without the need to install any client software on the PS3®. The use of the Linux operating system provides a native execution environment suitable for the majority of scientific applications; the advantage of this is that existing Cell applications can be executed without any modifications. A sandboxed version of the execution environment has been presented which denies access to the hard drive of the PS3®. The advantage of this is that donated PS3's cannot be compromised by faulty or malicious jobs; the disadvantage is the lack of file access, which is solved by using the VRAM of the PS3® as a block device. The Minimum intrusion Grid supports the required pull-job model for retrieving and executing Grid jobs on a resource located behind a firewall, without the need to open any incoming ports. By using the PS3-LIVECD approach, any PS3® connected to the Internet can become a Grid resource by booting it with the LIVECD. When a Grid-connected PS3® is shut down, the MiG system detects this event by a timeout and resubmits the job to another resource. Experiments show that the ray tracing application does not scale well, due to the large amount of output data resulting in network congestion problems. In contrast, a considerable speedup is reached when folding proteins, despite the model overhead of 13 seconds applied to each job.
References
[1] Rasmus Andersen and Brian Vinter. Harvesting idle windows cpu cycles for grid computing. In Hamid R. Arabnia, editor, GCA, pages 121–126. CSREA Press, 2006.
[2] Rasmus Andersen and Brian Vinter. Transparent remote file access in the minimum intrusion grid. In WETICE '05: Proceedings of the 14th IEEE International Workshops on Enabling Technologies: Infrastructure for Collaborative Enterprise, pages 311–318, Washington, DC, USA, 2005. IEEE Computer Society.
[3] Gentoo Catalyst. http://www.gentoo.org/proj/en/releng/catalyst.
[4] Thomas Chen, Ram Raghavan, Jason Dale, and Eiji Iwata. Cell broadband engine architecture and its first implementation. IBM developerWorks, 2005. http://www.ibm.com/developerworks/power/library/pa-cellperf.
[5] PS3 Linux extensions. ftp://ftp.uk.linux.org/pub/linux/Sony-PS3.
[6] Folding@home. http://folding.stanford.edu.
[7] Ian Foster. The grid: A new infrastructure for 21st century science. Physics Today, 55(2):42–47, 2002.
[8] Mohammad Jowkar. Exploring the Potential of the Cell Processor for High Performance Computing. Master's thesis, University of Copenhagen, Denmark, August 2007.
[9] Gentoo Linux. http://www.gentoo.org.
[10] Eric Rollins. Ray tracer. http://eric rollins.home.mindspring.com/ray/ray.html.
[11] Brian Vinter. The Architecture of the Minimum intrusion Grid (MiG). In Communicating Process Architectures 2005, September 2005.
96
Numerical Computational Solution of Fredholm Integral Equations of the Second Kind by Using Multiwavelets

K. Maleknejad (a), T. Lotfi (b), and K. Nouri (a)
(a) Department of Mathematics, Iran University of Science and Technology, Narmak, Tehran 1684613114, Iran
(b) Department of Mathematics, I. A. U. H. (Hamadan Unit), Hamadan, Iran

Abstract: The main purpose of this paper is to develop a multiwavelet Galerkin method for obtaining the numerical solution of Fredholm integral equations of the second kind. We use a class of multiwavelets that constitutes a basis for L^2(R) and leads to sparse matrices with high precision in numerical methods such as the Galerkin method. Because multiwavelets are able to offer a combination of orthogonality, symmetry, high order of approximation and short support, methods using multiwavelets frequently outperform those using comparable scalar wavelets. Since spline bases have maximal approximation order with respect to their length, we use a family of spline multiwavelets that are symmetric and orthogonal as our basis. Finally, numerical examples show that our estimates have a good degree of accuracy.
Keywords: Integral Equation, Multiwavelet, Galerkin System, Orthogonal Bases

1 Introduction

This section provides an overview of the topics needed in this paper. The use of wavelet-based algorithms in numerical analysis is superficially similar to other transform methods: instead of representing a vector or an operator in the usual way, it is expanded in a wavelet basis, or its matrix representation is computed in that basis.

1.1 Multiwavelets

The multiwavelet is more general than the scalar wavelet. The recursion coefficients are now matrices, the symbols are trigonometric matrix polynomials, and so on. This change is responsible for most of the extra complication. We now consider a dilation factor of m rather than 2. The multiscaling function is still φ; the multiwavelets are ψ^(1), ..., ψ^(m−1). Likewise, the recursion coefficients are H_k and G^(1), ..., G^(m−1), and so on.

Definition 1. A refinable function vector is a vector-valued function φ(x) = (φ_1(x), ..., φ_r(x))^T, with φ_i : R → C, which satisfies a two-scale matrix refinement equation of the form

    φ(x) = √m Σ_{k=k_0}^{k_1} H_k φ(mx − k),   k ∈ Z.   (1)

r is called the multiplicity of φ; the integer m ≥ 2 is the dilation factor. The recursion coefficients H_k are r × r matrices.

2 Construction of Multiwavelet Bases

We begin with the construction of a class of bases for L^2[0, 1]. The class is indexed by p ∈ Z^+, which denotes the number of vanishing moments of the basis functions; we say a basis {b_1, b_2, b_3, ...} from this class is of order p if

    ∫_0^1 b_i(x) x^j dx = 0,   j = 0, ..., p − 1,

for each b_i with i > p.
2.1 Multiwavelet Bases for L^2[0, 1]

We employ the multiresolution analysis framework of Keinert [1]. For m = 0, 1, 2, ... and i = 0, 1, ..., 2^m − 1, we define a half-open interval I_{m,i} ⊂ [0, 1) by

    I_{m,i} = [2^{−m} i, 2^{−m}(i + 1)).   (2)

For a fixed m, the dyadic intervals I_{m,i} are disjoint and their union is [0, 1); also I_{m,i} = I_{m+1,2i} ∪ I_{m+1,2i+1}. Now we suppose that p ∈ Z^+ and, for m = 0, 1, ... and i = 0, 1, ..., 2^m − 1, we define a space V^p_{m,i} of piecewise polynomial functions,

    V^p_{m,i} = {f | f : R → R, f = P_p χ_{I_{m,i}}},   (3)

where P_p denotes a polynomial of degree less than p, and we define

    V^p_m = V^p_{m,0} ⊕ V^p_{m,1} ⊕ V^p_{m,2} ⊕ ... ⊕ V^p_{m,2^m−1}.

It is apparent that for each m and i the space V^p_{m,i} has dimension p, the space V^p_m has dimension 2^m p, and

    V^p_{m,i} ⊂ V^p_{m+1,2i} ⊕ V^p_{m+1,2i+1};

thus V^p_0 ⊂ V^p_1 ⊂ ... ⊂ V^p_m ⊂ .... For m = 0, 1, 2, ... and i = 0, 1, ..., 2^m − 1, we define the p-dimensional space W^p_{m,i} to be the orthogonal complement of V^p_{m,i} in V^p_{m+1,2i} ⊕ V^p_{m+1,2i+1},

    V^p_{m,i} ⊕ W^p_{m,i} = V^p_{m+1,2i} ⊕ V^p_{m+1,2i+1},   W^p_{m,i} ⊥ V^p_{m,i},

and we define

    W^p_m = W^p_{m,0} ⊕ W^p_{m,1} ⊕ W^p_{m,2} ⊕ ... ⊕ W^p_{m,2^m−1}.

Now we have V^p_m ⊕ W^p_m = V^p_{m+1}, so we inductively obtain the decomposition

    V^p_m = V^p_0 ⊕ W^p_0 ⊕ W^p_1 ⊕ ... ⊕ W^p_{m−1}.   (4)

Suppose that the functions ψ_1, ψ_2, ..., ψ_p : R → R form an orthogonal basis for W^p_0. Since W^p_0 is orthogonal to V^p_0, the first p moments of ψ_1, ..., ψ_p vanish,

    ∫_0^1 ψ_i(x) x^j dx = 0,   j = 0, 1, ..., p − 1.

The space W^p_{m,i} then has an orthogonal basis consisting of the p functions ψ_1(2^m x − i), ..., ψ_p(2^m x − i), which are non-zero only on the interval I_{m,i}; furthermore, each of these functions has p vanishing moments. Introducing the notation

    ψ^j_{m,i}(x) = ψ_j(2^m x − i),   x ∈ R,

for j = 1, ..., p, m = 0, 1, 2, ..., and i = 0, 1, ..., 2^m − 1, we obtain from decomposition (4) the formula

    V^p_m = V^p_0 ⊕ linear span{ψ^j_{m′,i} : j = 1, ..., p; m′ = 0, 1, ..., m − 1; i = 0, 1, ..., 2^{m′} − 1}.   (5)

An explicit construction of ψ_1, ..., ψ_p is given in Walter and Shen [3]. We define the space V^p to be the union of the V^p_m,

    V^p = ∪_{m=0}^∞ V^p_m,   (6)

and observe that the closure satisfies V̄^p = L^2[0, 1]. In particular, V^1 contains the Haar basis for L^2[0, 1], which consists of functions piecewise constant on each of the intervals I_{m,i}. Here the closure V̄^p is defined with respect to the L^2-norm. We let {φ_1, ..., φ_p} denote any orthogonal basis for V^p_0; in view of (5) and (6), the orthogonal system

    B_p = {φ_j}_j ∪ {ψ^j_{m,i}}_{i,j,m}

spans L^2[0, 1]; we refer to B_p as the multiwavelet basis of order p for L^2[0, 1]. In Resnikoff and Wells [4] it is shown that B_p may be readily generalized to bases for L^2(R), L^2(R^d), and L^2([0, 1]^d).
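The order p = 1 case mentioned above (the Haar basis) is easy to check numerically. The following sketch is illustrative and not from the paper; `haar_psi` and the midpoint-rule inner product are hypothetical helper names. It verifies that the two level-1 Haar functions ψ^1_{1,0} and ψ^1_{1,1} are orthonormal and have the single vanishing moment an order-1 basis requires.

```python
def haar_psi(m, i):
    # psi^1_{m,i}(x) = 2^(m/2) * psi(2^m x - i) for the Haar wavelet psi
    # (+1 on [0, 1/2), -1 on [1/2, 1)); this is the p = 1 multiwavelet case.
    def psi(x):
        y = (2 ** m) * x - i
        if 0 <= y < 0.5:
            return 2 ** (m / 2)
        if 0.5 <= y < 1:
            return -(2 ** (m / 2))
        return 0.0
    return psi

def inner(f, g, n=4096):
    # midpoint-rule approximation of the L2[0,1] inner product <f, g>
    h = 1.0 / n
    return h * sum(f((k + 0.5) * h) * g((k + 0.5) * h) for k in range(n))

p10 = haar_psi(1, 0)   # supported on I_{1,0} = [0, 1/2)
p11 = haar_psi(1, 1)   # supported on I_{1,1} = [1/2, 1)
```

On this grid the midpoint nodes never hit the discontinuities, so `inner(p10, p10)` equals 1 up to rounding, while `inner(p10, p11)` and the zeroth moment `inner(p10, lambda x: 1.0)` vanish.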
3 Second Kind Integral Equations

The matrix representations of integral operators in multiwavelet bases are sparse. We begin this section by introducing some notation for integral equations. A linear Fredholm integral equation of the second kind is an expression of the form

    f(x) = g(x) + ∫_a^b K(x, t) f(t) dt,   (7)

where we assume that the kernel K is in L^2[a, b]^2 and the unknown f and the function g are in L^2[a, b]. For simplicity, we let [a, b] = [0, 1] and define

    (Kf)(x) = ∫_0^1 K(x, t) f(t) dt,   f ∈ L^2[0, 1], x ∈ [0, 1].

Suppose that {θ_i}_{i=1}^∞ is an orthonormal basis for L^2[0, 1]; the expansion of K in this basis is given by the formula

    K(x, t) = Σ_{i=1}^∞ Σ_{j=1}^∞ K_ij θ_i(x) θ_j(t),   (8)

where the coefficient K_ij is given by the expression

    K_ij = ∫_0^1 ∫_0^1 K(x, t) θ_i(x) θ_j(t) dx dt,   i, j = 1, 2, ....   (9)

Similarly, the functions f and g have expansions

    f(x) = Σ_{i=1}^∞ f_i θ_i(x),   g(x) = Σ_{i=1}^∞ g_i θ_i(x),

where the coefficients f_i and g_i are given by

    f_i = <f, θ_i> = ∫_0^1 f(x) θ_i(x) dx,   g_i = <g, θ_i> = ∫_0^1 g(x) θ_i(x) dx,   i = 1, 2, ....

With this notation, the integral equation (7) can be written as an infinite system of equations

    f_i − Σ_{j=1}^∞ K_ij f_j = g_i,   i = 1, 2, ....

We can truncate the expansion for K at a finite number of terms and denote the resulting integral operator by T:

    (Tf)(x) = ∫_0^1 ( Σ_{i=1}^n Σ_{j=1}^n K_ij θ_i(x) θ_j(t) ) f(t) dt,

which approximates K. Therefore the integral equation (7) can be approximated by the system

    f_i − Σ_{j=1}^n K_ij f_j = g_i,   i = 1, ..., n,   (10)

which is a linear system of n equations in n unknowns f_i. Equations (10) may be solved numerically for an approximate solution of equation (7). In this case we have the approximate solution

    f_T(x) = Σ_{i=1}^n f_i θ_i(x).

Now we estimate the error e_T = f − f_T, following the derivation of Delves and Mohamed [5]. Let g_T(x) = Σ_{i=1}^n g_i θ_i(x). Rewriting equations (7) and (10) in terms of the operators K and T, we have

    (I − K)f = g,   (I − T)f_T = g_T,

and therefore

    (I − K)e_T = (K − T)f_T + (g − g_T).

Provided that (I − K)^{−1} exists, we obtain the error bound

    ||e_T|| ≤ ||(I − K)^{−1}|| · ||(K − T)f_T + (g − g_T)||.   (11)

4 Numerical Performance

To show the efficiency of the numerical method, we consider the following examples. Following Delves and Mohamed [5], we measure the error

    ||e_N|| = ( ∫_0^1 e_N^2(t) dt )^{1/2} ≈ ( (1/N) Σ_{i=0}^N e^2(s_i) )^{1/2},

where

    e(s_i) = x(s_i) − x_N(s_i),   i = 0, 1, ..., N,

and x_N(s_i) and x(s_i) are, respectively, the approximate and exact solutions of the integral equation.
4.1 Examples

Example 1. x(s) = sin s − s + ∫_0^{π/2} s t x(t) dt, with exact solution x(s) = sin s.

Example 2. x(s) = e^s − (e^{s+1} − 1)/(s + 1) + ∫_0^1 e^{st} x(t) dt, with exact solution x(s) = e^s.

Example 3. x(s) = s + ∫_0^1 K(s, t) x(t) dt, where K(s, t) = s for s ≤ t and K(s, t) = t for s ≥ t, with exact solution x(s) = (sec 1) sin s.

The following table shows the computed error ||e_N|| for the examples above.
Table 1: Errors ||e_N|| at m = 6 for the multiwavelet method

    N   Example 1      Example 2      Example 3
    2   5.2 x 10^-2    3.3 x 10^-2    3.5 x 10^-2
    3   5.5 x 10^-3    7.2 x 10^-3    1.5 x 10^-3
    4   4.6 x 10^-6    9.8 x 10^-5    8.5 x 10^-4
    5   5.3 x 10^-9    3.2 x 10^-7    2.8 x 10^-7
    6   3.6 x 10^-12   8.9 x 10^-10   1.0 x 10^-9
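To make the truncated system (10) concrete, the sketch below solves Example 1 with a plain midpoint (piecewise-constant) discretisation and dense Gaussian elimination, instead of the paper's spline multiwavelet Galerkin basis; `solve_fredholm` and the grid size n = 100 are illustrative choices, and no sparsity is exploited.

```python
import math

def solve_fredholm(K, g, a, b, n):
    # Midpoint discretisation of f(x) = g(x) + int_a^b K(x,t) f(t) dt,
    # i.e. the truncated system f_i - sum_j K_ij f_j = g_i of eq. (10),
    # solved densely by Gaussian elimination with partial pivoting.
    h = (b - a) / n
    s = [a + (k + 0.5) * h for k in range(n)]
    A = [[(1.0 if i == j else 0.0) - h * K(s[i], s[j]) for j in range(n)]
         for i in range(n)]
    rhs = [g(x) for x in s]
    for col in range(n):                      # forward elimination
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        rhs[col], rhs[piv] = rhs[piv], rhs[col]
        for r in range(col + 1, n):
            factor = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= factor * A[col][c]
            rhs[r] -= factor * rhs[col]
    f = [0.0] * n
    for r in range(n - 1, -1, -1):            # back substitution
        f[r] = (rhs[r] - sum(A[r][c] * f[c] for c in range(r + 1, n))) / A[r][r]
    return s, f

# Example 1: x(s) = sin s - s + int_0^{pi/2} s t x(t) dt, exact solution sin s.
s, f = solve_fredholm(lambda x, t: x * t,
                      lambda x: math.sin(x) - x,
                      0.0, math.pi / 2, 100)
max_err = max(abs(fi - math.sin(si)) for si, fi in zip(s, f))
```

On this grid the computed solution agrees with sin s to well under one percent — far coarser than the multiwavelet figures in Table 1, but enough to see the system (10) at work.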
5 Conclusion

The main advantage of multiwavelets over scalar wavelets in numerical methods lies in their short support, which makes boundaries much easier to handle. If symmetric/antisymmetric multiwavelets are used, it is even possible to use only the antisymmetric components of the boundary function vector for problems with zero boundary conditions. The characteristics of the multiwavelet bases which lead to a sparse matrix representation are that:
1. The basis functions are orthogonal to low-order polynomials (they have vanishing moments).
2. Most basis functions have a small interval of support.
References
[1] F. Keinert, "Wavelets and Multiwavelets." Chapman and Hall/CRC, 2004.
[2] C.S. Burrus, R.A. Gopinath, H. Guo, "Introduction to Wavelets and Wavelet Transforms." Prentice Hall, 1998.
[3] G.G. Walter, X. Shen, "Wavelets and Other Orthogonal Systems." Chapman and Hall/CRC, Second Edition, 2001.
[4] H.L. Resnikoff, R.O. Wells, "Wavelet Analysis." Springer, 1998.
[5] L.M. Delves, J.L. Mohamed, "Computational Methods for Integral Equations." Cambridge University Press, Cambridge, 1985.
[6] I. Daubechies, "Ten Lectures on Wavelets." SIAM, Philadelphia, PA, 1992.
[7] I. Daubechies, "Orthonormal Bases of Compactly Supported Wavelets II, Variations on a Theme." SIAM J. Math. Anal., 24(2), pp. 499-519, 1993.
[8] K. Maleknejad, S. Rahbar, "Numerical Solution of Fredholm Integral Equations of the Second Kind by Using B-Spline Functions." Int. J. Eng. Sci., 13(5), pp. 9-17, 2000.
[9] K. Maleknejad, F. Mirzaee, "Using Rationalized Haar Wavelet for Solving Linear Integral Equations." Applied Mathematics and Computation (AMC), 160(2), pp. 579-587, 2005.
[10] K. Maleknejad, H. Mesgarani, T. Nikzad, "Wavelet-Galerkin Solution for Fredholm Integral Equations of the Second Kind." Int. J. Eng. Sci., 13(5), pp. 75-80, 2002.
A Grid-based Context-aware Recommender System for Mobile Healthcare Applications

Mohammad Mehedi Hassan (1), Ki-Moon Choi (2), Seungmin Han (2), Youngsong Mun (3) and Eui-Nam Huh (1)
(1, 2) Department of Computer Engineering, Kyung Hee University, Global Campus, South Korea
(3) Department of Computer Engineering, Soongsil University, South Korea

Abstract - In recent years, with their small form factor and ubiquitous connectivity, mobile devices such as smart phones and PDAs offer interesting opportunities for novel services and applications. In this paper, we propose a context-aware doctor recommender system called CONDOR, which recommends suitable doctors for a patient or user at the right time in the right place, based on his/her preferences and current context (location, time, weather, distance, etc.). Existing centralized recommender systems (CRSs) cannot resolve the contradiction between good recommendation quality and the timely response that is essential in mobile healthcare. CRSs are also prone to single points of failure and vulnerable to privacy and security threats. We therefore propose a framework that integrates Grid technology with a context-aware recommender system to alleviate these problems. We also present the construction process of the context-aware recommendation, as well as the performance of our architecture compared to existing CRSs.

Keywords: Context-aware recommender system, Grid, Mobile healthcare.
1 Introduction
In recent years, mobile computing, in which users equipped with small and portable devices such as mobile phones, PDAs or laptops are free to move while staying connected to service networks, has proved to be a true revolution [1]. Applications have begun to be developed for these devices to offer online services to people whenever and wherever they are. One of the most popular tools provided in e-commerce to match user shopping needs with vendor offers is the recommender system (RS) [2]. However, it is becoming clear that the use of mobile technologies will become quite pervasive in our lives and that we need to support the development of applications in different areas. In particular, we have recently been involved in the development of a context-aware recommender system in a mobile healthcare setting. A patient or user carrying a mobile phone or PDA may move to places he/she has never been before and may face difficulties finding good doctors in those unknown places for emergency healthcare. Therefore, in this research, we propose a context-aware doctor recommender system called CONDOR (CONtext-aware DOctor Recommender), to recommend suitable
doctors for a patient or user at the right time in the right place in a mobile computing environment. This time-critical recommendation requires system architectures that provide support infrastructure for wireless connectivity, network security and parallel processing of multiple sources of information. Moreover, unlike stationary desktop machines (PCs), mobile devices (smart phones, PDAs) are constrained by their shape, size and weight. Due to their limited size, these devices tend to be extremely resource-constrained in terms of processing power, available memory, battery capacity and screen size, among others [3]. These portable devices need to access various distributed computing powers and data repositories to support intelligent deployment of the proposed RS. Furthermore, most of today's RSs are centralized ones, which are suitable for single websites but not for large-scale distributed recommendation applications. Centralized recommender systems (CRSs) cannot resolve the contradiction between good recommendation quality and timely response. In terms of performance, centralized architectures are prone to single points of failure and cannot ensure the low latency and high reliability that are essential in mobile healthcare. CRSs are also vulnerable to privacy and security threats [4]. In this paper, we propose an architecture combining a context-aware recommender system with Grid technology for mobile healthcare service; there has been very little research that integrates recommender systems with the Grid. Traditional recommendation mechanisms such as collaborative and content-based approaches are not suitable in this environment. We therefore present a recommendation mechanism that analyzes a user's demographic profile, the user's current context information (i.e., location, time, and weather), doctors' information (i.e., education, availability, location, reputation, visit fee, etc.)
and the user's position, so that doctor information can be ranked according to the match with the user's preferences. The performance of our architecture is evaluated and compared to existing CRSs. The paper is structured as follows: Section 2 briefly reviews related work. Section 3 presents the proposed system architecture. Section 4 describes the recommendation process. Section 5 shows the analytical performance of our architecture, and finally Section 6 concludes the paper.
2 Related Works

2.1 Decentralized Recommender Systems
Modern distributed technologies need to be incorporated into recommender systems to realize distributed recommendation. Chuan-Feng Chiu et al. proposed a mechanism for community recommendation based on a generic agent framework designed on the peer-to-peer (P2P) computing architecture [5]. Peng Han et al. proposed a distributed hash table (DHT) based technique to implement efficient user database management and retrieval in a decentralized collaborative filtering system [6]. Pouwelse et al. proposed in [7] a P2P recommender system capable of social content discovery, called TRIBLER, which uses an algorithm based on an epidemic protocol. However, the above systems did not consider context-awareness, which is very important in mobile computing. Also, in healthcare services, parallel processing of multiple sources of information is very important. Today, Grid computing promises access to vast computing and data resources across geographically dispersed areas. This capability is significantly enhanced by establishing support for mobile wireless devices to access and perform on-demand service delivery from the Grid [8]. Integrating a recommender system with the Grid can enable portable devices (mobile phones, PDAs) to perform complex reasoning and computation efficiently over various context information, exploiting the capabilities of distributed resource integration (computing resources, distributed databases, etc.). The authors in [4] proposed a knowledge-Grid-based intelligent electronic commerce recommender system called KGBIECRS, in which the recommendation task is defined as a knowledge-based workflow and the knowledge grid is exploited as the platform for knowledge sharing and knowledge services; however, context-awareness for mobility support is not considered there either. We propose a framework combining a context-aware recommender system with Grid technology in a mobile environment.
2.2 Context-aware Recommender Systems
Context is any information that can be used to characterize the situation of an entity. An entity is any person, place or object that is considered relevant to the interaction between a user and an application, including the user and the application themselves [9]. Examples of contextual information are location, time, proximity, user status and network capabilities. The key goal of context-aware systems is to provide a user with relevant information and/or services based on his current context. There is much research in the literature on using context-aware recommender systems in different application areas such as travel, shopping, movies and music [10-13]. All of these RSs are centralized; thus they are prone to single points of failure and vulnerable to security threats.
Existing popular recommendation mechanisms, such as collaborative, content-based and hybrid approaches, cannot be used in our application area as they cannot handle both the user's situation and personalization at the same time. We therefore propose an efficient recommendation mechanism that effectively handles the user's current context information and recommends appropriate doctors in both normal and emergency conditions.
3 Proposed System Architecture
To provide recommendations, the proposed CONDOR system has the following functional requirements: (i) appropriate data preparation/collection, and (ii) creation of a personalized recommendation method.

Data preparation/collection includes the following:
a) User demographic information: age, gender, income range, car ownership, health insurance, etc.
b) User context information: current location, distance, time and weather information.
c) User preference information: the user's tendency to select a certain doctor over others.
d) Doctor information: specialty, board certification, price, service hours, etc.

The system has access to different hospital and healthcare databases to collect doctor information. Doctors can also register their information with the system; they are encouraged to provide their information to attract more patients and to improve their healthcare quality according to user feedback. Users can likewise register good doctors in their area with the system.

The overall CONDOR architecture is shown in Figure 1. It consists of a user interface, middleware and data services. The major components of our architecture are as follows:
a) Web Portal: the web interface for the user, accessible from the Internet browser of the user's mobile phone or PDA. It also provides a Grid Security Infrastructure (GSI) interface for preventing unauthorized or malicious access. Users/patients and doctors register their profiles or information through this web portal; users can also submit doctor information and their recommendation requests (queries) through it, as shown in Figure 2.
b) Context Manager: retrieves information about the user's current context by contacting the appropriate context information services (see Figure 1) and sends the context information to the recommendation web service.
The location information is collected by two major positioning technologies, the Global Positioning System (GPS) and the Mobile Positioning System (MPS). The distance is the Euclidean distance between the user location and the doctor location. Time and day information is provided by the computer system, and the weather information is obtained from the weather bureau website.
c) OGSA-DAI: OGSA-DAI (Open Grid Services Architecture - Data Access and Integration) [11] is an extensible framework for data access and integration. It exposes heterogeneous data resources to be accessed via stateful web services. No additional code is required to connect to a database or for
Figure 1: Our Proposed CONDOR System Architecture
Patient/User Id
Doctor Specialty: Cardiology, medicine, etc.
Visit Fee Range: Average, Low, High
Condition: Normal/Emergency
Parking Area: Yes/No

Figure 2: User query interface in the mobile browser
querying data. OGSA-DAI supports an interface integrating various databases, such as XML databases and relational databases, and provides three basic activities: querying data, transforming data, and delivering the results using FTP, e-mail, etc. All information regarding users, doctors, hospitals, recommendation results, doctors' ratings, etc. is saved through OGSA-DAI to the distributed databases.
d) Recommendation Generation Web Service (RGWS): generates the recommendation using our recommendation technique; it usually returns the top five doctors suitable in the user's current context.
e) Map Service: displays the selected doctor's location map on the user's mobile phone browser.

The workflow of our architecture is as follows:
(1) When a user or patient needs the recommendation service, a recommendation request is sent to the system from the user's mobile phone web browser.
(2) The recommendation web service is then invoked from the web service factory for that user. It collects the necessary information, such as the user's profile data and the current doctors' information at that location through OGSA-DAI, and the user's current context information from the context manager.
(3) The service broker then schedules the recommendation service to run on different machines; a list of recommended doctors is generated using our recommendation algorithm and passed to the user's mobile phone browser through the web portal.
(4) When the user selects a doctor, the doctor's location map is displayed through the map service on the mobile phone's web browser.
(5) The user is also asked to rate doctors he/she has already visited, so that the system can produce better recommendations.
4 Recommendation Process

4.1 Identification of Appropriate Doctors Using a Bayesian Network

To identify appropriate doctors for an individual mobile patient or user in any location based on his/her current context, effective representations of the relationships among several kinds of information are required, as shown in Figure 3. Because many uncertainties exist among the different kinds of information and their relationships, a probabilistic approach such as a Bayesian network (BN) can be utilized. Bayesian networks, which constitute a probabilistic framework for reasoning under uncertainty, have in recent years been representative models for context inference [14]. With a Bayesian network, we can formulate a user/patient's interest in a doctor in his/her current situation in the form of a joint probability distribution.

Figure 3: A simple recommendation space (user demographic information, user context information and doctor information, with doctors 1 and 2 at distances 1 and 2 from the user)

For constructing the structure of the Bayesian network, we need knowledge of the user's preferences in choosing a good doctor in any location. A survey by the US National Institutes of Health [15] showed that board certification, rating (reputation), type of insurance accepted, location (distance), visit fee, service hours and the existence of lab test facilities are, in descending order, the most important factors to a user/patient in choosing a new doctor. A car parking facility at the doctor's location is also required by users. Therefore, we design the BN structure considering this user preference information, as shown in Figure 4. From the BN structure we can easily calculate the probability Interest(u, d) - the interest of user u in a doctor d - that is,

    Interest(u, d) = p(interest | user_age, user_gender, doctor_specialty, user_current_location, time, user_income)   (1)

Figure 4: A BN structure for finding Interest(u, d)

Bayesian networks built by an expert cannot reflect a change of environment. To overcome this problem, we apply a parameter learning technique. Based on collected data, CPTs (Conditional Probability Tables) are learned using the EM (Expectation Maximization) algorithm, as follows. Let V denote the variable set {x_1, x_2, ..., x_n} and S the Bayesian network structure. The values of each variable x_i in structure S are {x_i^1, x_i^2, ..., x_i^r}, D = {C_1, C_2, ..., C_m} is the sample data set, and π_i is the parent set of x_i. Then p(x_i^k | π_i^j) denotes the probability that x_i takes its k-th value while π_i takes its j-th configuration; we write it as θ_ijk. The purpose of Bayesian network parameter learning is to evaluate the conditional probability density p(θ | D, S) using prior knowledge when the network structure S and the sample set D are given. The main task in the EM algorithm is calculating the conditional probability p(x_i, π_i | D_l, θ^(t)) for the data set D and all variables x_i. When the data set D is given, the log likelihood is

    l(θ | D) = Σ_l ln p(D_l | θ) = Σ_{ijk} f(x_i^k, π_i^j) ln θ_ijk,   (2)

where f(x_i^k, π_i^j) denotes the count in the data set of cases with x_i = k and π_i = j. The maximum-likelihood θ can be obtained by

    θ_ijk = f(x_i^k, π_i^j) / Σ_k f(x_i^k, π_i^j).   (3)

The EM algorithm initializes an estimate θ^(0) and improves it iteratively. There are two steps from the current θ^(t) to the next θ^(t+1): Expectation and Maximization. The Expectation step calculates the current expectation of the log likelihood when D is given:

    l(θ | θ^(t)) = Σ_l Σ_{x_i} ln p(D_l, X_l | θ) p(X_l | D_l, θ^(t)).   (4)

For all θ, we have l(θ | θ^(t+1)) ≥ l(θ | θ^(t)). From equation (2),

    l(θ | θ^(t)) = Σ_{i,j,k} f_t(x_i^k, π_i^j) ln θ_ijk,   where   f_t(x_i, π_i) = Σ_l p(x_i, π_i | D_l, θ^(t)).   (5)

The Maximization step chooses the next θ^(t+1) by maximizing the expectation of the current log likelihood:

    θ_ijk^(t+1) = arg max E[P(D | θ) | D′, θ^(t), S] = f_t(x_i^k, π_i^j) / Σ_k f_t(x_i^k, π_i^j).   (6)

Equation (5) calculates the Expectation and equation (6) the Maximization. When this EM algorithm converges slowly, we use the improved E and M steps of the procedure in [16].

4.2 Calculation of the Final Ranking Score Considering User Sensitivity to Location
The CONDOR system takes the user's sensitivity to location into account. We posit that the likelihood of a doctor being visited by a user/patient depends not only on the user's interest in the doctor but also on the distance between them, measured as the Euclidean distance between the user location and the doctor location. Usually a user is most likely to choose a doctor with the highest similarity and minimum distance; in an emergency, distance gets more priority, and the user will choose a doctor with minimum distance and moderate similarity. So we consider a distance weight variable (DWV)
for measuring the user’s sensitivity to distance. DWV for a user u with respect to different doctors d i is calculated as follows: distancemax (u, d ) ] distance (u, di ) +1 DWV (u.di ) = log 2 distance max (u , d ) log 2 [
(7)
where i = 1, 2, 3……n (No. of doctors in the similarity list) distance max (u, d ) = The maximum or farthest distance of a doctor location from user current location among the doctors in the preferred list. distance (u, di ) = The distance between the user and any doctor. In the formula (7), DWV will be reduced when distance (u, di ) increases and DWV is also normalized. I.e., if distance (u, di ) = distance max (u, d ) , then DWV = 0, if distance (u, di ) = 0, then DWV = 1.Therefore, the final score
for any doctor di for a user u in user’s present position with current user context is calculated as follows:
Score (u, di ) = W1 ∗ Interest (u, di ) + W2 ∗ DWV (u, di ) ………. (8) where W1 and W2 are two weighting factors ( W1 + W2 = 1; 0 ≤ W1 ≤ 1; 0 ≤ W2 ≤ 1 ) reflecting the relative importance of similarity/interest (u, d) and distance. Based on the highest score, the doctors will be ranked and recommended to that particular user/patient. Initially, W1 = 0.5 and W2 = 0.5 if we consider equal importance of interest value and distance. If emergency situation arrives, W1 = 0.1 and W2 = 0.9. Figure 5 illustrates the scenario of the recommendation process. Doctors’ Information
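As a sanity check, the DWV values reported in Figure 6 for the case study (doctors at 2, 8, 5 and 3 km) can be reproduced from this reading of equation (7). The sketch below is illustrative, not the authors' code; clamping the value at zero for the farthest doctor is our assumption, inferred from the reported value of 0 rather than stated in the formula.

```python
import math

def dwv(distance, d_max):
    # Distance weight variable, a reading of eq. (7):
    #   DWV = log2(d_max / (distance + 1)) / log2(d_max),
    # clamped at 0 so the farthest doctor gets DWV = 0 (assumption, cf. Figure 6).
    return max(math.log2(d_max / (distance + 1)) / math.log2(d_max), 0.0)

# Doctors at 2, 8, 5 and 3 km, as in the case study of Section 5:
weights = {d: round(dwv(d, 8), 2) for d in (2, 8, 5, 3)}
```

This reproduces the Figure 6 values 0.47, 0, 0.14 and 0.33 for distances 2, 8, 5 and 3 km respectively.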
5 Evaluation

5.1 Effectiveness of the Recommendation Algorithm
In order to evaluate the effectiveness of the recommendation process, some case studies are presented in this section. We created different sample data sets and applied our recommendation mechanism. Suppose user x wants to use the CONDOR system to find a doctor of cardiology, so he registers his demographic information with the system. As a Bayesian network is used, all data should be in discrete form. Table 1 shows the preprocessed data set consisting of the demographic information of user x required for registration. Suppose that, in the user's current location, the system finds four cardiology specialists. The user's current context information is shown in Table 2, and Table 3 shows the information of the four doctors (D1, D2, D3 and D4) in the user's current location.

Table 1: Demographic information of user x

    Gender:                          Male
    Age:                             30-39
    Income (1,000 won):              200-250
    Health Insurance No. (if exists): H-0012
    Own Car:                         Yes
Table 2: Context information of user x

    Time:      10am - 11am
    Weather:   Sunny
    Day:       Weekday
    Location:  Differ distance

Figure 5: CONDOR's recommendation construction process (doctors' information, user context and user preference form the recommendation input space; Score(u, d_i) = W_1 * Interest(u, d_i) + W_2 * DWV(u, d_i) produces the ranked doctor list shown in the user's mobile Internet browser)

Table 3: Doctors' information in the current location
Table 3: Doctors' information in the user's current location
  Attribute                  D1            D2            D3            D4
  Specialty                  Card.         Card.         Card.         Card.
  Board Certification        Yes           Yes           No            No
  Overall Rating             0.8           1             0.3           0.3
  Accept Health Insurance    Yes           Yes           Yes           Yes
  Visit Fee                  Avg.          Avg.          Low           Avg.
  Service Hour (Weekday)     10a.m.-9p.m.  10a.m.-9p.m.  9a.m.-10p.m.  9a.m.-10p.m.
  Service Hour (Weekend)     10a.m.-3p.m.  11a.m.-4p.m.  10a.m.-3p.m.  10a.m.-3p.m.
  Lab Test Facility          Yes           Yes           No            No
  Distance (km)              2             8             5             3
  Parking Area               Yes           Yes           Yes           No
To calculate the final score of each doctor for recommendation, the probability of interest of user x in each doctor Di is first computed using the formula in equation (1) and the BN structure shown in Figure 4; we used Hugin Lite [17] for this calculation. Then, using equation (7), the DWV of each doctor's location is computed. The resulting Interest(x, Di) and DWV values are shown in Figure 6.
Figure 6: Interest(x, Di) and DWV values of the four doctors
If user x selects the normal condition, the final ranking score for each doctor is calculated using equation (8) with W1 = 0.5 and W2 = 0.5:

RankingScore(D1) = 0.5*0.8 + 0.5*0.47 = 0.635
RankingScore(D2) = 0.5*1.0 + 0.5*0 = 0.5
RankingScore(D3) = 0.5*0.65 + 0.5*0.14 = 0.395
RankingScore(D4) = 0.5*0.6 + 0.5*0.33 = 0.465
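The ranking computation above can be sketched as follows. The Interest and DWV values are those read from Figure 6; the function and variable names are illustrative, not the system's actual implementation:

```python
# Sketch of the ranking step of equation (8):
# Score(u, di) = W1 * Interest(u, di) + W2 * DWV(u, di), with W1 + W2 = 1.

def ranking_score(interest, dwv, w1=0.5, w2=0.5):
    assert abs(w1 + w2 - 1.0) < 1e-9
    return w1 * interest + w2 * dwv

# (Interest(x, Di), DWV) values of the four doctors, taken from Figure 6.
doctors = {"D1": (0.8, 0.47), "D2": (1.0, 0.0), "D3": (0.65, 0.14), "D4": (0.6, 0.33)}

# Normal condition: equal weights W1 = W2 = 0.5.
normal = {d: ranking_score(i, v) for d, (i, v) in doctors.items()}
ranked = sorted(normal, key=normal.get, reverse=True)
print(ranked)                      # ['D1', 'D2', 'D4', 'D3']
print(round(normal["D1"], 3))      # 0.635
```

Passing w1=0.1, w2=0.9 reproduces the emergency-condition ranking in the same way.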
The final doctors' list is displayed in the user's mobile phone web browser as shown in Figure 7.

[Figure 7: Recommendation results in normal condition. Ranked list: D1 (2 km), D2 (8 km), D4 (3 km), D3 (5 km), each with a View Map link.]
If, in the same position, user x selects the emergency condition, the scores are recalculated with W1 = 0.1 and W2 = 0.9; the result is shown in Figure 8. Doctor D2 drops out of the list because of its long distance (8 km):

RankingScore(D1) = 0.1*0.8 + 0.9*0.47 = 0.503
RankingScore(D2) = 0.1*1.0 + 0.9*0 = 0.1
RankingScore(D3) = 0.1*0.65 + 0.9*0.14 = 0.191
RankingScore(D4) = 0.1*0.6 + 0.9*0.33 = 0.357
[Figure 8: Recommendation results in emergency case. Ranked list: D1 (2 km), D4 (3 km), D3 (5 km), each with a View Map link.]
5.2 Effectiveness of the Architecture
In this experiment we concentrate on the on-line workload of the CONDOR system in a Grid environment. Let us model the centralized CONDOR system as an M/G/1 queue. An M/G/1 queue consists of a FIFO buffer, into which requests arrive randomly according to a Poisson process at rate λ, and a server that retrieves requests from the queue for servicing. User requests are serviced in first-come-first-served (FCFS) order, with mean service rate µ. We use the term 'task' as a generalization of a recommendation request arriving for service; we denote the processing requirement of a recommendation request as its 'task size', and the service time follows a general distribution. It has been observed that Internet workloads are heavy-tailed in nature [18], characterized by Pr{X > x} ~ x^(-α), where 0 ≤ α ≤ 2. In the CONDOR system, the processing requirement of a recommendation request varies with the number of doctors in the requested category: in internal medicine there may be 10,000 doctors, whereas in cardiology there may be 2,000. Since the task size varies with the number of doctors, we model the task size on a given CONDOR server as following a Bounded Pareto distribution. The probability density function of the Bounded Pareto B(k, p, α) is:
f(x) = (α k^α / (1 - (k/p)^α)) * x^(-α-1),   k ≤ x ≤ p   (9)

where α represents the task-size variation, k is the smallest possible task size and p is the largest possible task size. By varying the value of α, we can observe distributions that exhibit moderate variability (α = 2) to high variability (α = 1).

We now derive the expected waiting time E(W), where W is the time a user has to wait for service, E(Nq) is the number of waiting customers and E(X) is the mean service time. By Little's law, the mean queue length can be expressed in terms of the waiting time: E(Nq) = λ E(W), and the load on the server is ρ = λ E(X). Let E(X^j) be the j-th moment of the service distribution of the tasks. We have:

E(X^j) = α p^j ((k/p)^α - (k/p)^j) / ((j - α)(1 - (k/p)^α)),   if j ≠ α
E(X^j) = α k^α ln(p/k) / (1 - (k/p)^α),                        if j = α   (10)

Hence, using the Pollaczek-Khinchine (P-K) formula, we obtain the expected waiting time in the CONDOR system's queue: E(W) = λ E(X²) / (2(1 - ρ)). We now want to measure the expected waiting time with respect to varying server load and task sizes. In the centralized CONDOR system, the waiting time E(W) increases as the service time and the load on the server increase. In the Grid environment, however, the load will be distributed across different machines and the new load will
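As a sanity check on equation (10), the closed-form moments can be compared against direct numerical integration of the density in equation (9). The parameter values below (k = 1, p = 100, α = 1.5) are illustrative assumptions, not values from the paper:

```python
import math

def bp_pdf(x, k, p, alpha):
    """Bounded Pareto B(k, p, alpha) density, equation (9)."""
    return alpha * k**alpha / (1 - (k / p)**alpha) * x**(-alpha - 1)

def bp_moment(j, k, p, alpha):
    """j-th moment E(X^j) of the Bounded Pareto, equation (10)."""
    if j != alpha:
        return (alpha * p**j * ((k / p)**alpha - (k / p)**j)
                / ((j - alpha) * (1 - (k / p)**alpha)))
    return alpha * k**alpha * math.log(p / k) / (1 - (k / p)**alpha)

# Compare equation (10) against midpoint-rule integration of x^j * f(x).
k, p, alpha = 1.0, 100.0, 1.5
for j in (1, 2):
    n = 200_000
    h = (p - k) / n
    numeric = sum((k + (i + 0.5) * h)**j * bp_pdf(k + (i + 0.5) * h, k, p, alpha) * h
                  for i in range(n))
    closed = bp_moment(j, k, p, alpha)
    assert abs(numeric - closed) < 1e-3 * closed   # agree to within 0.1%
```

Both branches of equation (10) follow from integrating x^j times the density (9) over [k, p]; the check above exercises the j ≠ α branch.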
be ρ' = ρ(1 - P_redirect) on the primary server, and the new arrival rate will be λ' = λ(1 - P_redirect). Figure 9 shows the effectiveness of our CONDOR system in the Grid environment compared to the CRS in terms of expected waiting time, for task-size variability α = 1.5 and redirection probability P_redirect = 0.5. The figure shows a considerable improvement in expected waiting time in the distributed environment; without sharing resources, the user-perceived response time of the recommendation service grows sharply as the system load approaches 1.0.
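This comparison can be sketched with the P-K formula and the Bounded Pareto moments. The parameter values (k = 1, p = 100, λ = 0.3) are illustrative assumptions; only α = 1.5 and P_redirect = 0.5 are taken from the experiment description:

```python
def bp_moment(j, k=1.0, p=100.0, alpha=1.5):
    # j-th moment of Bounded Pareto B(k, p, alpha), equation (10), j != alpha case.
    return (alpha * p**j * ((k / p)**alpha - (k / p)**j)
            / ((j - alpha) * (1 - (k / p)**alpha)))

def expected_wait(lam):
    # P-K formula: E(W) = lam * E(X^2) / (2 * (1 - rho)), with rho = lam * E(X).
    rho = lam * bp_moment(1)
    assert rho < 1, "queue must be stable"
    return lam * bp_moment(2) / (2 * (1 - rho))

lam, p_redirect = 0.3, 0.5
w_central = expected_wait(lam)                   # centralized CONDOR (CRS)
w_grid = expected_wait(lam * (1 - p_redirect))   # primary server after redirection
assert w_grid < w_central
print(round(w_central, 2), round(w_grid, 2))
```

Halving the arrival rate lowers both the load and the queueing term, so the primary server's expected waiting time drops substantially, which is the qualitative behaviour Figure 9 reports.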
Figure 9: Effectiveness of our Grid-based CONDOR system architecture compared to the centralized RS
6 Conclusion
In this paper, we presented a novel context-aware doctor recommender system architecture, called CONDOR, in a Grid environment. The system helps a user/patient find suitable doctors at the right time and in the right place. We also discussed the recommendation process, which efficiently recommends appropriate doctors in both normal and emergency cases. Being based on Grid technology, the CONDOR system offers higher performance, stability and quality than centralized systems, which are prone to a single point of failure and lack the capability to improve recommendation quality and privacy. We are currently implementing the architecture, and in future work we will evaluate our framework using real-world test data.
Acknowledgement
This research was supported by the MKE (Ministry of Knowledge Economy), Korea, under the ITRC (Information Technology Research Center) support program supervised by the IITA (Institute of Information Technology Advancement) (IITA-2008-(C1090-0801-0002)).
References
[1] T. F. Stafford and M. L. Gillenson. "Mobile commerce: what it is and what it could be". Communications of the ACM, 46(12), 2003.
[2] Special issue on information filtering. Communications of the ACM, 35(12), 1992.
[3] S. Goyal and J. Carter. "A lightweight secure cyber foraging infrastructure for resource-constrained devices". Sixth IEEE Workshop on Mobile Computing Systems and Applications, 2004, pp. 186-195.
[4] P. Liu, G. Nie, D. Chen and Z. Fu. "The Knowledge Grid-Based Intelligent Electronic Commerce Recommender Systems". IEEE Intl. Conf. on SOCA, 2007, pp. 223-232.
[5] C. Chin, T. K. Shih and U. Wang. "An Integrated Analysis Strategy and Mobile Agent Framework for Recommendation System in EC over Internet". Tamkang Journal of Science and Engineering, 2002, 5(3):159-174.
[6] P. Han, B. Xie, F. Yang and R. Shen. "A scalable P2P recommender system based on distributed collaborative filtering". Expert Systems with Applications, 2004, 27(2).
[7] J. A. Pouwelse, P. Garbacki et al. "Tribler: A social-based peer-to-peer system". Proceedings of the 5th International P2P Conference (IPTPS 2006), 2006.
[8] D. C. Chu and M. Humphrey. "Mobile OGSI.NET: Grid Computing on Mobile Devices". 5th IEEE/ACM Intl. Workshop on Grid Computing, 2004, Pittsburgh, PA.
[9] A. K. Dey. "Understanding and Using Context". Personal and Ubiquitous Computing, Vol. 5, pp. 20-24, 2001.
[10] M. V. Setten, S. Pokraev and J. Koolwaaij. "Context-Aware Recommendation in the Mobile Tourist Application: COMPASS". AH 2004, 26-29 August, Eindhoven, The Netherlands, LNCS 3137, pp. 235-244.
[11] W. Yang, H. Cheng and J. Dia. "A Location-aware Recommender System for Mobile Shopping Environments". Expert Systems with Applications, 2006.
[12] H. Park, J. Yoo and S. Cho. "A Context-aware Music Recommendation System Using Fuzzy Bayesian Networks with Utility Theory". FSKD 2006, LNAI 4223, pp. 970-979.
[13] C. Ono, M. Kurokawa, Y. Motomura and H. Asoh. "A Context-Aware Movie Preference Model Using a Bayesian Network for Recommendation and Promotion". UM 2007, LNAI 4511, pp. 247-257.
[14] P. Korpipaa et al. "Bayesian Approach to Sensor-based Context Awareness". Personal and Ubiquitous Computing, Vol. 7, pp. 113-124, 2003.
[15] http://www.niapublications.org/agepages/choose.asp
[16] S. Zhang, Z. Zhang, N. Yang, J. Zhang and X. Wang. "An Improved EM Algorithm for Bayesian Networks Parameter Learning". Proceedings of the Third International Conference on Machine Learning and Cybernetics, Shanghai, 26-29 August 2004.
[17] http://www.hugin.com/Products_Services/Products/Demo/Lite/
[18] M. E. Crovella, M. S. Taqqu and A. Bestavros. "Heavy-Tailed Probability Distributions in the World Wide Web". A Practical Guide to Heavy Tails, Birkhauser Boston Inc., Cambridge, MA, USA, pp. 3-26, 1998.
A Two-way Strategy for Replica Placement in Data Grid
Qaisar Rasool, Jianzhong Li, Ehsan Ullah Munir, and George S. Oreku
School of Computer Science and Technology, Harbin Institute of Technology, China

Abstract – In large Data Grid systems, the main objective of replication is to enhance data availability by placing replicas in the proximity of users, so that the user-perceived response time is minimized. In a hierarchical Data Grid, replicas are usually placed in either a top-down or a bottom-up way. We put forward a Two-way replica placement scheme that places replicas of the most popular files close to the requesting clients, and replicas of less popular files one tier below the Data Grid root. We allow data requests to be serviced by the sibling nodes as well as by the parent. Experimental results show the effectiveness of the Two-way replica placement scheme against no replication.

Keywords: Replication, Replica placement, Data Grid.
1 Introduction
Grid computing [5] is a wide-area distributed computing environment that involves large-scale resource sharing among collaborations, often referred to as Virtual Organizations, of individuals or institutes located in geographically dispersed areas. Data Grids [2] are Grid infrastructures with specific needs to transfer and manage massive amounts of scientific data for analysis purposes. Data replication is an important technique used in distributed systems for improving data availability and fault tolerance. Replication schemes are divided into static and dynamic. While static replication is user-centered and does not adapt to the changing behavior of the system, dynamic replication is more suitable for environments like P2P and Grid systems. In general, a replication mechanism determines which files should be replicated, when to create new replicas, and where the new replicas should be placed. Many techniques have been proposed for dynamic replication in Grids [10, 7, 11, 13]. These strategies differ in the assumptions made regarding the underlying Grid topology, user request patterns, dataset sizes and their distribution, and storage node capacities. Other distinctive features include the data request path and the manner in which replicas are placed on the Grid nodes. Two common approaches for replica placement in a tree-topology Data Grid are top-down [10, 7] and bottom-up [11]. In both cases, the root of the Data Grid tree is considered the central repository for all datasets to be replicated.
In a Data Grid tree, clients at the leaf nodes usually generate the data requests. A request travels from the client toward the parent nodes in search of a replica until it reaches the root node. In this paper we propose a Two-way replication scheme that takes a different path for data requests. It is assumed that the children under the same parent in the Data Grid tree are linked in a P2P-like manner. For any client's request, if the desired data is not available at the client's parent node, the request moves to the sibling nodes one by one until it finds the required data. If none of the siblings can fulfill the request, the request moves to the parent node one level up. Here, too, all the siblings are probed, and if the data is not found the request moves to the next parent and ultimately to the root node. In the Two-way replication scheme we use both bottom-up and top-down approaches to place the data replicas in order to enhance the availability of requested data in the Data Grid. The files which are requested more frequently are placed close to the clients, and the less frequently requested files are placed close to the root, one tier below it. The simulation studies show the benefit of the Two-way replication strategy over the case where no replication is used. We perform experiments with data files of uniform size and of variable sizes separately.
2 Data Grid Model
Several Grid activities, such as [3, 8], have been launched since the early years of this century. Many practical Grids, for example GriPhyN [12], employ a topology which is hierarchical in nature. The High Energy Physics (HEP) community seeks to take advantage of Grid technology to provide physicists with access to real as well as simulated LHC [8] data from their home institutes. Data replication and management is hence considered to be one of the most important aspects of HEP Data Grids. In this paper we use the hierarchical Data Grid model. A tree T represents the topology of the Data Grid, composed of a root, intermediate nodes and leaf nodes. We hereafter refer to the intermediate nodes as cache nodes and to the leaf nodes as client nodes. All client nodes are local sites issuing requests for data stored at the root or cache nodes of the Data Grid. For any parent node, all its children are linked in a P2P-like manner (i.e. are siblings) and can transfer replicas to each other when required. The only
exception is the client tier, since the storage space of a client node is very limited and can hold only one file. Unlike most previous hierarchical Data Grid models, in which the data request sequence follows the path from child to parent up to the root of the tree, our hierarchical model has an additional data request path: a request moves upward to the parent node only after all the sibling nodes have been searched for the required data. The process is as follows:
1. A client c requests a file f. If the file is available in the client's cache, the request is satisfied; otherwise go to step 2.
2. The request is forwarded to the parent of client c. If the data is found there, it is transferred to the client; otherwise the request is forwarded to a sibling node.
3. Probe all sibling nodes one after another in search of the data. If the data is found, it is transferred to the client via the shortest path.
4. If the data is not found at any sibling node, the request is forwarded to the parent node one level up and step 3 is repeated.
5. Step 4 continues until the request reaches the root.
The Data Grid model and example data access paths are shown in Fig. 1.
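The five lookup steps above can be sketched on a toy tree as follows. The Node class and the example Grid are illustrative assumptions, not the simulator's actual data structures:

```python
# Minimal sketch of the sibling-first lookup (steps 1-5) on a toy Data Grid tree.

class Node:
    def __init__(self, name, parent=None):
        self.name, self.parent, self.children, self.files = name, parent, [], set()
        if parent:
            parent.children.append(self)

def locate(client, f):
    """Return the node holding file f, probing siblings before ascending."""
    if f in client.files:                       # step 1: local cache
        return client
    node = client.parent
    while node is not None:                     # steps 2-5
        if f in node.files:                     # check the current ancestor
            return node
        siblings = node.parent.children if node.parent else []
        for sib in siblings:                    # step 3: probe all siblings
            if sib is not node and f in sib.files:
                return sib
        node = node.parent                      # step 4: move one level up
    return None                                 # not found anywhere up to the root

# Toy Data Grid: root -> cache nodes {A, B} -> client c1; "f1" is only on B.
root = Node("root"); a = Node("A", root); b = Node("B", root); c1 = Node("c1", a)
b.files.add("f1"); root.files.add("f2")
print(locate(c1, "f1").name)   # B: served by a sibling of c1's parent
print(locate(c1, "f2").name)   # root: reached only after the siblings fail
```

The sibling probe is what distinguishes this path from the usual child-to-root chain: node B answers the request without the root ever being contacted.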
Fig. 1. Data access operation in the hierarchical Grid model

3 Two-way Replication Scheme

Replication techniques increase the availability of data and facilitate load sharing. With respect to data access operations, Data Grids are read-only environments in which either new data is introduced or existing data is replicated. There are two ways to satisfy the latency constraints in the system: one is to increase the speed of data transfer, the other is to shorten the transfer distance. Since bandwidth and CPU speed are usually expensive to change, shortening the transfer distance by placing replicas of data objects closer to the requesting clients is the cheapest and most effective way to ensure faster response times.

The Grid Replication Scheduler (GRS) is the managing entity of the two-way replication scheme. Each cache node and the root hold a Replica Manager that stores information about requested files and the times when the requests were made. This information is accumulated and forwarded to the GRS into a global workload table G with attributes (reqArrivalTime, clientID, fileID), where reqArrivalTime is the time when the request arrived in the system. Replicas are registered in the Replica Catalog when they are created and placed on nodes in the Grid.

3.1 Replica Creation

A replica management system should be able to handle a large number of replicas and their creation and placement. Like most previous dynamic replication strategies, we base the decision of replica creation on the data access frequency. Over time the GRS accumulates the access request history in the global workload table G, which is processed into a cumulative workload table W with attributes (clientID, fileID, NoA), where NoA stands for Number of Accesses. This table W is used to trigger replication of requested files. The GRS maintains all the necessary information about the replicas in the Replica Catalog: whenever the decision is made to initiate replication, the GRS registers the newly created replicas in the catalog along with their creation times and hosting nodes.

3.2 Replica Placement

As stated, for each newly created replica the GRS decides where to place it and registers it in the Replica Catalog. The placement decision is made in the following way. The GRS sorts the cumulative workload table W by fileID in ascending and NoA in descending order. A Project operation is then applied to the sorted table SW over fileID and NoA to get the table F(fileID, NoA). Since F may have many NoA entries for a given fileID, it is aggregated into AF, containing the total NoA for each individual file, so that the GRS knows how many times each file was accessed in the system. The table AF is then sorted by NoA to get SF. The upper half of SF contains the entries of the more-frequent files (MFFs) and the lower half those of the less-frequent files (LFFs); we simply divide the number of entries in SF by 2 and round the result to an integer. For example, a table of 777 entries would yield 389 MFFs and 388 LFFs (Fig. 2).

A data file may be requested by many clients in the Grid, and the client that generates the maximum number of requests for a file may be termed the best client node to host the replica of that file. We can easily obtain the information about the best
W  (Client, file, NoA): c1 a 20; c2 b 30; c3 a 45; c1 c 40; c4 c 10
SW (sorted by fileID asc, NoA desc): c3 a 45; c1 a 20; c2 b 30; c1 c 40; c4 c 10
F  (file, NoA): a 45; a 20; b 30; c 40; c 10
AF (aggregated NoA per file): a 65; b 30; c 50
SF (sorted by NoA desc): a 65; c 50; b 30
BC (file, best client): a c3; b c2; c c1
MFFs: a, c    LFFs: b
Fig. 2. Procedure to find the MFFs and LFFs

client for any requested file from the table SW. The GRS extracts and stores the entries of the best clients for all files in the table BC. The procedure for obtaining the MFFs and LFFs from the sorted workload table SW is depicted in Fig. 2.

The storage capacity of Grid nodes is an important consideration when replicating data, as the data size may be huge. The client nodes have comparatively the lowest storage capacity in the Data Grid hierarchy, so it is not suitable to choose a best client node as a replica server. However, we can benefit from the principle of locality by placing replicas very close to the best client: we resolve to place replicas at the parent node of the best client. Thus all MFFs are replicated at the immediate parent nodes of their best clients, while all LFFs are replicated at the children of the root node along the paths to their best clients.

Two-wayReplicaPlacement(W)
  t ← Get-Time()
  SW ← Sort(W) over fileID ASC, NoA DESC
  BC ← Extract-BestClient(SW)
  F ← Project(SW) over fileID, NoA
  AF ← Aggregate-NoA(F) over fileID
  SF ← Sort(AF) over NoA
  MFF ← Upper-Half(SF)
  LFF ← Lower-Half(SF)
  concurrently:
    Replicate-MFF(MFF, BC, t)
    Replicate-LFF(LFF, BC, t)
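The table manipulations of the placement algorithm (sort, project, aggregate, split, best client) can be sketched in plain Python, reproducing the example values of Fig. 2; the variable names mirror the tables in the text, but the code itself is only an illustrative sketch:

```python
# Sketch of the GRS workload processing, using the W table of Fig. 2.
from collections import defaultdict

W = [("c1", "a", 20), ("c2", "b", 30), ("c3", "a", 45), ("c1", "c", 40), ("c4", "c", 10)]

# SW: sort by fileID ascending, NoA descending.
SW = sorted(W, key=lambda r: (r[1], -r[2]))

# BC: best client = first (highest-NoA) row per file in SW.
BC = {}
for client, f, noa in SW:
    BC.setdefault(f, client)

# AF: aggregate NoA per file; SF: sort by total NoA descending.
AF = defaultdict(int)
for _, f, noa in W:
    AF[f] += noa
SF = sorted(AF.items(), key=lambda kv: -kv[1])

# Upper half -> MFFs, lower half -> LFFs (upper half rounded up, as 389/388 of 777).
half = (len(SF) + 1) // 2
MFF = [f for f, _ in SF[:half]]
LFF = [f for f, _ in SF[half:]]
print(MFF, LFF, BC)   # ['a', 'c'] ['b'] {'a': 'c3', 'b': 'c2', 'c': 'c1'}
```

The output matches Fig. 2: files a and c are the MFFs, file b is the only LFF, and the best clients are c3, c2 and c1 respectively.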
Fig. 3. Two-way Replica Placement Algorithm

After replication, the GRS flushes the workload tables in order to calculate the access statistics afresh for the next interval. If, while placing a replica, the GRS finds that the desired replica is already present at the selected node, it just updates the replica creation time in the catalog. The steps of the Two-way replica placement algorithm are given in Fig. 3.
The replication of the MFFs and LFFs is performed by the functions Replicate-MFF and Replicate-LFF, respectively. These functions are executed concurrently to spread the replicas to the selected locations; in each case, the replicas are placed along the way to the best client. Fig. 4 depicts the steps of the Replicate-MFF algorithm. For each file entry in the table MFF, the GRS finds its best client by consulting the table BC and then replicates that file at the parent node of its best client. The steps of the Replicate-LFF algorithm are shown in Fig. 5.

Replicate-MFF(MFF, BC, t)
  for all records r in MFF do
    f ← r.fileID
    c ← BC[f].nodeID
    p ← Parent(c)
    if Exist-in(f, p) then
      Update-CT(f, p, t)
      skip to next record
    end if
    if AvailableSpace(p, t) < size(f) then
      Evacuate(p, t)
    end if
    Replicate(f, p, t)
  end for

Fig. 4. Algorithm for replicating MFF files

If the free space of the replica server is less than the size of the new replica, some replicas may be deleted to make room for it. Only if the available space of the node becomes greater than or equal to the size of the requested data file can the functions Replicate-MFF and Replicate-LFF proceed. In each case, the function first reserves the storage space for file f at the selected node, and then invokes the transmission of file f to the candidate node in the
background. The replica is transferred to the selected destination from the closest node that holds a replica of data file f. After the transmission is completed, the new replica's creation time is set to t.

Replicate-LFF(LFF, BC, t)
  for all records r in LFF do
    f ← r.fileID
    n ← BC[f].nodeID
    while Parent(n) ≠ Root do
      n ← Parent(n)
    end while
    c ← n   // the child of the root on the path to the best client
    if Exist-in(f, c) then
      Update-CT(f, c, t)
      skip to next record
    end if
    if AvailableSpace(c, t) < size(f) then
      Evacuate(c, t)
    end if
    Replicate(f, c, t)
  end for

Fig. 5. Algorithm for replicating LFF files
3.3 Replica Replacement Policy

A replica replacement policy is essential to decide which of the stored replicas should be replaced by a new replica when there is a shortage of storage space at the selected node. At a specific time, the available space of a replica server equals its remaining free space plus the space occupied by any redundant replicas. A replica is considered redundant if it was created earlier than the current time session and is currently not active or referenced, meaning there is no request for it in the current session. The function Evacuate in Replicate-MFF and Replicate-LFF serves this purpose: for a given node it checks the creation times of all present replicas and keeps removing redundant replicas until the storage space is sufficient to host the new replica.
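The Evacuate step can be sketched as follows. The node record, the session bookkeeping and all names are illustrative assumptions, not the paper's implementation:

```python
# Sketch of Evacuate: remove redundant replicas (created before the current
# session and not referenced in it) until `needed` units of space are free.

def evacuate(node, needed, session_start, referenced):
    """node: {'free': int, 'replicas': {fileID: (size, creation_time)}}."""
    # Scan replicas oldest-first, as older ones are more likely redundant.
    for f, (size, created) in sorted(node["replicas"].items(), key=lambda kv: kv[1][1]):
        if node["free"] >= needed:
            break
        if created < session_start and f not in referenced:   # redundant replica
            del node["replicas"][f]
            node["free"] += size

node = {"free": 1, "replicas": {"a": (4, 10), "b": (2, 50), "c": (3, 12)}}
evacuate(node, needed=5, session_start=40, referenced={"c"})
print(sorted(node["replicas"]))   # ['b', 'c']: "a" evicted; "b" recent, "c" referenced
```

Replica "a" is old and unreferenced, so it is removed; "b" was created in the current session and "c" is still referenced, so both survive even though the node was short of space.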
4 Experiments

In order to evaluate the proposed Two-way replica placement scheme, we conducted experiments using the Data Grid model shown in Fig. 1, with link bandwidths constructed according to the estimations shown in Table 1. We consider the data to be read-only, so there are no consistency issues involved. The simulation runs in sessions, each session having a random set of requests generated by the various clients in the system. When a client generates a request for a file, the replica of that file is fetched from the nearest replica server and transported and saved to the local cache of the client. Initially all the data is held only at the root node. As time progresses, access statistics are gathered and used for the decisions on replica creation and placement. While a replica is being transferred from one node to another, the link is considered busy for the duration of the transfer and cannot be used for any other transfer simultaneously. We used an access pattern with a medium degree of temporal locality: some files were requested more frequently than others, and such requests made up 60% of the whole.

Table 1. Grid tiers link bandwidth
  Data Grid tiers    Bandwidth    Scaling
  Tier 0 - Tier 1    2.5 Gbps     1 MB
  Among Tier 1       7.0 Gbps     2.8 MB
  Tier 1 - Tier 2    2.5 Gbps     1 MB
  Among Tier 2       7.0 Gbps     2.8 MB
  Tier 2 - Tier 3    622 Mbps     0.24 MB

We ran experiments in two categories. In the first category we used files of variable sizes ranging from 500 MB to 1 GB, and in the second category we used files of a uniform size of 1 GB. For convenience, we scaled the file sizes and bandwidth values uniformly, reducing a 1 GB file size to 3.2 MB; the scaling of the bandwidth values between tiers is given in Table 1.

[Fig. 6. Experiment with data files of variable sizes: average response time (sec) of No Replication vs. the Two-way Strategy over request patterns with a medium degree of temporal locality.]

[Fig. 7. Experiment with data files of uniform size: average response time (sec) of No Replication vs. the Two-way Strategy over request patterns with a medium degree of temporal locality.]
For each client node, we keep a record of how long each requested file took to be transported to it; the average of this time over various simulation runs was calculated. Compared to the no-replication case, the Two-way replication technique shows better performance, as seen in Fig. 6 and Fig. 7. Since the file size is uniform in the second category of experiments, Fig. 7 shows a straight line for the no-replication case.
5 Related Work
An initial work on dynamic data replication in Grid environments was done by Ranganathan and Foster [10], who proposed six replication strategies: No Replication or Caching, Best Client, Cascading, Plain Caching, Cascading plus Caching and Fast Spread. Their analysis reveals that among these top-down schemes, Fast Spread shows a relatively consistent performance across various access patterns, giving the best performance when access patterns are random; when locality is introduced, Cascading offers good results. To find the nearest replica, each technique selects the replica server site that is the fewest hops from the requesting client. In [1] an improvement of the Cascading technique is proposed, namely the Proportional Share Replication policy, a heuristic method that places replicas at near-optimal locations when the number of sites and the total number of replicas to be distributed are known. Work on dynamic replication algorithms is presented by Tang et al. [11], who show improvements over the Fast Spread strategy while keeping the data file sizes uniform; their technique places replicas on Data Grid tree nodes in a bottom-up fashion. In [9, 13], the replica placement problem is formulated mathematically, followed by theoretical proofs of the solution methods. A hybrid topology is used in [6], where ring and fat-tree replica organizations are combined into multi-level hierarchies. Replication of a dataset is triggered when the requests for it at a site exceed some threshold; a cost model evaluates the data access costs and the performance gains of creating replicas, and the replication strategy places a replica at the site that minimizes the total access costs, including both read and write costs, for the datasets. Their experiments reflect the impact of the sizes of different data files and the storage capacities of the replica servers. The same authors proposed a decentralized data management middleware for Data Grids in [7]. Among the various components of the proposed middleware, the replica management layer is responsible for the creation of new replicas and their transfer between Grid nodes. Their experiments considered the top-down and bottom-up methods separately, with the data repository located at the root and at the bottom (clients) of the Data Grid, respectively.
Each of these techniques uses either the top-down or the bottom-up method for replica placement. In our work, we have used both methods together in order to enhance data availability.
6 Conclusions
The management of the huge amounts of data generated by scientific applications is a challenge. By replicating frequently requested data to selected locations, we can enhance data availability. In this paper we proposed a two-way replication strategy for Data Grid environments: the most frequent files are placed very close to the users, while the less frequent files are replicated top-down, one tier below the root of the Grid hierarchy. The experimental results show the performance benefit of the two-way replica placement scheme.
7 References
[1] J. H. Abawajy. "Placement of File Replicas in Data Grid Environments". Proceedings of the International Conference on Computational Science, LNCS 3038, pp. 66-73, 2004.
[2] A. Chervenak, I. Foster, C. Kesselman, C. Salisbury and S. Tuecke. "The Data Grid: Towards an Architecture for the Distributed Management and Analysis of Large Scientific Datasets". Journal of Network and Computer Applications, 23(3), pp. 187-200, 2000.
[3] EGEE, http://www.eu-egee.org/
[4] Q. Fan, Q. Wu, Y. He and J. Huang. "Transportation Strategies of the Data Grid". International Conference on Semantics, Knowledge, and Grid (SKG), 2006.
[5] Ian Foster and Carl Kesselman. "The Grid 2: Blueprint for a New Computing Infrastructure". Morgan Kaufmann, 2003.
[6] H. Lamehamedi, B. K. Szymanski, Z. Shentu and E. Deelman. "Data Replication Strategies in Grid Environments". Proceedings of the International Conference on Algorithms and Architectures for Parallel Processing (ICA3PP), IEEE Computer Society Press, Los Alamitos, CA, pp. 378-383, Oct 2002.
[7] H. Lamehamedi and B. K. Szymanski. "Decentralized Data Management Framework for Data Grids". Future Generation Computer Systems (Elsevier), 23(1), pp. 109-115, 2007.
[8] LHC Computing Grid, http://www.cern.ch/LHCgrid/
[9] Y. F. Lin, P. Liu and J. J. Wu. "Optimal Placement of Replicas in Data Grid Environments with Locality Assurance". International Conference on Parallel and Distributed Systems (ICPADS), 2006.
[10] Kavitha Ranganathan and Ian Foster. "Identifying Dynamic Replication Strategies for a High-Performance Data Grid". Proceedings of the International Grid Computing Workshop, LNCS 2242, pp. 75-86, 2001.
[11] Ming Tang, Bu-Sung Lee, Chai-Kiat Yeo and Xueyan Tang. "Dynamic Replication Algorithms for the Multi-tier Data Grid". Future Generation Computer Systems (Elsevier), 21, pp. 775-790, 2005.
[12] Grid Physics Network, http://www.griphyn.org/
[13] Y. Yuan, Y. Wu, G. Yang and F. Yu. "Dynamic Data Replication based on Local Optimization Principle in Data Grid". 6th International Conference on Grid and Cooperative Computing (GCC), 2007.
The Influence of Sub-Communities on a Community-Based Peer-to-Peer System with Social Network Characteristics
Amir Modarresi(1), Ali Mamat(2), Hamidah Ibrahim(2), Norwati Mustapha(2)
Faculty of Computer Science and Information Technology, University Putra Malaysia
(1) [email protected], (2) {ali,hamidah,norwati}@fsktm.upm.edu.my
Abstract
The objective of this paper is to investigate the effect of sub-communities on peer-to-peer systems which have social network characteristics. We propose a general peer-to-peer model based on social networks to illustrate these effects. In this model the whole system is divided into several communities based on the contents of the peers, and each community is in turn divided into several sub-communities. A computer-based model is created and social network parameters are calculated in order to show the effect of sub-communities. The results confirm that a large community with many highly connected nodes can be substituted by many sub-communities with normal nodes.
Keywords: Peer-to-peer computing, social network, community
1. Introduction Many systems form a network structure: a combination of vertices connected by edges. Among them we can mention social networks, such as collaboration or acquaintance networks; technical networks, such as the World Wide Web and electrical networks; and biological networks, such as food webs and neural networks. The concepts of social networks are applicable in many technical networks, especially those that connect people. These concepts help designers capture more information about the group of people who use the network, and the result is better services for that group according to their interests and needs. Peer-to-peer (P2P) systems also fit this structure. Since people operate the nodes, social network concepts are applicable to them as well. From a theoretical point of view, a P2P system creates a graph in which each node is a vertex and each neighborhood relation between two nodes is an edge. When no criterion is considered for choosing a neighbor, this graph is a random graph [1]; however, two important factors [2] change this characteristic in P2P systems: 1) the principle of limited interest, which states that each peer is interested in only a few contents of other peers, and 2) the law of spatial locality. Since each node represents one user of the system, a P2P system is a group of users with different interests who try to find similar users. Such a structure creates a social network. On the other hand, Barabási [3] has shown that in real social networks the probability of a node having a high degree is very low; in other words, the higher the degree, the less likely it is to occur. This relation is described by a power-law degree distribution, i.e., p(d) ∝ d^(−k), where d is the node degree and k > 0 is the parameter of the distribution. The network model defined with the characteristics in [3] has a short characteristic path length and a large clustering coefficient, as well as a degree distribution that approaches a power law. The characteristic path length is a global property that measures the separation between two vertices, whereas the clustering coefficient is a local property that measures the cliquishness of a typical neighborhood. As an example, consider the scenario of sharing knowledge among researchers. Since each researcher has a limited number of interests, he can communicate with other researchers who work in the same areas of interest. Because of many limitations, such as distance and resources, researchers usually work with colleagues in the same institute or college. Sometimes these connections are extended to other places in order to obtain more cooperation. This behavior defines a social network with some dense clusters, where the clusters are connected by few paths, as in Figure 1.
If each researcher is represented by one node, a P2P system is created which obeys social network characteristics.
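The power-law degree distribution p(d) ∝ d^(−k) described above can be sampled directly by inverse weighting. A minimal sketch, in which the exponent k = 2 and the cutoff d_max = 100 are illustrative assumptions rather than values from the paper:

```python
import random

# Sample node degrees from p(d) proportional to d**(-k); k and d_max are
# illustrative assumptions, not parameters taken from the paper.
def sample_power_law_degree(k, d_max, rng):
    """Sample a degree d in [1, d_max] with probability proportional to d**(-k)."""
    degrees = range(1, d_max + 1)
    weights = [d ** (-k) for d in degrees]
    return rng.choices(degrees, weights=weights)[0]

rng = random.Random(42)
samples = [sample_power_law_degree(2.0, 100, rng) for _ in range(10_000)]
low = sum(1 for d in samples if d <= 5)    # weakly connected peers dominate
high = sum(1 for d in samples if d >= 50)  # highly connected hubs are rare
print(low > 10 * high)  # True: the higher the degree, the less likely it occurs
```

The heavy imbalance between `low` and `high` counts is exactly the hub-versus-ordinary-peer asymmetry the model relies on.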
Figure 1: Many related clusters create a community (labels: one cluster, bridge)
The remainder of this paper is organized as follows. Section 2 reviews related work. Section 3 explains community concepts and the structure of the model. Section 4 describes the setup of a computer-based model and presents the results, and Section 5 concludes the paper.
2. Related Works Different structures and strategies have been introduced for P2P systems to improve performance and scalability. This section mainly reviews the approaches that focus on communities and peer clustering. Locality-proximate clusters have been used to connect all peers with the same proximity in one cluster; hop count and time zone are some of the criteria for detecting such proximity. In [4] general clusters were introduced, which support an unfixed number of clusters. Two kinds of links, local and global, connect each node to other nodes in its own cluster or to nodes in other clusters. This clustering system is not concerned with the content of nodes; physical attributes are the main criteria for making clusters. In [5] a Semantic Overlay Network (SON) is created based on common characteristics in an unstructured model. Peers with the same contents are connected to each other and make a SON, which is actually a semantic cluster. The whole system can be considered as a set of SONs with different interests. If a peer in SON S1, for example, searches for contents unrelated to its group, finding the proper peer is not always efficient: if there is no connection between S1 and the proper SON, flooding must be used. Common interest is another criterion for building a proper overlay. In [6] all peers with the same interest make connections with each other, but the locality of peers within one interest group is not considered. In [7] peers with the same interests are recognized after receiving many proper answers based on their interests. Such peers make shortcuts (logical connections) to each other. After a while, a group of peers with the same interests is created, and the peer richest in connections becomes the leader of the group. Since this structure is based on an unstructured system and on receiving proper answers within the range of the issued queries, we cannot expect that all peers with the same interests in the system are gathered in one group. In [8] communities are considered; the authors describe a community as gregariousness in a P2P network. Each community is created by one or more peers that have several things in common. The main concern in that paper is connectivity among peers in communities; neither the criteria for creating a community nor its size is explained. In [9] communities are modeled like human communities and can overlap. For each peer, three main groups of interest attributes are considered, namely personal, claimed, and private. The interests of each peer and the communities in the system are defined as collections of those attribute values, and a peer whose attributes conform to a specific community joins it. Since 25 different attributes are used in the model, finding a peer that has the same values for all of these attributes is not easy; that is why a peer may join different communities with a partial match of its attributes. Although the concept of communities is the same as in our work, in our model a shared ontology defines the whole environment and each community is part of that environment. There is also a bootstrapping node in each domain in order to prevent node isolation; our model also uses such nodes, but their main role is controlling sub-communities. [10] uses a shared ontology in an unstructured P2P system for peer clustering. Each peer advertises its expertise to all of its neighbors, and each neighbor can accept or reject this advertisement according to its own expertise. The expertise of each peer is identified by the contents of the files the peer has stored. Since an ontology is used, a generic definition of the whole environment of the model is provided, which is better than using some specific attributes. Super peers have also been used for controlling peer clustering and storing global information about the system.
In [11] super peers are used in a partially centralized model for indexing. All peers that obey system-known specific rules can connect to a designated super peer; this creates a cluster in which all peers have some common characteristics. Search in each cluster is done by flooding, but sending a query to just one group of peers produces better performance. Under these rules, super peers who control common rules must maintain larger indexes, so they need more disk space and CPU power. In [12], instead of rules, elements of an ontology are used for indexing. In this structure each cluster is created based on the indexed ontology, which is similar to our method: all peers with the same attribute are indexed. Our model also uses super peers and elements of an ontology for indexing, but instead of referring to each node in a cluster, super peers refer to the representative of that cluster, which controls the sub-communities of a specific community. This reduces the size of the index to the number of elements in the ontology, which is usually smaller than the number of peers in a large system, and provides better scalability.
3. Overview and Basic Concepts of the Proposed Model We investigate the effect of sub-communities based on a proposed social-network P2P model. First, we briefly state the community concept, and then we introduce the model. The model uses an ontology for defining the environment of the system and creating communities. It also uses super peers for referring to these communities. Sub-communities are introduced for better locality inside each community.
3.1. Community Concepts A social network can be represented by a graph G(V, E), where V denotes a finite set of actors (simply, people) in the network and E ⊆ V × V denotes the relationships between connected actors. Milgram [13] has shown that the world around us seems to be small: he showed experimentally that the average shortest path between any two persons is six. Although this is an experimental result, we experience it in the real world; we often meet persons who are unknown to us but turn out to be friends of one of our friends. People usually form social clusters based on their interests, but of different sizes. Such clusters, which are usually dense in connections, are connected to each other by few paths. All of the clusters with similar characteristics create a community. In each community: 1) each person must be reachable in a reasonable number of steps (what Milgram named the small world), and 2) each person must have some connections to others, which is captured by the clustering coefficient. With such characteristics, structures like trees or lattices cannot show the behavior of a social network. As stated in Section 1, each dense cluster in the network is connected to a few other clusters. In each cluster, some individuals called hubs are more important than others, because they have more knowledge or connections than other individuals. In order to join a cluster, a new member must address either a known person or a member of the cluster. We summarize an example as an instantiation of our model. A computer scientist regularly has to search publications or correct bibliographic metadata. The scenario we explain here is a community of researchers who share bibliographic data via a peer-to-peer system; such scenarios have been described in [14] and [10]. The whole data environment can be defined by the ACM ontology [15]. Each community in the system is defined by an element of the ontology and represented by a representative node. Each community comprises many sub-communities, or clusters, which are gathered around a hub. We show the effectiveness of sub-communities with the model. Figure 2 depicts this example.

Figure 2: Principal elements of the model (super peer, community representative, hubs, and ordinary peers; example communities: ACMTopic/Information_Systems and ACMTopic/Software)
From a social network point of view, hubs in our model are defined with the same characteristics as in a social network: knowledgeable persons rich in connections. They define sub-communities, or clusters, inside a community, and they may also make connections with other hubs. Representatives and super peers work as bridges among communities in the model. The difference is that representatives work as bridges among communities that are closer in the ontology, while super peers work as bridges among all communities. Peers have a capacity for making connections, which is defined based on a power-law distribution; this prevents the model from degenerating into a tree-like structure.
3.2. Definition of the Model We define our P2P model M (as in Figure 2) as follows. M has a set of peers P, where P = {p1, p2, ..., pn}. Each peer pi can have d different direct neighbors, which make up the set Ni = {di1, di2, ..., did}, where dij denotes the jth neighbor of peer pi. As a direct neighbor, pk is one logical hop away from pi, but the physical connection between pi and pk may not be a one-hop connection. M uses a shared ontology O to define the environment of the model; therefore, by using a proper ontology, this model can be applied to different environments. For simplicity, all peers pi have the same ontology. O can define several logical communities, each of which has many peers with a common interest. Each community cl contains at least one known member, the representative of that community; this role is usually granted to the first peer who defines the new community cl and is identified by rl. Since each community in the real world is a set of clusters, or sub-communities, and the members of each cluster usually obey some kind of proximity, such a structure must be considered in the model. Good criteria for addressing proximity in a network are hop count or, as a less precise metric, IP address. Since all peers in one community have a similar interest, placing peers with a smaller hop count between them closer together may provide a shorter distance among peers. Such a configuration gives a better response time for queries whose answers are in one community; in other words, locality of interest is established in a better form inside the community. Popular peers, or hubs, are defined in order to provide such locality. Hubs are peers that are rich in contents and connections in a specific place. Each hub creates a sub-community, or cluster, in one area of a community; all other peers with the same interest who are nearby connect to this hub and complete the sub-community. Since hubs establish many connections to other peers, they are a good point of reference for searching documents. Hubs in one community can be connected to each other in order to create sparse connections among sub-communities, as in Figure 1. They create the set ppl for cl. Formally, we have: ∀ cl : ppl = {p1, p2, ..., pS}, where S is the number of sub-communities, or hubs, in the community, identified by the policy of the system. Each hub in the community cl is also referred to by the representative rl. Once peer pi joins the community, pi asks the representative rl about the sub-communities inside the community; rl sends the addresses of all hubs in ppl of the community. pi communicates with each of them and calculates its distance from each member of ppl.
The shortest distance to any member of ppl identifies the cluster that pi must join. The contents of the files shared by each pi identify the interests of pi. These interests can be defined by the shared ontology O, which is stored in each peer in order to understand the relationships among communities. If pi has different kinds of files that indicate different interests, pi can contribute to different communities cl; as a result, two communities can be connected to each other via pi. M also has a set of super peers SP, where SP = {sp1, sp2, ..., spm} and m ≪ n. At the time of joining, pi announces its interest to the closest super peer spk. According to the announcement, spk identifies the community that has a similar interest; in fact, spk identifies rl, the representative of the community cl. Since the number of communities is as small as the number of elements in the shared ontology, the size of the index is smaller than in similar models whose super peers index files or even peers in the system. This feature lets the system be more scalable than similar ones. The structure of the model M creates interconnected communities. This advantage lets any peer search the network for any interest, even if that interest is not the same as its own. The super peer spk provides such connections at the highest level of the model, while the representative rl of community cl creates interconnected sub-communities at a lower level. Sub-community interconnection provides a path to nearly any peer inside the community. This means that a piece of data can be retrieved with high probability and low overhead, because only one part of the system (a community) is searched for a specific query.
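The key point of the super-peer index is that it maps ontology elements (communities) to community representatives rather than to individual peers or files, so the index size is bounded by the size of the ontology. A minimal sketch; the class shape, ACM topic strings, and peer ids are assumptions made for illustration:

```python
# Super peer whose index maps ontology elements to community representatives;
# topic strings and peer ids below are made up for the example.
class SuperPeer:
    def __init__(self):
        self.index = {}  # ontology element -> representative peer id

    def register(self, topic, peer_id):
        """First peer to announce a topic becomes its representative; later
        peers with the same interest are routed to that representative."""
        return self.index.setdefault(topic, peer_id)

sp = SuperPeer()
sp.register("ACMTopic/Information_Systems", "peer-17")
first = sp.register("ACMTopic/Software", "peer-3")
later = sp.register("ACMTopic/Software", "peer-42")  # routed, not re-registered
print(first, later)   # peer-3 peer-3
print(len(sp.index))  # 2: index grows with the ontology, not with the peers
```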
4. Simulation Setup We wrote a simulator to create a computer-based community model, in order to show the behavior of sub-communities and to what extent they are close to a social network. We ran two different experiments. In the first experiment we simulate just one community and assume that all peers with the same contents, based on a shared ontology, are gathered in a specific community; social network parameters are then calculated in order to show the efficiency of the model. Since in each community all peers have the same characteristics, showing the behavior of one community is enough to show the characteristics of the whole system, although simulating more than one community is also possible. In the second experiment we define two different communities with different interests, create queries in the system, and then calculate the number of successful queries and the recall for finding answers. In both experiments, we define the number of peers in the model in advance and assign each peer a capacity for making connections with other peers based on a power-law distribution. The distribution of connections among peers defined by the simulator is shown in Figure 3. Joining and leaving of peers during the simulation are not considered. The first peer who joins the community is chosen as the representative of the community. Following the definition of the model, the peers that are richer in connections are chosen as hubs. When a new peer joins the community, it communicates with the representative, which returns the addresses of the hubs in the community. The new peer then sends a message to each hub and calculates its distance from them; this is done in the model by defining random hop counts between the new peer and the hubs. The hub with the smallest path is chosen. Since hubs are normal peers with a higher capacity for accepting connections, if all of a hub's connections have already been used, another hub is chosen. Such a restriction on connections has several reasons. First, it allows controlling the connection distribution in the system. Second, after all hubs are full, the new peer must connect to other normal peers; this mimics the behavior of a member joining a community through another member. If the new peer has a capacity of more than one connection, the other neighbors are chosen randomly. First, members inside the same sub-community are chosen, because they may have shorter distances; then, if those peers cannot accept any more connections, peers from other sub-communities are chosen. These kinds of connections create potential bridges among sub-communities, which, together with the representative of the community, keep the different sub-communities connected. Since locality is the main concern, such connections are established only if the target peer is rich in the favored contents.
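The joining step above can be sketched in a few lines; the uniform random hop counts and all names are assumptions made for illustration, not the paper's actual simulator code:

```python
import random

# Sketch of hub selection on join: measure a (random) hop distance to every
# hub returned by the representative, connect to the closest one with spare
# capacity, and fall back when all hubs are full.
class Peer:
    def __init__(self, pid, capacity):
        self.pid, self.capacity, self.neighbors = pid, capacity, set()

    def has_room(self):
        return len(self.neighbors) < self.capacity

def join_community(new_peer, hubs, rng):
    """Connect the new peer to the nearest hub that still has capacity."""
    for hub in sorted(hubs, key=lambda h: rng.randint(1, 10)):  # random hops
        if hub.has_room():
            new_peer.neighbors.add(hub.pid)
            hub.neighbors.add(new_peer.pid)
            return hub
    return None  # every hub is full: attach to a normal peer instead

rng = random.Random(7)
hubs = [Peer(f"hub-{i}", capacity=2) for i in range(3)]
joined = [join_community(Peer(f"p{i}", 5), hubs, rng) for i in range(6)]
print(all(j is not None for j in joined))        # 3 hubs x 2 slots = 6 joins
print(join_community(Peer("p6", 5), hubs, rng))  # None: hubs are full
```

The capacity limit is what forces later peers onto ordinary peers, mimicking joining a community through an existing member.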
Figure 3: Distribution of connections among nodes

Watts [16] has shown that a small-world graph is a graph located between regular and random graphs: such a graph has a characteristic path length as low as that of a random graph and a clustering coefficient as high as that of a regular graph. The highest clustering coefficient belongs to the fully connected graph, whose shortest path is obviously 1. We therefore calculate the clustering coefficient and characteristic path length for the model. Table 1 shows the clustering coefficient for a community with 500 nodes while the maximum number of accepted connections and the number of hubs (and hence sub-communities) vary. As can be expected, defining sub-communities increases the clustering coefficient, even with just one sub-community. With a small number of hubs and a low capacity for accepting connections, many peers are connected to each other without any connection to a hub; this effect produces the longer characteristic path lengths in Table 2. When the number of connections is increased, the clustering coefficient also increases. Moreover, there is more chance for other peers to connect to hubs, which decreases the characteristic path length. When the capacity for accepting connections is high, more than the number of peers in the community, the graph of the model moves toward a complete graph; this explains the larger values of the clustering coefficient. Moreover, the existence of many points of reference (hubs) in the model decreases the characteristic path length. Needless to say, when peers have a high capacity for accepting connections, many other clusters are created inside each sub-community; since they are implicit, reaching them is not very fast except through their explicit sub-community.

Table 1: The value of cluster coefficient for different sub-communities and connections

Max. Connections   40 hubs   20 hubs   10 hubs   5 hubs   1 hub   No hub
10                 0.39      0.26      0.19      0.14     0.11    0.096
50                 0.41      0.39      0.42      0.29     0.15    0.11
100                0.43      0.4       0.42      0.46     0.25    0.15
500                0.57      0.56      0.53      0.54     0.59    0.48
1000               0.69      0.64      0.64      0.62     0.64    0.48

Figure 4: Cluster coefficient while maximum connections change in different sub-communities
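The two metrics reported in Tables 1 and 2, the clustering coefficient and the characteristic path length, can be computed directly from an adjacency list. A minimal sketch on a toy graph of two triangles joined by a bridge; the graph is an illustration, not the simulated topology:

```python
from collections import deque
from itertools import combinations

def clustering_coefficient(adj):
    """Average local clustering: fraction of each node's neighbor pairs that
    are themselves connected (the cliquishness of a typical neighborhood)."""
    coeffs = []
    for v, nbrs in adj.items():
        if len(nbrs) < 2:
            coeffs.append(0.0)
            continue
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        coeffs.append(links / (len(nbrs) * (len(nbrs) - 1) / 2))
    return sum(coeffs) / len(coeffs)

def characteristic_path_length(adj):
    """Mean shortest-path length over all connected pairs, via BFS from each node."""
    total = count = 0
    for src in adj:
        dist, queue = {src: 0}, deque([src])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total += sum(d for v, d in dist.items() if v != src)
        count += len(dist) - 1
    return total / count

# Two dense triangles joined by a single bridge (cf. Figure 1).
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
       3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
print(round(clustering_coefficient(adj), 3))  # 0.778
print(characteristic_path_length(adj))        # 1.8
```

High clustering with a short path length is exactly the small-world combination [16] that the sub-community structure is meant to produce.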
Since the results in Table 2 give the path length for one community, adding 2 extra steps yields the value for the whole model; this is the average path length when a peer in one community tries to reach a peer in a different community through the available super peers of the model. In the second experiment, we define two different communities and 1000 peers in advance. Each peer chooses one of the defined communities and joins it during network construction. This selection is based on the interest of the peer, which is related to the files loaded on it: 60 percent of the files loaded on each peer have interests similar to the peer's own interest, and the other 40 percent are files not related to the interest of the peer. Moreover, we do not consider any special shared data capacity for hubs: a peer chosen as a hub has a larger data capacity under the same distribution that the other peers obey. Considering extra storage would create better results.
Table 2: The value of characteristic path length for different sub-communities and connections

Max. Connections   40 hubs   20 hubs   10 hubs   5 hubs   1 hub   No hub
10                 3.08      3.75      4.14      4.24     4.36    4.42
50                 2.64      2.67      2.76      2.83     2.88    2.92
100                2.48      2.47      2.54      2.55     2.52    2.61
500                2.04      2.01      2.06      2.11     1.91    1.96
1000               1.95      1.91      1.91      1.93     1.88    2.01
Throughout the simulation, a peer is chosen randomly and poses a query. The query may have the same interest as the peer's own interest or one of the other interests defined in the model. If the query has the same interest, answers are found in the same community; otherwise the query is sent to the proper community. We define a successful query as a query that retrieves at least one result. As can be expected, by increasing the number of sub-communities and the clustering coefficient, the number of successful queries also increases; Figure 6 shows this result. The recall rate is likewise affected by these changes; Figure 7 shows the values of the recall rate.

Figure 7: Recall values for different sub-communities when maximum connections change
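The two measures of the second experiment can be computed from a query log as follows; the log below is made up purely for the example (`retrieved` = relevant results found, `relevant` = all relevant results in the system):

```python
# Success rate and recall as defined in the experiment; the log is fabricated
# for illustration only.
def success_rate(queries):
    """Fraction of queries that retrieve at least one result."""
    return sum(1 for q in queries if q["retrieved"] > 0) / len(queries)

def average_recall(queries):
    """Mean over queries of retrieved / relevant."""
    return sum(q["retrieved"] / q["relevant"] for q in queries) / len(queries)

log = [{"retrieved": 4, "relevant": 10},
       {"retrieved": 0, "relevant": 5},
       {"retrieved": 8, "relevant": 8},
       {"retrieved": 2, "relevant": 4}]
print(success_rate(log))    # 0.75
print(average_recall(log))  # 0.475
```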
In unstructured networks where flooding is used for answering queries, the number of neighbors has a direct effect on the recall rate and also on network traffic. By defining many sub-communities we can obtain the same recall rate with fewer network connections. Moreover, while answering queries, only one part of the network is affected by the produced traffic.
Figure 5: Characteristic path length while maximum connections change in different sub-communities

Figure 6: Success rate while maximum connections change in different sub-communities

5. Conclusion
We can conclude that when the sub-community concept is used, powerful nodes in P2P systems can be substituted with normal nodes and few connections. Choosing a tradeoff between the maximum number of connections in the system and the number of sub-communities can reduce resource consumption in nodes and index size in representatives. In other words, by using sub-communities, a P2P model like ours can be constructed with regular nodes instead of powerful nodes.
References
[1] Erdős, P. and Rényi, A., "On Random Graphs", 1956.
[2] Chen, H., Huang, Z. and Gong, Z., "Efficient Content Location in Peer-to-Peer Systems", Proceedings of the 2005 IEEE International Conference on e-Business Engineering (ICEBE'05), 2005.
[3] Barabási, A.-L. and Albert, R., "Emergence of Scaling in Random Networks", Science, 286:509-512, 1999.
[4] Hu, T.-H. and Seneviratne, A., "General Clusters in Peer-to-Peer Networks", ICON, 2003.
[5] Crespo, A. and Garcia-Molina, H., "Semantic Overlay Networks for P2P Systems", Agents and Peer-to-Peer Computing (AP2PC), 2004.
[6] Sripanidkulchai, K., Maggs, B. M. and Zhang, H., "Efficient Content Location Using Interest-Based Locality in Peer-to-Peer Systems", INFOCOM, 2003.
[7] Chen, W.-T., Chao, C.-H. and Chiang, J.-L., "An Interest-based Architecture for Peer-to-Peer Network Systems", AINA, 2006.
[8] Shijie, Z., et al., "Interconnected Peer-to-Peer Network: A Community Based Scheme", AICT/ICIW, 2006.
[9] Khambatti, M., Dong Ryu, K. and Dasgupta, P., "Structuring Peer-to-Peer Networks Using Interest-Based Communities", DBISP2P 2003, Springer LNCS 2944, 2003.
[10] Haase, P., et al., "Bibster - A Semantics-Based Bibliographic Peer-to-Peer System", International Semantic Web Conference, 2004.
[11] Nejdl, W., et al., "Super-Peer-Based Routing and Clustering Strategies for RDF-Based Peer-to-Peer Networks", Proceedings of WWW 2003, 2003.
[12] Schlosser, M., et al., "HyperCuP - Hypercubes, Ontologies and Efficient Search on P2P Networks", International Workshop on Agents and Peer-to-Peer Computing, Bologna, Italy, 2002.
[13] Milgram, S., "The Small World Problem", Psychology Today, 1(1):61-67, 1967.
[14] Ahlborn, B., Nejdl, W. and Siberski, W., "OAI-P2P: A Peer-to-Peer Network for Open Archives", International Conference on Parallel Processing Workshops (ICPPW'02), 2002.
[15] ACM, 1998 ACM Computing Classification System, http://www.acm.org/class/1998.
[16] Watts, D. and Strogatz, S., "Collective Dynamics of 'Small-World' Networks", Nature, 393:440-442, 1998.
SESSION SECURITY AND RELIABILITY ISSUES + MONITORING STRATEGIES Chair(s) TBA
Estimating Reliability of Grid Systems using Bayesian Networks
O. Doguc, J. E. Ramirez-Marquez
Department of Systems Engineering and Engineering Management, Stevens Institute of Technology, Hoboken, New Jersey 07030, USA
Abstract - With the rapid spread of computational environments for large-scale applications, computational grid systems are becoming popular in various areas. In general, a grid service needs to use a set of resources to complete certain tasks. Thus, in order to provide a grid service, allocating required resources to the grid is necessary. The conventional reliability models have some common assumptions that cannot be applied to the grid systems. This paper discusses the use of Bayesian networks as an efficient tool for estimating grid service reliability.
Keywords: Bayesian networks, K2, Grid system reliability

1 Introduction
With the rapid spread of computational environments for large-scale applications, computational grid systems are becoming popular in various areas. However, most of those applications require the utilization of various geographically or logically distributed resources, such as mainframes, clusters, and data sources owned by different organizations. In such circumstances, the grid architecture offers an integrated system in which communication between any two nodes is available. The ability to share resources is a fundamental concept for grid systems; therefore, resource security and integrity are prime concerns [1]. Traditionally, the function of computer networks has been to exchange files between remote computers, but grid systems demand that networks provide all kinds of services, such as computing, management, and storage [2]. Grid system reliability thus becomes an important issue for system users. As an example, the Internet is a large-scale computational grid system: due to the large number of Internet users and the resources shared through it, the interactions between users and resources cannot be easily modeled. Moreover, Internet users can share resources through other users, and the same resources can be shared by multiple users,
which makes it difficult to understand the overall system behavior. As a recent topic, there are a number of studies on estimating grid system reliability in the literature [3-6]. In these studies, grid system reliability is estimated by focusing on the reliabilities of the services provided in the grid system. For this purpose, the grid system components involved in a grid service are organized into spanning trees, and each tree is studied separately. However, these studies mostly focus on understanding grid system topology rather than estimating the actual system reliability; for simplification, they make certain assumptions about component failure rates, such as their satisfying a given probabilistic distribution. Bayesian networks (BN) have been proposed as an efficient method for reliability estimation [7-9]. For systems engineers, BN provide significant advantages over traditional frameworks, mainly because they are easy to interpret and can be used in interaction with domain experts in the reliability field [10]. In general, and from a reliability perspective, a BN is a directed graph with nodes that represent system components and edges that show the relationships among them. Within this graph, each edge is assigned a probabilistic value that shows the degree (or strength) of the relationship it represents. Using the BN structure and the probabilistic values, the system reliability can be estimated with the help of Bayes' rule. There are several recent studies on reliability estimation using BN [7, 9, 11-13], all of which require specialized networks designed for a specific system. That is, the BN used for analyzing system reliability must be known beforehand, so that it can be built by an expert with adequate knowledge of the system under consideration. However, human intervention is always open to unintentional mistakes that could cause discrepancies in the results [14].
To address these issues, this paper introduces a methodology for estimating grid system reliability by combining various techniques: BN construction from raw component and system data, association rule mining, and evaluation of conditional probabilities. Based on an extensive literature review, this is the first study that combines these methods for estimating grid system reliability. Unlike previous studies, the methodology proposed in this paper does not rely on any assumptions
124
Int'l Conf. Grid Computing and Applications | GCA'08 |
about the component failure rates in grid systems. Our methodology automates the process of BN construction by using the K2 algorithm (a commonly used association rule mining algorithm), that identifies the associations among the grid system components by using a predefined scoring function and a heuristic. The K2 algorithm has proven to be efficient and accurate for finding associations [15] from a dataset of historical data about the system. The K2 algorithm reduces the algorithmic complexity of finding associations from exponential to quadratic [15] with respect to the number of components in the system. According to our proposed method, once the BN is efficiently and accurately constructed, reliabilities of grid services can be estimated with the help of Bayes’ rule.
2 Grid Systems

Unlike typical distributed systems, computational grid systems require large-scale sharing of resources across different types of components. A service request in a grid system involves a set of nodes and links through which the service can be provided. In a grid system, the Resource Managers (RM) control and share resources, while the Root Nodes (RN) request service from the RM (an RN may also share resources) [5]. The reliability of a grid system can be estimated by using the reliabilities of the services provided through the system. There are several studies in the literature that focus on the reliability of grid systems; however, many of them rely on certain assumptions [3-6, 16], as will be discussed. Dai and Wang present a methodology to optimally allocate the resources in a grid system in order to maximize the grid service reliability [5]. They use a genetic algorithm to find the optimum solution efficiently among numerous possibilities. Later, Levitin and Dai propose dividing grid services into smaller tasks and subtasks, then assigning the same tasks to different RM for parallel processing [16]. An example grid system is displayed in Figure 1, in which the RM are shown as single circles and the RN as double circles. In order to evaluate the reliability of a grid service, the links and nodes that are involved in that service must be identified. Dai and Wang show that the links and nodes in each grid service form a spanning tree [5]. The resource spanning tree (RST) is defined as a tree that starts from the requestor RN (as its root) and covers all resources that are required for the requested grid service [5]. For example, in the grid system in Figure 1, when the RN G1 requests the resources R1 and R5, there are several paths to provide this service: {G1, G2}, {G1, G3, G5}, {G1, G4, G6}, {G1, G3, G5, G6} and {G1, G2, G5, G6} are some of the RST that include the requested resources.

The number of RST for a grid service can be quite large; the minimum resource spanning tree (MRST) is defined as an RST that does not contain any redundant components. Thus, when a component is removed from an MRST, it does not span all the requested resources anymore [5]. For example, in the grid system in Figure 1, {G1, G3, G5} is an MRST but {G1, G2, G5} is not, because {G1, G2} already covers all the requested resources. The reliability of a service in a grid system can be evaluated by using the reliabilities of the services through the MRST, and it has been shown that the MRST of a grid system can be discovered efficiently [5].
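The minimality condition lends itself to a direct check: an RST is an MRST exactly when no non-root node can be removed without losing coverage of some requested resource. A small illustrative sketch follows; the resource map below is an assumption chosen to match the Figure 1 example, and tree-connectivity checks are omitted for brevity:

```python
def covers(nodes, offered, requested):
    """True if the given nodes jointly offer every requested resource."""
    held = set()
    for n in nodes:
        held |= offered.get(n, set())
    return requested <= held

def is_mrst(tree, root, offered, requested):
    """An RST is an MRST if it covers the request and no non-root
    node can be dropped without losing coverage [5]."""
    if not covers(tree, offered, requested):
        return False
    return all(
        not covers(tree - {n}, offered, requested)
        for n in tree if n != root
    )

# Hypothetical resource map consistent with the Figure 1 discussion:
# G2 alone offers both requested resources, so {G1, G2, G5} is not minimal.
offered = {"G1": set(), "G2": {"R1", "R5"}, "G3": {"R1"}, "G5": {"R5"}}
requested = {"R1", "R5"}
print(is_mrst({"G1", "G3", "G5"}, "G1", offered, requested))  # True
print(is_mrst({"G1", "G2", "G5"}, "G1", offered, requested))  # False
```

The quadratic cost of this check (one coverage test per removable node) is why efficient MRST discovery, as cited above, matters when the number of RST is large.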
Figure 1: A sample grid system
3 Bayesian Networks
Estimation of system reliability using BN dates back as early as 1988, when it was first defined by Barlow [17]. The idea of using BN in systems reliability has mainly gained acceptance because of the simplicity with which it represents systems and its efficiency in obtaining component associations. The concept of BN has been discussed in several earlier studies [18-20]. More recently, BN have found applications in software reliability [21, 22], fault finding systems [19], and general reliability modeling [23]. One could summarize a BN as an approach that represents the interactions among the components in a system from a probabilistic perspective. This representation is performed via a directed acyclic graph, where the nodes represent the variables and the links between each pair of nodes represent the causal relationships between the variables. From a system reliability perspective, the variables of a BN are defined as the components in the system, while the links represent the interaction of the components leading to system "success" or "failure". In a BN this interaction is represented as a directed link between two components, forming a child and parent relationship, so that the dependent component is called the child of the other. Therefore, the success probability of a child node is conditional on the success probabilities associated with each of its parents. The conditional probabilities of the child nodes are calculated by using Bayes' theorem via the probability values assigned to the parent nodes. Also, the absence of a link between any two nodes of a BN indicates that these components do not interact for system failure/success; thus, they are considered independent of each other and their probabilities are calculated separately. As will be discussed in Section 3.2 in detail, calculations for the
independent nodes are skipped during the process of system reliability estimation, reducing the total amount of computational work.

3.1 Bayesian Network Construction Using Historical Data

The K2 algorithm for the construction of a BN was first defined by Cooper and Herskovits as a greedy heuristic search method [24]. The algorithm searches for the parent set of a node that has the maximum association with it. The K2 algorithm is composed of two main elements: a scoring function, to quantify the associations and rank the parent sets according to their scores, and a heuristic, to reduce the search space explored to find the parent set with the highest degree of association [24]. Without the heuristic, the K2 algorithm would need to examine all possible parent sets, i.e., starting from the empty set, it would have to consider all subsets. Even with a restriction on the maximum number of parents (u), the search space would be as large as 2^u (the total number of subsets of a set of size u), which requires an exponential-time search to find the optimal parent set. With the heuristic, the K2 algorithm does not need to consider the whole search space: it starts with the assumption that the node has no parents and incrementally adds the parent whose addition most increases the scoring function. When the addition of no single parent can increase the score, the K2 algorithm stops adding parents to the node. Using the heuristic reduces the size of the search space from exponential to quadratic. With the help of the K2 algorithm, we develop a methodology that uses historical system and component data to construct a BN model. Moreover, as stated above, the BN model is very efficient in representing and calculating the interactions between system components.

3.2
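The greedy parent search described above can be sketched as follows. The scoring function here is a hypothetical stand-in for the Cooper-Herskovits score that K2 would compute from historical data; the node and candidate names are illustrative:

```python
def k2_parents(node, candidates, score, max_parents):
    """Greedy K2-style parent search: start with no parents and
    repeatedly add the single candidate whose addition most increases
    the score; stop when no addition helps or the limit is reached."""
    parents = set()
    best = score(node, parents)
    while len(parents) < max_parents:
        gains = [
            (score(node, parents | {c}), c)
            for c in candidates - parents
        ]
        if not gains:
            break
        new_best, choice = max(gains)
        if new_best <= best:
            break  # no single parent improves the score: stop
        parents.add(choice)
        best = new_best
    return parents

# Toy scoring function (NOT the real K2 score): rewards parents that
# are assumed true causes and slightly penalizes larger parent sets.
true_parents = {"G3": {"G1", "L1"}}
def toy_score(node, parents):
    good = len(parents & true_parents.get(node, set()))
    return good - 0.1 * len(parents)

print(k2_parents("G3", {"G1", "G2", "L1", "L2"}, toy_score, 2))
```

Each pass scans the remaining candidates once, which is what brings the search down from exponential to quadratic in the number of components.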
In Figure 2, the topmost nodes (G1, L1 and L2) do not have any incoming edges; therefore, they are conditionally independent of the rest of the components in the system. The prior probabilities assigned to these nodes should be known beforehand, either with the help of a domain expert or from historical data about the system. Based on these prior probabilities, the success probability of a dependent node, such as G3, can be calculated using Bayes' theorem, as shown in Equation (1):

p(G3 | G1, L1) = p(G1, L1 | G3) · p(G3) / p(G1, L1)        (1)
As shown in Equation (1), the probability of node G3 depends only on its parents, G1 and L1, in the BN shown in Figure 2. The total number of computations required for calculating this probability is thus reduced from 2^n (where n is the number of nodes in the network) to 2^m, where m is the number of parents of the node (and m < n).
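With concrete probability values, Equation (1) is a one-line computation. The numbers below are illustrative assumptions, not values from the paper:

```python
def bayes_success(p_parents_given_child, p_child, p_parents):
    """p(child | parents) via Bayes' rule, as in Equation (1)."""
    return p_parents_given_child * p_child / p_parents

# Assumed values: p(G1,L1 | G3) = 0.9, p(G3) = 0.8, p(G1,L1) = 0.81.
p = bayes_success(0.9, 0.8, 0.81)
print(round(p, 4))  # 0.8889
```

Only the m parent probabilities enter the calculation, which is the source of the 2^n to 2^m reduction noted above.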
The historic value f_H(r_i, n) of a resource provider r_i can be expressed as follows:

f_H(r_i, n) = (1/m) · Σ_{k=1..m} [Av_k(r_i) + SR_k(r_i)] / 2,          if m ≤ cr

f_H(r_i, n) = (1/cr) · Σ_{k=m−cr..m} [Av_k(r_i) + SR_k(r_i)] / 2,      if m > cr

where m is the number of times the resource provider r_i has communicated with the Resource Broker (RB), and Av_k(r_i) and SR_k(r_i) denote the Availability and the Success Rate of r_i recorded at its k-th interaction. For the computation of the historic value, the resource provider r_i should have at least one previous transaction. That is, we take a simple average of the Availability and the Success Rate over the iterations while their number is less than the constant value cr; for computations over iterations greater than the constant cr, we apply a moving-average scheme over a period of cr. The constant cr may vary depending on the application nature of the grid: in a grid environment where resources are used frequently, cr can be fixed to a higher value, and vice versa, cr can be fixed to a smaller value. Note that the constant cr does not pertain to a specific time span but only to the number of times the resources have communicated with the Resource Broker; it mainly reflects how far a resource has been utilized and has proven its successful results in job execution.

CMax = f_max(C_ri), where i = 1 .. N        ----- (5)

NMax = f_max(N_ri), where i = 1 .. N        ----- (6)

The N in Equations (5) and (6) represents the total number of resources considered at the time of scheduling. The function f_max(C_ri) computes the maximum CPU value, in GHz, among the selected resources; similarly, the function f_max(N_ri) computes the maximum network bandwidth, in MBps, over the list of selected resource providers. The main significance of these two functions is to enable a comparative analysis across the resource provider list and to select the best one: more weightage should be given to a resource provider r_i whose CPU and bandwidth are higher. Thus, the
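Reading the historic value as a simple average that switches to a cr-wide moving average once enough interactions have accumulated, it can be sketched as follows. Averaging Availability and Success Rate per interaction is an assumption based on the description in the text:

```python
def historic_value(history, cr):
    """Historic trust f_H: simple average of the per-interaction
    (availability + success_rate)/2 scores while the interaction
    count m <= cr, and a moving average over the last cr
    interactions once m > cr.  `history` is a list of
    (availability, success_rate) pairs, one per interaction
    with the Resource Broker."""
    m = len(history)
    if m == 0:
        return 0.0  # no previous transactions: no historic trust yet
    window = history if m <= cr else history[m - cr:]
    scores = [(av + sr) / 2 for av, sr in window]
    return sum(scores) / len(scores)

# Hypothetical interaction log for one provider.
history = [(0.8, 0.9), (0.7, 0.8), (0.9, 1.0), (0.6, 0.7)]
print(round(historic_value(history, cr=3), 3))  # moving average of last 3
```

The moving-average window makes the historic value track recent behavior once a provider has interacted with the RB more than cr times, which matches the intent of tuning cr to how frequently resources are used.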
function f_P(r_i, t_s) computes the present system capability of a resource provider r_i. Here we introduce the arbitrary constants C1 and C2, which may vary depending on the CMax and NMax values obtained, respectively. Thus, the Trust of a resource provider r_i can be represented as follows:

T(r_i) = α [f_H(r_i, n)] + β [f_P(r_i, t_s)]        ----- (7)

The constants α and β are the two weights with respect to the two functions. These constants can be fixed to any value, depending on the system, to decide whether more importance is given to the historic value or to the present system capability of an entity.

3 Trust Computation Example

To illustrate how our Trust model works, we provide an application example of evaluating the Trust of a service provider using the proposed mathematical model. Let us assume there are N service providers and that each entity has built some level of Trust with the Grid environment, i.e. each has some previous transactions. We describe how our mathematical model helps to identify the suitable resources available in the Grid, and how the Trust value of every resource which took part in the service life cycle is updated. For example, we took 30 service providers, each with a different past historic Trust and different system features, as tabulated in Table 1.

Table 1. Simulation results of Trust parameters

Let us assume that all these resources are available and suitable for the user's requirement. The various trust metrics, namely the Success Rate, Availability Rate, CPU (GHz) and Bandwidth (MBps), and their associated resources are shown in Table 1. By applying our mathematical model, the Trust of all the resources has been computed. This Trust computation is based on the constraint that equal weightage is given to both the historic value and the present system capability. The various Trust values thus computed using our mathematical model are depicted in Figure 2.

Figure 2. Simulation results of Trust value

From Figure 2, Resource5 has the highest Trust value. Resource5 has the highest Availability and Success Rate, of 0.8 and 0.9 respectively. Similarly, looking at the Bandwidth and CPU power, it has 570 MBps and 1.5 GHz respectively. This shows that our selection criteria have been satisfied: the resource which has the highest value for both factors is selected. Similarly, experiments have been carried out with priority-based Trust computation. In this experiment, we have given more weightage to the CPU power and Bandwidth configuration than to the historic value. The same resource characteristics as depicted in Table 1 are taken into consideration and, in a similar fashion, the Trust value is computed using the mathematical model. The various Trust values thus computed are shown in Figure 3.

Figure 3. Simulation results of Trust value

From Figure 3, the resource having the highest Trust value is Resource22. Since our resource selection gives more weightage to the system capability than to the historic value, the resource which has the highest system capability should be selected. By applying our mathematical model, the resource having the highest Trust value is Resource22, with a Success Rate and Availability Rate of 0.5 and 0.6 respectively. Similarly, Resource22 has the highest Bandwidth and CPU power, of 890 MBps and 3.5 GHz respectively. After each transaction, the Availability of all the participating resources is updated. Similarly, the transaction result, whether a success or a failure, is updated for the particular resource entity. These updated results then act as the source of the historic trust when it is computed for the next transaction.
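The ranking step can be sketched by combining Equations (5)-(7). Since the exact form of f_P is not given here, normalizing CPU and bandwidth by CMax and NMax and weighting with the constants C1 and C2 is an assumption, and the three providers below are hypothetical data, not the 30 providers of Table 1:

```python
def present_capability(cpu, bw, cmax, nmax, c1=0.5, c2=0.5):
    """Assumed form of f_P: CPU and bandwidth normalized by the
    fleet maxima CMax (Eq. 5) and NMax (Eq. 6), weighted by the
    arbitrary constants C1 and C2."""
    return c1 * (cpu / cmax) + c2 * (bw / nmax)

def trust(historic, cpu, bw, cmax, nmax, alpha=0.5, beta=0.5):
    """Equation (7): T(r_i) = alpha * f_H + beta * f_P."""
    return alpha * historic + beta * present_capability(cpu, bw, cmax, nmax)

# Hypothetical providers: (historic value, CPU in GHz, bandwidth in MBps).
providers = {
    "R5":  (0.85, 1.5, 570),
    "R22": (0.55, 3.5, 890),
    "R9":  (0.40, 2.0, 300),
}
cmax = max(c for _, c, _ in providers.values())  # Eq. (5)
nmax = max(b for _, _, b in providers.values())  # Eq. (6)
ranked = sorted(providers, key=lambda r: trust(*providers[r], cmax, nmax),
                reverse=True)
print(ranked[0])
```

With equal weights α = β = 0.5, the provider with the best combined historic record and system capability tops the ranking; shifting weight toward β reproduces the capability-biased selection discussed for Figure 3.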
4 Conclusion
In this paper, we have proposed a Trust Model architecture which manages and computes the Trust of any service entity available in the Grid. The Trust value thus evolved is fully objective in nature, which is an advantage over other models that rely on subjective measures. We have also proposed a mathematical model for computing the Trust value of any service entity. Further, this Trust architecture can be integrated into any Grid Meta-Scheduler to improve its resource selection module and hence increase the reliability of the Meta-Scheduler.
5 Acknowledgement
The authors would like to thank the Department of Information Technology, Ministry of Communications and Information Technology of India, for their financial support and encouragement in pursuing this research through its sponsored Centre for Advanced Computing Research and Education (CARE).
Research on Security Resource Management Architecture for Grid Computing System

Tu Guoqing
Computer School, Wuhan University, Wuhan, Hubei, China
Abstract- In a grid computing environment, large heterogeneous resources are shared among geographically distributed virtual organization members, each having their own resource management policies and different access and cost models. Many projects have designed and implemented resource management systems with a variety of architectures and services. Grid applications increasingly have multifunctional and security requirements. However, current techniques mostly protect only the resource provider from attacks by the user, while leaving the user comparatively dependent on the good behavior of the resource provider. In this paper, we analyze the security problems existing in Grid Computing Systems, describe their security mechanisms, and propose a domain-based trustworthy resource management architecture for grid computing systems. Keywords: grid computing, resource management
1 Introduction

Grid applications are distinguished from traditional client-server applications by their simultaneous use of large numbers of resources, dynamic resource requirements, use of resources from multiple administrative domains, complex communication structures, and stringent performance requirements, among others [1]. Many of these applications rely on the ability of message-processing intermediaries to forward messages, and controlling access to applications through robust security protocols and security policy is paramount to controlling access to VO resources and assets. Thus, authentication mechanisms are required so that the identity of individuals and services can be established, and service providers must implement authorization mechanisms to enforce policy over how each service can be used. The security challenges faced in a Grid environment can be grouped into three categories: integration with existing systems and technologies, interoperability with different "hosting environments", and trust relationships among interacting hosting environments. The security of a Grid Computing System must address the following problems: user masquerade, server masquerade, data wiretapping and tampering, remote attacks, resource abuse, malicious programs, and system integrity. A Grid Computing System is a complicated, dynamic and wide-area system, and simply adding restricted authorization for users cannot be achieved with current technologies, so developing a new security architecture is necessary.
Some well-known security models have already been put into application, such as the GT2 Grid Security Model and the GT3 Security Model for OGSA [2]. Based on a deep analysis and comparison of many kinds of security models for resource management in grid computing systems, this paper presents a domain-based security model for a Resource Management Architecture as an effective trustworthy security model for grid computing systems. The paper is organized as follows: Section 2 reviews the background and related work. Section 3 describes the architecture of the domain-based model and the advance reservation algorithm. Section 4 concludes.
2 Related work

Security is a major concern in Grid architectures because of the sharing inherent in Grid environments, which involves more than sharing data or basic computing resources within large organizations. Primarily, Grid environments aim at direct access to computers, software, data, and other resources, as required by a range of collaborative problem-solving and resource-brokering strategies. While crossing physical and organizational boundaries, Grid environments demand solutions to support security policies and the management of credentials. Support for remote access to computing and data resources must also be provided. Further, Grid technology encompasses wide permutations of mobile devices, gateways, proxies, load balancers, globally distributed data centers, demilitarized zones, etc.
2.1 Grid Security Challenges

The security challenges in a Grid environment can be grouped into three main categories: integration, interoperability, and trust relationships [3]. Regarding integration, it is unreasonable to expect that a single security technology can be defined that addresses all Grid security challenges and is adoptable in every hosting environment. Legacy infrastructure cannot be changed rapidly, and hence the security architecture in a Grid environment should integrate with existing security infrastructures and models. For example, each domain in a Grid environment is likely to have one or more registries in which user accounts are maintained; such registries are unlikely to be shared with other organizations or domains. Similarly, authentication mechanisms deployed in an existing environment that is considered secure and reliable will continue to be used. Each domain typically has its own authorization infrastructure that is deployed, managed and supported. It will not typically be acceptable to replace any of these technologies in favor of a single model or mechanism.
154
Int'l Conf. Grid Computing and Applications | GCA'08 |
Regarding interoperability, Grid technology is designed to operate services that traverse multiple domains and hosting environments. In order to interact correctly and efficiently with other systems, interoperability is needed at multiple levels, such as the protocol, policy and identity levels. The trust relationship problem is made more difficult in a Grid environment by the need to support the dynamic, user-controlled deployment and management of transient services. Trust relationships among the participating domains in a Grid environment are important for end-to-end traversals. This combination of dynamic policy overlays and dynamically created entities drives the need for three key functions in a Grid security model:

1. Multiple security mechanisms. Organizations participating in a VO often have significant investment in existing security mechanisms and infrastructure. Grid security must interoperate with, rather than replace, those mechanisms.

2. Dynamic creation of services. Users must be able to create new services (e.g., "resources") dynamically without administrator intervention. These services must be coordinated and must interact securely with other services. Thus, we must be able to name the service with an assertable identity and to grant rights to that identity without contradicting the governing local policy.

3. Dynamic establishment of trust domains. In order to coordinate resources, VOs need to establish trust not only among the users and resources in the VO, but also among the VO's resources, so that they can be coordinated. These trust domains can span multiple organizations and must adapt dynamically as participants join, are created, or leave the VO.
In summary, security challenges in a Grid environment can be addressed by categorizing the solution areas: (1) integration solutions, where existing services need to be used and interfaces should be abstracted to provide an extensible architecture; (2) interoperability solutions, so that services hosted in different virtual organizations with different security mechanisms and policies can invoke each other; and (3) solutions to define, manage and enforce trust policies within a dynamic Grid environment. The dependency between these three categories is illustrated in Fig. 1.

Fig.1 Categories of security challenges: Integrate (extensible architecture, use of existing services), Interoperate (secure interoperability, protocol mapping), Trust (trust relationships, trust establishment)
2.2 Grid Security Requirements

Security is one of the characteristics of an OGSA-compliant component. The basic requirements of an OGSA security model are that security mechanisms be pluggable and discoverable by a service requestor from a service description. OGSA security must be seamless from the edge of the network to the application and data servers, and allow the federation of security mechanisms not only at intermediaries, but also on the platforms that host the services being accessed. The basic OGSA security model must address the following security disciplines:

(1) Authentication. Provide plug points for multiple authentication mechanisms and the means for conveying the specific mechanism used in any given authentication operation. The authentication mechanism may be a custom authentication mechanism or an industry-standard technology. The authentication plug point must be agnostic to any specific authentication technology.

(2) Delegation. Provide facilities to allow for delegation of access rights from requestors to services, as well as to allow for delegation policies to be specified. When dealing with delegation of authority from one entity to another, care should be taken so that the authority transferred through delegation is scoped only to the tasks intended to be performed, and within a limited lifetime, to minimize the misuse of delegated authority.

(3) Single Logon. Relieve an entity that has successfully completed the act of authentication once from the need to participate in re-authentications upon subsequent accesses to OGSA-managed resources for some reasonable period of time. This must take into account that a request may span security domains, and hence should factor in federation between authentication domains and mapping of identities. This requirement is important from two perspectives: a) it places a secondary requirement on an OGSA-compliant implementation to be able to delegate an entity's rights, subject to policy; b) if the credential material is delegated to intermediaries, it may be augmented to indicate the identity of the intermediaries, subject to policy.

(4) Credential Lifespan and Renewal.
In many scenarios, a job initiated by a user may take longer than the life span of the user's initially delegated credential. In those cases, the user needs the ability to be notified prior to expiration of the credentials, or the ability to refresh those credentials so that the job can be completed.

(5) Authorization. Allow for controlling access to OGSA services based on authorization policies attached to each service. Also allow service requestors to specify invocation policies. Authorization should accommodate various access control models and implementations.

(6) Privacy. Allow both a service requester and a service provider to define and enforce privacy policies, for instance taking into account things like personally identifiable information (PII), purpose of invocation, etc. (Privacy policies may be treated as an aspect of authorization policy addressing privacy semantics such as information usage rather than plain information access.)

(7) Confidentiality. Protect the confidentiality of the underlying communication (transport) mechanism, and the confidentiality of the messages or documents that flow over the transport mechanism in an OGSA-compliant infrastructure. The confidentiality requirement includes point-to-point transport as well as store-and-forward mechanisms.

(8) Message integrity. Ensure that unauthorized changes made to messages or documents can be detected by the recipient. The use of message- or document-level integrity checking is determined by policy, which is tied to the offered quality of service (QoS).

(9) Policy exchange. Allow service requestors and providers to dynamically exchange security (among other) policy information to establish a negotiated security context between them. Such policy information can contain authentication requirements, supported functionality, constraints, privacy rules, etc.

(10) Secure logging. Provide all services, including the security services themselves, with facilities for time-stamping and securely logging any kind of operational information or event in the course of time - securely meaning here reliably and accurately, i.e. so that such collection is neither interruptible nor alterable by adverse agents. Secure logging is the foundation for addressing requirements for notarization, non-repudiation, and auditing.

(11) Assurance. Provide means to qualify the security assurance level that can be expected of a hosting environment. This can be used to express the protection characteristics of the environment, such as virus protection, firewall usage for Internet access, internal VPN usage, etc. Such information can be taken into account when deciding in which environment to deploy a service.

(12) Manageability. Explicitly recognize the need for manageability of security functionality within the OGSA security model: for example, identity management, policy management, key management, and so forth. The need for security management also includes higher-level requirements such as anti-virus protection and intrusion detection and protection, which are requirements in their own right but are typically provided as part of security management.

(13) Firewall traversal.
A major barrier to dynamic, cross-domain Grid computing today is the existence of firewalls. As noted above, firewalls provide limited value within a dynamic Grid environment. However, it is also the case that firewalls are unlikely to disappear anytime soon. Thus, the OGSA security model must take them into account and provide mechanisms for cleanly traversing them—without compromising local control of firewall policy. (14) Securing the OGSA infrastructure. The core Grid service specification (OGSI) presumes a set of basic infrastructure services, such as handleMap, registry, and factory services. The OGSA security model must address the security of these components. In addition, securing lower level components that OGSI relies on would enhance the security of the OGSI environment.
2.3 GT2 Grid Security Model The Globus Toolkit version 2 (GT2) includes services for Grid Resource Allocation and Management (GRAM), Monitoring and Discovery (MDS), and data movement (GridFTP). These services use a common Grid Security Infrastructure (GSI) to provide security functionality [2]. GSI defines a
common credential format based on X.509 identity certificates and a common protocol based on transport layer security. Each GSI certificate is issued by a trusted party known as a certificate authority (CA), usually run by a large organization or commercial company. In order to trust the X.509 certificate presented by an entity, one must trust the CA that issued the certificate. A single entity in an organization can decide to trust any CA, without necessarily involving the organization as a whole. This feature is key to the establishment of VOs that involve only some portion of an organization, where the organization as a whole may provide little or no support for the VO. The Community Authorization Service (CAS) allows VOs to express policy, and it allows resources to apply policy that is a subset of the VO policy and the local policy. This process comprises three steps: (1) The user authenticates to CAS and receives assertions from CAS expressing the VO's policy in terms of how that user may use VO resources. (2) The user then presents the assertion to a VO resource along with the usage request. (3) In evaluating whether to allow the request, the resource checks both local policy and the VO policy expressed in the CAS assertion.
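The three-step CAS flow can be sketched as a toy Python model; the `CAS`, `Assertion` and `Resource` classes below are illustrative stand-ins, not the real Globus CAS API:

```python
# Toy sketch of the three-step CAS flow: the user obtains a VO-policy
# assertion, presents it with a request, and the resource enforces the
# intersection of VO policy and local policy. All names are hypothetical.

from dataclasses import dataclass

@dataclass
class Assertion:
    """A signed statement from CAS: what the VO allows this user to do."""
    user: str
    allowed_actions: frozenset

class CAS:
    def __init__(self, vo_policy):
        # vo_policy: user -> set of actions the VO grants
        self.vo_policy = vo_policy

    def authenticate(self, user):
        # Step 1: the user authenticates and receives a VO-policy assertion.
        return Assertion(user, frozenset(self.vo_policy.get(user, ())))

class Resource:
    def __init__(self, local_policy):
        # local_policy: actions the resource owner permits, regardless of VO.
        self.local_policy = frozenset(local_policy)

    def request(self, assertion, action):
        # Steps 2-3: the user presents the assertion with the request; the
        # resource grants it only if BOTH local and VO policy allow it.
        return action in self.local_policy and action in assertion.allowed_actions

cas = CAS({"alice": {"read", "write"}})
storage = Resource(local_policy={"read"})  # local admin allows only reads

token = cas.authenticate("alice")
print(storage.request(token, "read"))   # True: allowed by both policies
print(storage.request(token, "write"))  # False: VO allows it, local policy does not
```

The design point mirrors the text: the resource never grants more than the subset allowed by both the VO and the local administrator.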
2.4 GT3 Grid Security Model Version 3 of the Globus Toolkit (GT3) and its accompanying Grid Security Infrastructure (GSI3) provide the first implementation of OGSA mechanisms. GT3's security model seeks to allow applications and users to operate on the Grid in as seamless and automated a manner as possible. Security mechanisms should not have to be instantiated in an application but instead should be supplied by the surrounding Grid infrastructure, allowing the infrastructure to adapt on behalf of the application to meet the application's requirements. The application should need to deal only with application-specific policy. GT3 uses the following powerful features of OGSA and Web services security to work toward this goal: (1) It casts security functionality as OGSA services so that they can be located and used as needed by applications. (2) It uses sophisticated hosting environments to handle security for applications, allowing security to adapt without changes to the application. (3) It publishes service security policy so that clients can dynamically discover what credentials and mechanisms are needed to establish trust with the service. (4) It specifies standards for the exchange of security tokens to allow for interoperability. In order to establish trust, two entities need to be able to find a common set of security mechanisms that both understand. The use of hosting environments and security services, as described previously, enables OGSA applications and services to adapt dynamically and use different security mechanisms. The published policy in OGSA can express requirements for mechanisms, acceptable trust roots, token formats, and other security parameters. An application wishing to interact with the service can examine this published policy and gather the needed credentials and functionality by contacting appropriate OGSA security services. The handling of a secure request can be described in the following steps. Firstly, the client's hosting environment retrieves and
inspects the security policy of the target service to determine what mechanisms and credentials are required to submit a request. Secondly, if the client’s hosting environment determines that the needed credentials are not already present, it contacts a credential conversion service to convert existing credentials to the needed format, mechanism, and/or trust root. Thirdly, the client’s hosting environment uses a token processing and validation service to handle the formatting and processing of authentication tokens for exchange with the target service. This service relieves the application and its hosting environment from having to understand the details of any particular mechanism. Fourthly, on the server side, the hosting environment likewise uses a token processing service to process the authentication tokens presented by the client. Lastly, after authentication and the determination of client identity and attributes, the target service’s hosting environment presents the details of the request and client information to an authorization service for a policy decision.
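The five client/server steps above can be condensed into a runnable sketch; every class below is a hypothetical stand-in for a GT3 hosting-environment service, reduced to strings and dictionaries rather than real WSRF/OGSA services:

```python
# Minimal runnable sketch of the five-step secure-request flow described
# above. TokenService and TargetService are illustrative stubs only.

class TokenService:
    """Formats and validates mechanism-specific authentication tokens."""
    def make_token(self, identity, mechanism):
        return f"{mechanism}:{identity}"          # step 3 (client side)
    def validate(self, token, mechanism):
        kind, _, identity = token.partition(":")  # step 4 (server side)
        return identity if kind == mechanism else None

class TargetService:
    def __init__(self, mechanism, authorized):
        self.policy = {"mechanism": mechanism}    # published security policy
        self.tokens = TokenService()
        self.authorized = set(authorized)
    def handle(self, token, request):
        identity = self.tokens.validate(token, self.policy["mechanism"])
        # step 5: authorization decision after authentication
        if identity in self.authorized:
            return f"ok: {request} for {identity}"
        raise PermissionError("denied")

def submit(client_identity, client_credentials, service, request):
    policy = service.policy                       # step 1: inspect policy
    mech = policy["mechanism"]
    if mech not in client_credentials:            # step 2: convert credential
        client_credentials[mech] = f"converted-{client_identity}"
    token = TokenService().make_token(client_identity, mech)
    return service.handle(token, request)

svc = TargetService("x509", authorized={"alice"})
print(submit("alice", {"kerberos": "tgt"}, svc, "run-job"))
# -> ok: run-job for alice
```

The sketch keeps the division of labor from the text: the client side never needs to understand the token mechanism itself, only to delegate formatting to a token service.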
3 Domain-based security architecture for grid computing system Based on the five-layered security architecture, and considering the design and implementation of the Grid security project, we propose a domain-based trustworthy resource management architecture for the grid computing system. The security architecture, which we briefly presented at GCC2002, is shown in Fig. 2. The domain-based security architecture assumes that each group of VOs is protected by special security VOs that trust each other. These security VOs are responsible for authorizing access to services/resources within that group. All delegations are stored by the security agent, which has the ability to reason about them. A requester can exercise a right or access a resource by providing its identity and/or authorization information to the security VO. The security VO checks this information for validity, and reads its policies to verify that the requester has the right. If the requesting VO does not have the right, the security VO returns an error message; otherwise it forwards the request to the VO in charge of the resource (the access VO), along with a message saying that the request is authorized by the security agent. As the security VO is trusted by every other VO in the system, the requesting VO is granted access. If the access VO has the computing power to reason about certificates, rights and delegations, the request can be sent directly to it instead of via the security VO.
Fig. 2 Security architecture for grid computing system (layers, top to bottom: Grid security application layer; Grid security protocol layer; Grid security basic layer; System and network security technology layer; Node and interconnection layer)
3.1 Definitions of elements of the security architecture (1) An Object is a resource or process of the Grid Computing System; an Object is protected by security policy. A resource may be a file, memory, a CPU, a piece of equipment, etc. A process may be a process running on behalf of a user, a process running on behalf of a resource, etc. "O" denotes an Object. (2) A Subject is a user, resource or process of the Grid Computing System; a Subject may destroy an Object. A resource may be a file, memory, a CPU, a piece of equipment, etc. A process may be a process running on behalf of a user, a process running on behalf of a resource, etc. "S" denotes a Subject.
(3) A Trust Domain is a logical, administrative region of the Grid Computing System; a Trust Domain has a clear border. "D" denotes a Trust Domain. (4) Representation of Objects: there are two kinds of Object in the Grid Computing System, Global Objects and Local Objects. A Global Object is the abstraction of one or more Local Objects. Global Objects and Local Objects exist in the Grid Computing System at the same time. (5) Representation of Subjects: there are two kinds of Subject in the Grid Computing System, Global Subjects and Local Subjects. A Global Subject is the abstraction of one or more Local Subjects. Global Subjects and Local Subjects exist in the Grid Computing System at the same time.
(6) Representation of Trust Domains: there are two kinds of Trust Domain in the Grid Computing System, the Global Trust Domain DG and the Local Trust Domain DL. The Global Trust Domain is the abstraction of all Local Trust Domains. Global and Local Trust Domains exist in the Grid Computing System at the same time. A Trust Domain of the Grid Computing System consists of three elements: the Objects existing in the Trust Domain, the Subjects existing in the Trust Domain, and the Security Policy that protects the Objects against the Subjects. A Trust Domain can be denoted by D=({O},{S},P), where D denotes the Trust Domain, {O} denotes the set of all Objects existing in the Trust Domain, {S} denotes the set of all Subjects existing in the Trust Domain, and P denotes the Security Policy of the Trust Domain. The Global Trust Domain can be denoted by DG=({OG},{SG},PG), and a Local Trust Domain by Di=({Oi},{Si},Pi), i=1,2,3…
Let us assume that there are two domains, DOM1 and DOM2, collaborating on a certain project. If Bob, an administrator at DOM1, wants to access the database of the client, DOM2, and if Bob has the permission to do so, he sends a Request for Action to his own security agent. The security agent returns an authorization certificate, which Bob uses to access the database. We also assume that Bob has the permission to access the database and that this permission can be delegated. Bob wants all users to be able to access the database as well, so he sends a certificate containing a delegate statement to the security VO. The architecture of domain-based security for the Grid Computing System is shown in Fig. 3.
Fig. 3 Architecture of domain-based security (building blocks: secure conversations; access control enforcement; end-point policy; authorization; mapping rules; policy; privacy policy; intrusion detection; anti-virus management; policy management; user management; key management; policy expression and exchange; trust model; transport, protocol and message security)
3.2 Policy and implementation of the domain-based security architecture The Grid Computing System is abstracted into elements such as Objects, Subjects, Security Policies, Trust Domains, Operations and Authorizations. The Grid Computing System is composed of four parts: the Global Trust Domain, the Local Trust Domains, the Operations and the Authorizations [5],[6]. It can be denoted by G=(DG,{Di},{Oj},{Ak}), i=1,2,3… j=1,2,3… k=1,2,3…, where G denotes the Grid Computing System, DG denotes the Global Trust Domain, {Di} denotes the set of all Local Trust Domains, {Oj} denotes the set of all Operations, and {Ak} denotes the set of all Authorizations. The security of the Grid Computing System can be regarded as the relationship among these basic elements. That is to say, "a user accesses and uses resources" can be abstracted as "a Subject operates on an Object", denoted by S—OP—>O. By checking the relationship of the Subject, the Object and the Security Policy, we can examine whether the Subject can operate on the Object, and thus tell whether the user can access the resource.
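The "Subject operates on Object" check can be illustrated with a minimal sketch, assuming a Trust Domain is represented as the tuple D=({O},{S},P) defined above; the triple-based policy encoding is an assumption for illustration:

```python
# Toy rendering of the S--OP-->O check: a Trust Domain D = ({O},{S},P)
# decides whether a Subject may apply an Operation to an Object.

def permitted(domain, subject, op, obj):
    objects, subjects, policy = domain
    # Both parties must belong to the domain, and the policy must list
    # the (subject, op, object) triple.
    return subject in subjects and obj in objects and (subject, op, obj) in policy

d1 = ({"file1", "cpu0"},            # {O}: Objects in the domain
      {"alice", "bob"},             # {S}: Subjects in the domain
      {("alice", "read", "file1")}) # P: allowed (S, OP, O) triples

print(permitted(d1, "alice", "read", "file1"))  # True
print(permitted(d1, "bob", "read", "file1"))    # False: no policy triple
```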
This policy consists of authorization and delegation policies. Authorization policies give the rules for checking the validity of requests for actions; an example of an authorization rule is checking the identity certificate of an agent and verifying that the agent has an axiomatic right. Delegation policies describe the rules for the delegation of rights; a delegation rule might check that an agent has the ability to delegate before allowing the delegation to be approved. A policy also contains basic or axiomatic rights, and rights associated with roles. We introduce the concept of primitive or axiomatic rights, which are rights that all individuals possess and that are stored in the global policy; for example, there are basic rights that are rarely expressed explicitly but used implicitly. A policy can thus be viewed as a set of rules for a particular domain that defines what permissions a user has and what permissions she/he can obtain. The domain-based security policy is applied as follows. (1) SA-DOM2 loads the domain policy for DOM2 and loads a global shared policy. (2) SA-DOM1 loads the domain policy for DOM1 and loads a global shared policy.
(3) SA-DOM2 sends a message to SA-DOM1 saying that SA-DOM1 has the right to delegate access to db5, a database in DOM2, to all users:
a) tell(sa-Dom2, sa-Dom1, idelegate(StartTime, EndTime, sa-Dom2, sa-Dom1, canDo(X, accessDB(db5), user(X,DOM1)), true, true)).
b) SA-DOM1 asserts the proposition: delegate(IssueTime, StartTime, EndTime, sa-Dom2, sa-Dom1, canDo(X, accessDB(db5), user(X,DOM1)), true, true).
c) SA-DOM1 gives all administrators the right to access db5, but not the ability to delegate: tell(sa-Dom1, sa-Dom1, idelegate(StartTime, EndTime, sa-Dom1, X, canDo(X, accessDB(db5), true), role(X, administrator), false)).
d) A delegate statement is inserted into the knowledge base: delegate(IssueTime, StartTime, EndTime, sa-Dom1, X, canDo(X, accessDB(db5), true), role(X, administrator), false).
(4) Bob requires some information from the database db5 at DOM2. He sends a request to SA-DOM1 along with his certificate: request(Bob, accessDB(db5)).
(5) SA-DOM1 knows that the request is from Bob because of his certificate. It then checks the rules to see whether Bob, as an administrator, has access to db5. As this is true, SA-DOM1 creates an authorization certificate and sends it back to Bob.
(6) Bob sends a request to SA-DOM2 with his ID certificate and the authorization certificate: request(Bob, accessDB(db5)).
(7) SA-DOM2 verifies both certificates and checks its policy. SA-DOM2 approves the access, and the request is sent to the agent controlling access to the database.
(8) If Eric, an ordinary user, tries to access the database db5, his request will fail because SA-DOM1 has granted the right only to administrators.
If all these steps complete successfully, the target hosting environment then presents the authorized request to the target service application. The application, knowing that the hosting environment has already taken care of security, can focus on application-specific request processing.
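A toy Python rendering of steps (3c)-(8): the delegate statement is modeled as a tuple in a small knowledge base, and a role check replaces the Prolog-style condition. All names and the tuple layout are hypothetical:

```python
# Tiny knowledge base modelling SA-DOM1's delegation rules: access to
# db5 is granted to anyone with the administrator role, with no right
# to redelegate. Illustrative sketch only.

roles = {"Bob": "administrator", "Eric": "user"}

# delegate(issuer, condition, right, can_redelegate), mirroring step 3c-d
kb = [("sa-Dom1", ("role", "administrator"), ("accessDB", "db5"), False)]

def authorize(user, right):
    """Steps 4-5: check whether the rules grant `user` the requested right."""
    for _issuer, (cond, value), granted, _redelegate in kb:
        if granted == right and cond == "role" and roles.get(user) == value:
            return f"cert({user})"   # authorization certificate
    return None                      # request denied

print(authorize("Bob", ("accessDB", "db5")))   # certificate: Bob is an admin
print(authorize("Eric", ("accessDB", "db5")))  # None: step 8, Eric is not
```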
4 Conclusions Grid computing presents a number of security challenges that are met by the Globus Toolkit's Grid Security Infrastructure (GSI). Version 3 of the Globus Toolkit (GT3) implements the emerging Open Grid Services Architecture; its GSI implementation (GSI3) takes advantage of this evolution to improve on the security model used in earlier versions of the toolkit. GSI3 remains compatible (in terms of credential formats) with GT2, while eliminating privileged network services and making other improvements. Its development provides a basis for a variety of future work. In particular, we have proposed a domain-based trustworthy resource management architecture for the grid computing system, which we believe will be very useful in grid computing systems.
5 References [1] I. Foster, C. Kesselman, G. Tsudik, S. Tuecke. A Security Architecture for Computational Grids. Proc. 5th ACM Conference on Computer and Communications Security, pp. 83-92, 1998. [2] I. Foster and C. Kesselman. Globus: A Toolkit-Based Grid Architecture. In I. Foster and C. Kesselman, eds., The Grid: Blueprint for a New Computing Infrastructure, Morgan Kaufmann, 1999, pp. 259-278. [3] OGSA-SEC-WG Draft, "Security Architecture for Open Grid Services", https://forge.gridforum.org, 2007. [4] M. Blaze, J. Feigenbaum, J. Lacy. "Decentralized Trust Management", Proc. 17th IEEE Symposium on Security and Privacy, 1996. [5] C. M. Ellison et al., "SPKI Certificate Theory", RFC 2693, Internet Society, 1999. [6] W. Johnston, C. Larsen. "A Use-Condition Centered Approach to Authenticated Global Capabilities: Security Architectures for Large-Scale Distributed Collaboratory Environments". Available at: http://www-itg.lbl.gov/Security/Arch/publications.html [7] A. Herzberg, Y. Mass, J. Michaeli, D. Naor, Y. Ravid. "Access Control Meets Public Key Infrastructure, or: Assigning Roles to Strangers". Available at: http://www.hrl.il.ibm.com/TrustEstablishment/paper.htm [8] IBM, Microsoft, RSA Security and VeriSign. Web Services Trust Language (WS-Trust), 2002.
SESSION GRID UTILITIES, SYSTEMS, TOOLS, AND ARCHITECTURES Chair(s) TBA
On Architecture of the Economic-aware Data Grid Thuy T. Nguyen1,2, Thanh D. Do1, Tung T. Doan2, Tuan D. Nguyen2, Trong Q. Duong2, Quan M. Dang3 1 Department of Information Systems, Hanoi University of Technology, Hanoi, Vietnam 2 High Performance Computing Center, Hanoi University of Technology, Hanoi, Vietnam 3 School of Information Technology, International University, Germany {thuynt-fit, thanhdd-fit, tungdt-hpc}@mail.hut.edu.vn
Abstract – Data Grid has been adopted as the next-generation platform by many scientific communities to share, access, transport, process, and manage large data collections distributed worldwide. The increasing popularity of the Data Grid as a solution to large-dataset and large-scale processing problems promises its adoption by many organizations that currently have a great demand for it. This requires applying an innovative business model to the conventional Data Grid. In this paper, we propose a business model based on an outsourcing approach and a framework, called the Economic-aware Data Grid, that takes responsibility for coordinating operations. The framework works in an economic-aware way to minimize the costs of the organization. Keywords: Data Grid, economic-aware, business model, grid computing, outsourcing.
1 Introduction
Inspired by the need to share, access, transport, process and manage large data collections distributed worldwide among universities, laboratories and High Performance Computing Centers (HPCCs), the Data Grid has appeared and evolved as the next-generation distributed storage platform for many scientific communities [1],[2] as well as industrial companies. Many enterprises, such as a large, multinational financial service group with a range of activities covering banking, brokerage, and insurance, show a great demand for handling business operations that deal with distributed data sets. The amount of data that has to be retained and manipulated has been growing rapidly. These organizations have to deal with the problems of organizing massive volumes of data and running data mining applications under certain conditions, such as a limited budget and limited resources. Among related existing technologies, the scientific Data Grid seems to be the most suitable one to solve the two problems above. Its model usually includes many HPCCs with enormous storage capacity and computing power. However, it has some drawbacks. First, building a Data Grid requires a lot of money, and the financial sources for building such a Data Grid are governments or scientific funding foundations.
Hence, researchers within the Data Grid can use those resources freely. Second, the resources might be used unfairly or inefficiently. Because of these major drawbacks, only a few business applications are built around these Grid solutions. In this paper, we integrate the Software as a Service (SaaS) [3] business model into the Data Grid to overcome the disadvantages of the scientific Data Grid. In particular, we propose a business model consisting of three main participants: the resource provider, the software provider and the organization deploying the economic-enhanced Data Grid for its business operations. The Data Grid works in an economic-aware way by complementing the necessary scenarios and components. The rest of this paper is organized as follows. In Section 2, we explain in detail our motivation for the economic-aware Data Grid. The proposed business model is presented in Section 3. In Section 4, we describe the high-level system design. We discuss related work in Section 5. After discussing some problems that need to be considered in our framework in Section 6, we present future work and conclude the paper.
2 Motivation
The model of the scientific Data Grid does not fully meet the requirements of enterprises. Thus, an economic-enhanced Data Grid is an explicit choice over the other possibilities. It is worth describing here, in more detail, why they should make this choice. An economic-aware Data Grid can save a lot of money in resource investment: it not only provides the capability of using resources efficiently but also ensures fairness among participants. Consider the case of an investment bank with many geographically distributed branches, each of which has its own business. Each branch usually runs data mining applications over the set of collected financial data. With time, this activity becomes important and needs to be extended. This leads to two challenges. First, the computing tasks need more computing power and storage capacity. Second, the data source is not just the branch itself but also the other branches. Because all the branches belong to the investment bank, the data can be shared among branches with a suitable
authorization policy. Thus, the bank needs a technology to share large data sets effectively. To deal with both problems, it is necessary to have a Data Grid in the investment bank. We can see many similar scenarios in the real world.
To build such a Data Grid, one solution is to apply the approach of the scientific Data Grid, in which each branch invests in building its own computer center. Those computer centers are then connected together to form a Data Grid, and users in each branch can use the Grid freely. However, this approach has many disadvantages. First, it costs a lot of money for hardware and software investment. The branches also need money to operate and maintain the computer center, such as money for electric power, human resources and so on. The initial investment and maintenance cost could take most of the budget of a branch. Second, the resource utilization is not efficient. Usually, data mining applications are executed once all the financial data have been collected. This happens at the end of a month, quarter or year. In those periods, all computing resources are employed and the workload is always 100%. In normal periods, the workload is much lower; thus, many computers run wastefully. Finally, there may be unfair resource usage on the Grid. Because the individual departments within an investment bank are very competitive, the notion of freely sharing their resources is very difficult to accept. Some branches may contribute little resource to the Grid but use a lot.
Another approach is outsourcing. This means each branch does not invest in building a computer center itself but hires resources from resource providers and pays per use. In other words, the investment bank builds a Data Grid over a business Grid. This approach overcomes the disadvantages discussed above. It brings many benefits to the investment bank and its branches, as follows:
• Economy and efficiency. The users can obtain resources whenever they need them. The degree of need is expressed under user control and can be described with high precision for resource use now and for future times of interest, such as before deadlines. Thanks to the pay-per-use characteristic, users are sure to obtain great benefits by saving the large amount of money otherwise invested in their own computing center. Additionally, by hiring the necessary resources at run-time, they avoid the wastefulness of computing resources noted above.
• Fair sharing. To use resources, users have to pay through the accounting and charging service. Thus, the more a branch uses resources, the more it has to pay.
However, up to now, there has been no business model or technical solution to realize this approach. The work in this paper is a first attempt to solve this issue. In particular, the main contribution of the paper is the proposal of a business model and the design of an economic-aware Data Grid framework. In the next sections, we present our business model and explain how it meets the expectations above.
3 Proposed business model
Figure 1. The business model of the system
3.1 Participants
The business model includes three main participants, as illustrated in Figure 1: the resource provider, the software provider, and the organization deploying the Data Grid.
3.1.1 Resource providers
The resource provider provides server, storage, and network resources. The providers already have their own accounting, charging, and billing modules, as well as job deployment modules. They offer storage capacity and computing power as services. We assume that the price of using a resource is published. To ensure quality of service (QoS), the resource providers should have advance resource reservation mechanisms. The users can be charged for the storage capacity, the number of computing nodes, and the amount of bandwidth they use.
3.1.2 Software providers
Software providers are business entities that provide software services. In particular, they provide software and its license to ensure that the software can work under the negotiated conditions. The income of the software provider comes from selling software licenses.
3.1.3 Organization deploying the Data Grid
The organization consists of many branches distributed worldwide. Instead of building a computer center for each branch, the organization has a central Information Technology (IT) department. This department runs an economic-aware Data Grid middleware responsible for coordinating internal data sharing among branches.
The Economic-aware Data Grid should belong to the organization deploying the Data Grid, for several reasons. First, it saves the organization the cost of using an external broker service. Second, it is easier for the organization to apply cost-optimization policies when it has its own control system. Finally, handing the data management task to a third party is less trustworthy than keeping it in-house. The goal of the economic-aware Data Grid is to manage the Data Grid in a way that minimizes the costs of the organization.
3.2 Working mechanism
To use the Data Grid's services, users in each branch have to join the system. This can be done by requesting and obtaining certificates through their browsers from the grid administrators. Then, users can use the economic-aware Data Grid. The Economic-aware Data Grid performs two main, closely related tasks. The first task is the data management service, which includes the data transfer service, the replication service, authorization, and so on. The second task is the job execution service: it receives the requirements from users, gets the software, locates the data, reserves computing resources, deploys and runs the software, and returns the result. The output data must be stored or replicated somewhere on the Data Grid. The contracts with software providers and resource providers are realized with Service Level Agreement negotiation (SLA negotiation) [4],[5]. Obviously, the pay-per-use model brings the advantages of saving money and efficiency, as analyzed in Section 2. Regarding fair sharing, we can look at the scenario in which a user in each branch puts, gets and finds data and runs jobs on the Data Grid. The branch has to pay the cost of using storage and computing power to the resource providers. It also has to pay the cost of using software to the software providers. The storage service cost includes data transfer in/out costs. Thus, if a user in branch 2 conducts many transfers from the data storage of branch 1, letting branch 1 pay for the transfer cost is unfair. It is therefore necessary to have payments among branches to ensure fair sharing: the more resources a branch uses, the more it has to pay.
4 High-level system design
Figure 2. High-level system design
In this section, we show a design of the economic-aware Data Grid, which coordinates the operations in our business model. The high-level system design is illustrated in Figure 2. It consists of various services, which are described as follows.
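The inter-branch payment idea from Section 3.2 can be sketched as a simple charge-back computation; the rates and usage records below are invented for illustration:

```python
# Toy charge-back sketch for fair sharing: each transfer is billed to
# the branch that initiated it, not to the branch hosting the data.
# Rates and usage records are hypothetical.

RATE_CENTS_PER_GB = 10  # assumed provider price, cents per GB transferred

transfers = [  # (initiating_branch, hosting_branch, gigabytes)
    ("branch2", "branch1", 50),
    ("branch2", "branch1", 30),
    ("branch1", "branch3", 10),
]

def charges(records, rate):
    bill = {}
    for initiator, _host, gb in records:
        # the initiating branch pays, so heavy users of other branches'
        # data bear the transfer cost themselves
        bill[initiator] = bill.get(initiator, 0) + gb * rate
    return bill

print(charges(transfers, RATE_CENTS_PER_GB))
# -> {'branch2': 800, 'branch1': 100}
```

Billing the initiator rather than the host is exactly the fairness fix argued for above: branch 1 is not penalized for hosting data that branch 2 consumes.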
4.1 Data manipulation
Like any other Data Grid, data manipulation is a basic service in our system. It helps users to put files on the Grid as well as to find, access and download the files they want. As each branch has separate storage on the Grid, a file should be put in that storage. As illustrated in Figure 3, the system serves the user in the following order: (1) The Grid receives the requirements through the Grid Portal. (2) The Grid Portal invokes the Metadata Catalog Service (MCS) to find the appropriate information depending on the user's request: if the request is 'put', the MCS returns the data storage location (the store service address); if the request is 'find', 'download' or 'delete', the MCS returns the data location. (3) Based on the information provided by the MCS, the Grid Portal invokes services provided by the service providers to handle the request. (4) When the request completes or fails, the Grid Portal notifies the user. If the request is completed successfully, the Grid Portal stores the accounting information in the accounting service (5) and stores the relevant metadata in the MCS as well as the Replica Location Service (RLS) (6).
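The six-step flow can be sketched with hypothetical in-memory stubs for the Grid Portal and MCS; the real components are Grid web services, and step (3), invoking the provider's storage service, is elided:

```python
# Sketch of the data-manipulation flow above, reduced to two stubs.
# GridPortal and MCS are illustrative names, not real service APIs.

class MCS:
    """Metadata Catalog Service: maps logical names to locations."""
    def __init__(self, store_address):
        self.store_address = store_address
        self.catalog = {}
    def resolve(self, request, name):
        # 'put' -> where to store; 'find'/'download'/'delete' -> where it lives
        return self.store_address if request == "put" else self.catalog.get(name)

class GridPortal:
    def __init__(self, mcs):
        self.mcs = mcs
        self.accounting = []
    def handle(self, user, request, name):
        location = self.mcs.resolve(request, name)        # step 2
        if request != "put" and location is None:
            return f"{user}: {name} not found"            # step 4 (failure)
        if request == "put":
            self.mcs.catalog[name] = location             # step 6: metadata
        self.accounting.append((user, request, name))     # step 5
        return f"{user}: {request} {name} @ {location}"   # step 4 (success)

portal = GridPortal(MCS("storage.branch1"))
print(portal.handle("alice", "put", "data.csv"))
print(portal.handle("bob", "find", "data.csv"))
```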
Figure 3. Scenario of putting a file on Grid
4.2 Replication
The replication service is used to reduce access latency, improve data locality, and increase the robustness, scalability and performance of distributed applications. The system should analyze the pattern of previous file requests, and replicate files toward sites that show a correspondingly increased frequency of file access requests [6]. The replication function is performed by the Data Replication Service module.
Figure 4. Scenario of replication on Grid
The operation of the Data Replication Service is shown in Figure 4 and can be described as follows: (1) The Data Replication Service receives a request, reads it and interprets it. (2) The Data Replication Service invokes the scheduling service to find a suitable replication location. The scheduling service discovers candidate resources, matches the user's requirements against the candidate resources in an optimal way, and then returns the selected resources to the Data Replication Service. (3) The Data Replication Service reserves bandwidth with the resource providers through an SLA. (4) The Data Replication Service invokes the file transfer service of the determined resource provider to transfer the data. (5) The Data Replication Service invokes the monitoring module to monitor the QoS. (6-7) If the operation is completed successfully, the Data Replication Service stores the data information in the MCS and RLS. (8) It also stores the accounting information in the accounting service.
4.3 Job Execution
When a user wants to run a job, he provides the software's name, the names of the input/output data, the resource requirements and a deadline to complete the job. The scenario of job execution is illustrated in Figure 5 and can be described as follows: (1) The Grid Portal receives the user's request. (2) The Grid Portal invokes the SaaS service. (3) The SaaS invokes the Software Discovery service to find the location of the software provider. (4) The SaaS invokes the MCS and RLS to find the location of the data file. (5) The SaaS invokes the Scheduling service to find a suitable resource provider. (6) The SaaS signs SLA contracts to hire software, computing resources and bandwidth from the software providers as well as the resource providers. (7) The SaaS transfers the software and data to the execution site and executes the job. (8) During the execution, the monitoring module is invoked to observe the QoS. (9) If an error occurs, the SaaS invokes the Error Recovery module. (10) When the execution finishes, the SaaS moves the output data to the defined places and updates the MCS and RLS. (11) The SaaS also stores accounting information in the accounting service.
Figure 5. Scenario of job execution on Grid
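The eleven-step job-execution scenario condenses into the following sketch, where each lambda stands in for one of the services named above; every service name and return value is hypothetical:

```python
# Condensed sketch of the job-execution scenario: discovery, data
# location, scheduling, SLA signing, execution and accounting are
# modeled as pluggable callables in a dictionary.

def run_job(saas, request):
    sw_site = saas["discover"](request["software"])                 # step 3
    data = saas["locate"](request["input"])                         # step 4
    provider = saas["schedule"](request["resources"])               # step 5
    sla = saas["sign_sla"](sw_site, provider, request["deadline"])  # step 6
    result = saas["execute"](sw_site, data, provider, sla)          # steps 7-8
    saas["account"](request, provider)                              # steps 10-11
    return result

saas = {
    "discover": lambda s: f"provider-of-{s}",
    "locate":   lambda d: f"replica://{d}",
    "schedule": lambda r: "hpcc-2",
    "sign_sla": lambda sw, p, dl: {"site": p, "deadline": dl},
    "execute":  lambda sw, data, p, sla: f"output-of-{data} on {p}",
    "account":  lambda req, p: None,
}

print(run_job(saas, {"software": "miner", "input": "q3.dat",
                     "resources": "8 nodes", "deadline": "2008-06-30"}))
# -> output-of-replica://q3.dat on hpcc-2
```

Keeping each service behind a callable mirrors the paper's design: the SaaS orchestrates, while scheduling, SLA negotiation and accounting remain independently replaceable components.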
We emphasize that in our system, unlike general SaaS, the number of software packages is not large, so the Software Discovery module is relatively simple.
5 Related works
Most current research on economic Grids pays little attention to sharing large internal data sets effectively. Instead, it focuses on developing open Grid architectures that allow several providers and consumers to be interconnected and to trade services. Others are developing and deploying business models with the purpose of selling their own products and
services, such as GridASP [7], GRASP [8], GRACE [9], BIG [10] and the EU-funded project GridEcon [11]. These models usually do not involve several providers. For instance, Sun Utility Grid [12] and Amazon EC2 [13] provide on-demand computing resources, while VPG [14] and WebEx [15] provide certain on-demand applications (music, video on-demand, web conferencing, etc.) that do not relate to data sharing. Current large-scale Data Grid projects such as the Biomedical Informatics Research Network (BIRN) [16], the Southern California Earthquake Center (SCEC) [17], and the Real-time Observatories, Applications, and Data management Network (ROADNet) [18] use the San Diego Supercomputer Center (SDSC) Storage Resource Broker as the underlying Data Grid technology. These applications require widely distributed access to data by many people in many places. The Data Grid creates virtual collaborative environments that support distributed but coordinated scientific and engineering research. Economic aspects are not considered in those projects. In [19], a cost model for distributed and replicated data over a wide area network is presented. The cost factors of the model are the network, data server and application-specific costs. Furthermore, the problem of job execution is discussed from the viewpoint of sending the job to the required data (code mobility) or sending the data to a local site and executing the job locally (data mobility). However, in that model the cost is not money but job execution time. Under this assumption the system is only pseudo economic-aware. Moreover, the infrastructure works with a best-effort mechanism: QoS and resource reservation are not considered. Thus, it does not suit a business environment. Heiser et al. [20] proposed a commodity market of storage space within the Mungi operating system. The proposed system focuses on the extra accounting system used for backing-store management.
All accounting of the operations on storage objects can be done asynchronously without slowing down those operations. It is based on bank accounts from which rent is collected for the storage occupied by objects. Rent automatically increases as available storage runs low, forcing users to release unneeded storage. Bank accounts use a taxation system to prevent an excessive build-up of funds in underutilized accounts. However, the system considers only the storage resource, and its scope is limited to a single organization. Buyya [21], [22] discussed the possible use of economy in a scientific Data Grid environment. Specifically, a token-exchange approach is proposed to regulate demand for data access from the servers of the Data Grid. For example, a token may correspond to 10 KB of data volume. By default, a single user may only access as much data as he has tokens for. This gives other users a chance to access data. However, the amount of data that users can access for a given token depends on
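The token-exchange idea of [21], [22] can be made concrete with a minimal account model in which one token entitles a user to a fixed data volume (10 KB in the example above) and requests are denied once the account is empty. The class and method names are our own illustration, not Buyya's actual design.

```python
# Minimal sketch of the token-exchange scheme: access is metered in
# tokens, each worth a fixed data volume (here 10 KB, as in the text).

KB_PER_TOKEN = 10

class TokenAccount:
    def __init__(self, tokens):
        self.tokens = tokens

    def request(self, kilobytes):
        needed = -(-kilobytes // KB_PER_TOKEN)   # ceiling division
        if needed > self.tokens:
            return False                          # out of tokens: deny access
        self.tokens -= needed
        return True
```

Extensions discussed in the text but left open in the literature, such as token redistribution after expiration or mapping tokens to real money, would sit on top of such an account.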
various parameters such as demand, system load, QoS, etc. Users can trade off between QoS and tokens. The negotiation/redistribution of tokens after their expiration, their mapping to real money and the pricing policies of storage servers are not discussed. Moreover, this work focuses on the resource provider level, while we focus on the system built on top of commercial resource providers.
6 Discussion
According to a survey by the EU-funded project GridEcon [11], investment banks have been using Grid computing for at least 3-4 years. Currently, most investment banks have completed the exercise of linking their heterogeneous Grids and developing limited resource sharing. The next step is to create an internal shared computing platform for Utility Computing. Further planned evolutions include: (1) following a self-service approach, in which users submit jobs directly to the global grid without going through the IT department; (2) applying the SaaS model and open source to reduce costs, especially software-licensing costs; (3) adding SLA monitoring, policy management, and charge-back across heterogeneous resources. Our solution combines the last two. This brings great advantages (Section 2), but the biggest drawbacks are the dependence of data-intensive applications on network performance and the limitation of users' local access to the software. In the near future, these drawbacks should fade thanks to the rapid development of high-speed Internet access. Our work focuses on one specific part of this promising trend: we integrate the SaaS model with the Data Grid instead of the general Grid. Further research includes two issues: What is the scheduling mechanism for single jobs and workflow jobs? What is the hiring strategy for storage resources? The scheduling mechanism is one of the most important problems in any distributed system. In scientific Grids, a scheduling component (such as Legion [23], Condor [24], AppLeS [25, 26], NetSolve [27] or PUNCH [28]) decides which jobs are executed at which site based on certain cost functions. In the economic approach, the scheduling decision is made flexibly according to end users' requirements. Whereas a conventional model often deals with the software and hardware costs of running applications, the business model primarily charges end-users for the services they consume, based on the value they derive from them.
Pricing based on users' demand and the supply of resources is the main driver in the competitive, economic market model. The second issue is the hiring strategy for storage resources and software. Our system is based on a business Grid and works in an economic-aware way. Therefore, we must address the benefit trade-off between consumers and providers: consumers must pay for usage, and providers earn money for their storage and software resources. Nevertheless, how the
consumers hire the storage resources and software is a problem that depends on each organization deploying this system. They could hire on demand, meaning that they only pay when they actually use the resources. However, in many cases, users also want to reserve a resource long before using it. Providers could charge for storage usage, for data transfer, or for the quality of the storage and the QoS provided. The hiring strategy deserves careful attention in any economic-aware system.
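The on-demand versus advance-hiring trade-off can be made concrete with a toy cost model; the per-hour rate and the 30% advance discount below are invented for illustration only and are not figures from any provider.

```python
# Toy model of the two hiring strategies discussed above: pay-per-use
# on demand versus hiring a whole period in advance at a discounted
# rate. All numbers are illustrative assumptions.

def on_demand_cost(hours_used, rate_per_hour):
    # Only the hours actually consumed are billed.
    return hours_used * rate_per_hour

def advance_cost(hours_reserved, rate_per_hour, discount=0.3):
    # The whole reserved period is paid for, whether used or not.
    return hours_reserved * rate_per_hour * (1 - discount)
```

Under this model, advance hiring pays off only when the expected utilization of the reserved period exceeds 1 - discount (here, 70%), which is exactly the kind of organization-specific calculation the text alludes to.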
7 Conclusion and future work
In this paper, we identified a set of problems that current research has not yet solved completely. We proposed the economic-aware Data Grid as an appropriate solution to them. Our solution is based on two ideas. The first is integrating SaaS into the Data Grid. The second is that the system is based on a business Grid and works in an economic-aware way. We presented the high-level design and three basic operating scenarios of our system to demonstrate how it can tackle these problems. Our proposed system has many advantages. It brings economic benefit to business organizations. It also gives providers a chance to earn money from their resources and software. It also reduces the complexity of resource management compared with the scientific Data Grid. We strongly believe that the economic-aware Data Grid in particular, and the economic-aware Grid in general, will play an increasingly important role in the development of Grid computing in the near future. We have specified all components at the class level, including their working mechanisms, interfaces with other components and input/output parameters. We have designed UML diagrams and a portal interface. As next steps, we intend to build a prototype of the system in Java. We have built a test bed comprising 12 computers from our HPC Center. We plan to deploy the system and carry out experiments to verify our approach.
Acknowledgements This research is conducted at High Performance Computing Center, Hanoi University of Technology, Hanoi, Vietnam. It is supported in part by VN-Grid, SAREC 90 RF2, and KHCB 20.10.06 projects.
References
[1]. Venugopal, S., R. Buyya, and R. Kotagiri, A Taxonomy of Data Grids for Distributed Data Sharing, Management and Processing. 2006.
[2]. Allcock, B., et al., Data management and transfer in high-performance computational grid environments, in Parallel Computing. 2002.
[3]. Traudt, E. and A. Konary, Software-as-a-Service Taxonomy and Research Guide. 2005.
[4]. Quan, D.M. and O. Kao, SLA negotiation protocol for Grid-based workflows. 2005.
[5]. Quan, D.M. and O. Kao, On Architecture for SLA-aware workflows in Grid environments. Journal of Interconnection Networks, World Scientific Publishing Company, 2005.
[6]. Bell, W.H., et al., Evaluation of an Economy-Based File Replication Strategy for a Data Grid, in Proceedings of the 3rd IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid 2003). 2003. Tokyo, Japan: IEEE CS Press.
[7]. GridASP Website. Available from: http://www.gridasp.org/wiki/.
[8]. GRASP Website. Available from: http://eugrasp.net/.
[9]. GRACE Website. Available from: http://www.buyya.com/ecogrid/.
[10]. Weishupl, T., F.D., E. Schikuta, H. Stockinger, and H. Wanek, Business In the Grid: The BIG Project, in GECON 2005, the 2nd International Workshop on Grid Economics and Business Models. 2005. Seoul.
[11]. GridEcon Website. Available from: http://gridecon.eu/html/deliverables.shtml.
[12]. Sun Grid Website. Available from: http://www.sun.com/service/sungrid/.
[13]. Amazon Web Services. Available from: http://www.amazon.com/.
[14]. Falcon, F., GRID - A Telco perspective: The BT Grid Strategy, in the 2nd International Workshop on Grid Economics and Business Models. 2005: Seoul.
[15]. WebEx Connect: First SaaS Platform to Deliver Mashup Business Applications for Knowledge Workers. 2007. Available from: http://www.webex.com/pr/pr428.html.
[16]. Biomedical Informatics Research Network. Available from: http://www.nbirn.net/.
[17]. Southern California Earthquake Center. Available from: http://www.scec.org/.
[18]. Real-time Observatories, Applications, and Data management Network. Available from: http://roadnet.ucsd.edu/.
[19]. Stockinger, H., et al., Towards a Cost Model for Distributed and Replicated Data Stores, in Proceedings of the 9th Euromicro Workshop on Parallel and Distributed Processing (PDP 2001). 2001. Italy.
[20]. Heiser, G., F.L., and S. Russell, Resource Management in the Mungi Single-Address-Space Operating System, in Proceedings of the Australasian Computer Science Conference. 1998. Perth, Australia.
[21]. Buyya, R., et al., Economic models for resource management and scheduling in Grid computing. 2002.
[22]. Buyya, R., D. Abramson, and S. Venugopal, The grid economy. 2005.
[23]. Chapin, S., J. Karpovich, and A. Grimshaw, The Legion resource management system, in the 5th Workshop on Job Scheduling Strategies for Parallel Processing. 1999: San Juan, Puerto Rico.
[24]. Litzkow, M., M. Livny, and M. Mutka, Condor - A hunter of idle workstations, in the 8th Int. Conf. on Distributed Computing Systems (ICDCS 1988). 1988: San Jose, CA.
[25]. Berman, F. and R. Wolski, The AppLeS project: A status report, in the 8th NEC Research Symposium. 1997: Berlin, Germany.
[26]. Casanova, H., et al., The AppLeS parameter sweep template: User-level middleware for the grid, in the IEEE Supercomputing Conf. (SC2000). 2000: Dallas, TX.
[27]. Casanova, H. and J. Dongarra, NetSolve: A network server for solving computational science problems. Int. J. Supercomputing Applications and High Performance Computing, 1997.
[28]. Kapadia, N. and J. Fortes, PUNCH: An architecture for web-enabled wide-area network-computing. Cluster Computing, 1999.
Grid-enabling complex system applications with QosCosGrid: An architectural perspective
Valentin Kravtsov1, David Carmeli1, Werner Dubitzky2, Krzysztof Kurowski3,4, and Assaf Schuster1
1 Technion - Israel Institute of Technology, Technion City, Haifa, Israel
2 University of Ulster, Coleraine, Northern Ireland, UK
3 Poznan Supercomputing and Networking Center, Poznan, Poland
4 University of Queensland, St. Lucia, Brisbane, Australia
Abstract - Grids are becoming mission-critical components in research and industry, offering sophisticated solutions for leveraging large-scale computing and storage resources. Grid resources are usually shared among multiple organizations in an opportunistic manner. However, an opportunistic or "best effort" quality-of-service scheme may be inadequate in situations where a large number of resources need to be allocated and for applications that rely on static, stable execution environments. The goal of this work is to implement what we refer to as quasi-opportunistic supercomputing. A quasi-opportunistic supercomputer facilitates demanding parallel computing applications on the basis of massive, non-dedicated resources in grid computing environments. Within the EU-supported project QosCosGrid we are developing a quasi-opportunistic supercomputer. In this work we present the results obtained from studying and identifying the requirements a grid needs to meet in order to facilitate quasi-opportunistic supercomputing. Based on these requirements we have designed an architecture for a quasi-opportunistic supercomputer. The paper presents and discusses this architecture.
Keywords: Grid, Quasi-Opportunistic Supercomputing.
1 Introduction
Supercomputers are dedicated, special-purpose multiprocessor computing systems that provide close-to-best achievable performance for demanding parallel workloads [12]. Supercomputers possess a set of characteristics that enable them to process such workloads efficiently. First, all the high-end hardware components, such as CPUs, memory, interconnects and storage devices are characterized not only by considerable capacity levels but also by a high degree of reliability and performance predictability. Second, supercomputer middleware provides a convenient abstraction of a homogeneous computational and networking environment, automatically allocating resources according to the underlying networking topology [2]. Third, the resources of a conventional supercomputer are managed exclusively by a single centralized system. This enforces global resource
utilization policies, thus maximizing hardware utilization while minimizing the turnaround time of individual applications. Together, these features give supercomputers their unprecedented performance, stability and dependability characteristics. The vision of grids becoming powerful virtual supercomputers can be attained only if their performance and reliability limitations can be overcome. Due to the considerable differences between grids and supercomputers, the realization of this vision poses considerable challenges. Some of the main challenges are briefly discussed below. The co-allocation of a large number of participating CPUs. In conventional supercomputers, where all CPUs are exclusively controlled by a centralized resource management system, the simultaneous allocation (co-allocation) and invocation (co-invocation) of processing units is handled by suitable co-allocation and co-invocation components [10]. In grid systems, however, the inherently distributed management, coupled with the non-dedicated (opportunistic) nature of the underlying resources, makes co-allocation very hard to accomplish. Previous research has focused on co-allocation in grids of supercomputers and dedicated clusters [17], [15], [3]; otherwise, the co-allocation problem has received little attention in the high-performance grid computing community. While co-allocation issues arise in other situations (e.g., co-allocation of processors and memory, co-allocation of CPUs and networks, setup of reservations along network paths), the dynamic, non-dedicated nature of grids presents special challenges [6]. The potential for failure and the heterogeneous nature of the underlying resource pool are examples of such special challenges. Synchronous communications. Typically, synchronous communications form a specific communication topology pattern (e.g., stencil exchange in MM5 [15][13] and local structures in complex systems [7]).
This requirement is satisfied by supercomputers via special-purpose, low-latency, high-throughput hardware, as well as by optimized allocation by the resource management system, which ensures that the underlying networking topology matches the application's communication pattern [2]. In grids, however, synchronous communication over a wide area network (WAN) is slow, and topology-aware allocation is typically not available despite existing support in communication libraries [8].
Allocation of resources does not change during runtime. While this always holds in supercomputers, the requirement is difficult to satisfy in grids, where the low reliability of resources and WANs, as well as the uncoordinated management of different parts of the grid, contribute to extreme fluctuations in the number of available resources. Fault tolerance. In large-scale synchronous computations, the high sensitivity of individual processes to failures usually leads to termination of the entire parallel run when such failures occur. While rare in supercomputers because of the high specifications of their hardware, system and component failures are very common in grid systems. We define a quasi-opportunistic supercomputer as a grid system that addresses the challenges mentioned above while still hiding many of the grid-related complexities from applications and users. In this paper we present some of the early results of the QosCosGrid project1, which is aimed at developing a quasi-opportunistic supercomputer. The main contributions of this paper are that we:
• Introduce and motivate the concept of a quasi-opportunistic supercomputer.
• Summarize the main requirements of a quasi-opportunistic supercomputer.
• Present a detailed system architecture designed for the QosCosGrid quasi-opportunistic supercomputer.
2 Requirements of a quasi-opportunistic supercomputer
Many real-world systems involve large numbers of highly interconnected heterogeneous elements. Such structures, known as complex systems, typically exhibit non-linear behavior and emergence [4]. The methodologies used to understand the properties of complex systems involve modeling and simulation, and often require considerable computational resources that only supercomputers can deliver. However, many organizations wanting to model and simulate complex systems lack the resources to deploy or maintain such a computing capability. This was the motivation that prompted the initiation of the QosCosGrid project, whose aim is to develop core grid technology capable of providing quasi-opportunistic supercomputing grid services and technology. Modeling and simulation of complex systems provide a huge range of applications requiring supercomputer or supercomputer-like capabilities. The requirements derived from the analysis of nine diverse complex systems applications are summarized below.
Co-allocation. Complex systems simulations require simultaneous execution of code on very high numbers of CPUs. In this context, co-allocation also means that resources for a certain task are allocated in advance. Those resources must be negotiated in advance and guaranteed to be available when the task's time slot arrives. This implies the need for a sophisticated distributed negotiation protocol supported by advance reservation mechanisms.
Topology-aware resource management. A complex system simulation is usually composed of multiple agents performing tasks of different complexity. The agents are arranged in a dynamic topology with different patterns of communication. To execute such a simulation, appropriate computational and network resources must be allocated. To perform this task, resource co-allocation algorithms must consider the topological structure of resource requests and offers and match these appropriately.
Economics-based grid. In "best-effort" grids, local cluster administrators are likely to increase the priorities of local users, possibly disallowing remote jobs completely and thus disassembling the grid back into individual clusters. This is because administrators lack suitable incentives to share the resources. We believe that resource co-allocation must be backed up by an economic model that motivates resource providers to honor their guarantees to the grid user and forces the user to carefully weigh the cost of resource utilization. This model is also intended to address the "free-rider" problem [1].
Service-level agreements. The economic model should be supported by formal agreements whose fulfillment can later be confirmed. Thus, an expressive language is required to describe such agreements, along with monitoring, accounting and auditing systems that can understand such a language.
Cross-domain fault-tolerant MPI and Java RMI communication. The majority of current fault-tolerant MPI and Java RMI implementations provide transparent fault tolerance mechanisms for clusters. However, to provide a reliable connection within a grid, a cross-domain, fault-tolerant and grid-middleware-aware communication library is needed.
Distributed checkpoints. In grids, node and network failures will inevitably occur. However, to ensure that an entire application is not aborted after a single failure, distributed checkpoint and restart protocols must be used to stop and migrate the whole application or part of it.
Scalability, extensibility, ease of use. In order to be widely accepted by the complex systems community, the QosCosGrid system must offer easy interfaces yet still allow extensions and further development. Additionally, for real-world grids, the system must be scalable in terms of computing resources and supported users.
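The combination of the co-allocation and topology-aware requirements can be illustrated with a greatly simplified matcher: process groups are placed on clusters with sufficient free CPUs, subject to a bandwidth floor between every pair of clusters used. This model (first-fit-by-capacity, a flat link table) is our own simplification, not the QosCosGrid algorithm.

```python
# Toy topology-aware co-allocator: place each process group on a
# cluster with enough free CPUs, requiring every inter-cluster link in
# use to meet a minimum bandwidth. Data layout is an assumption.

def co_allocate(groups, free_cpus, links, min_link_gbps):
    """groups: CPUs needed per group; free_cpus: {cluster: free CPUs};
    links: {frozenset({a, b}): gbps} inter-cluster bandwidths."""
    free = dict(free_cpus)
    placement = {}
    for gid, cpus in enumerate(groups):
        used = set(placement.values())
        ok = [c for c, f in free.items()
              if f >= cpus
              # unknown links default to 0 Gbps and therefore fail the floor
              and all(links.get(frozenset((c, u)), 0) >= min_link_gbps
                      for u in used if u != c)]
        if not ok:
            return None                        # co-allocation fails
        best = max(ok, key=lambda c: free[c])  # prefer the emptiest cluster
        placement[gid] = best
        free[best] -= cpus
    return placement
```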
www.QosCosGrid.com
Interoperability. The system must facilitate seamless interoperation and sharing of computing and data resources.
Standardization. To facilitate interoperation and evolution of the QosCosGrid system, the design and implementation should be based on existing and emerging grid and grid-related standards and open technology.
3 QosCosGrid System Architecture
The working hypothesis in the QosCosGrid project is that a quasi-opportunistic supercomputer (as characterized in Section 1) can be built by means of a collaborative grid which facilitates sophisticated resource sharing across multiple administrative domains (ADs). Loosely following Krauter [16], a collaborative grid consists of several organizations participating in a virtual organization (VO) and sharing resources. Each organization contributes its resources for the benefit of the entire VO, while controlling its own administrative domain and its own resource allocation/sharing policies. The organizations agree to connect their resource pools to a trusted "grid-level" middleware which tries to achieve optimal resource utilization. In exchange for this agreement, partners gain access to very large amounts of computational resources. The QosCosGrid architecture is depicted in Figure 1. The diagram depicts a simplified scenario involving three administrative domains (labeled Administrative Domain 1, 2 and 3). Administrative Domain 3 consists of two resource pools, each of which is connected to an AD-level service, located in the center of the diagram. The AD-level services of all the administrative domains are connected to a trusted, distributed grid-level service. The grid-level services are designed to maximize the global welfare of the users in the entire grid.
Figure 1: QosCosGrid system architecture.
3.1 Grid fabric
The basic physical layer consists of computing and storage nodes connected in the form of a computing cluster. The computing cluster is managed by a local resource management system (LRMS) – in our case the Platform Load Sharing Facility (LSF) – but this may be replaced by other advanced job scheduling systems such as SGE or PBS Pro. The LSF cluster runs batch or interactive jobs, selecting execution nodes based on current load conditions and the resource requirements of the application. The current load of CPUs, network connections, and other monitoring statistics are collected by the cluster and network monitoring system, which is tightly integrated with the LRMS and external middleware monitoring services. In order to execute cluster-to-cluster parallel applications, the QosCosGrid system supports advance reservation of computing resources, in addition to basic job execution and monitoring features. Advance reservation is critical to the QosCosGrid system, as it enables the QosCosGrid middleware to deliver resources on demand with significantly improved quality of service.
3.1.1 FedStage Open DRMAA Service Provider and advance reservation APIs
A key component of the QosCosGrid architecture is an LRMS offering job submission, monitoring, and advance reservation features. However, for years LRMSs provided either only proprietary script-based interfaces for application integration or nothing at all, in which case the command-line interface was used. Consequently, no standard mechanisms existed for programmers to integrate grid middleware services and applications with local resource management systems. Thanks to the Open Grid Forum and its Distributed Resource Management Application API (DRMAA) working group [19], the DRMAA 1.0 specification has recently been released. It offers a standardized API for application integration, with C, Java, and Perl bindings. Today, DRMAA implementations that adopt the latest specification version are available for many local resource management systems, including SGE, Condor, PBS Pro, and Platform LSF, as well as for other systems such as GridWay or XGrid. In QosCosGrid we have successfully used FedStage2 DRMAA for LSF and integrated those APIs with the Open DRMAA Service Provider (OpenDSP). OpenDSP is an open implementation of SOAP Web Service multi-user access and policy-based job control using the DRMAA routines implemented by the LRMS. As a lightweight and highly efficient software component, OpenDSP allows easy and fast remote access to computing resources. Moreover, as it is based on standard Web Services technology, it integrates well with higher-level grid middleware services. It uses a request-response communication protocol with standard JSDL XML and SOAP schemas, protected by
http://www.fedstage.com/wiki/
transport-level security mechanisms such as SSL/TLS or GSI. However, neither DRMAA nor OpenDSP provides the standard advance reservation and resource synchronization APIs required by cross-domain parallel applications. Therefore, in the QosCosGrid project, we have extended DRMAA and proposed standard advance reservation APIs that fit the various APIs of the underlying local resource management systems.
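To make the idea of an advance-reservation extension concrete, the sketch below shows what a minimal reservation session might look like. None of these names come from the DRMAA specification or the actual QosCosGrid APIs; the class, methods and fields are hypothetical.

```python
# Hypothetical sketch of an advance-reservation API layered on a
# DRMAA-like session: reserve CPUs for a future time window, then
# check whether a reservation is active at a given moment.

import time

class ReservationSession:
    def __init__(self):
        self._reservations = {}
        self._next_id = 0

    def reserve(self, cpus, start_epoch_s, duration_s):
        rid = f"ar-{self._next_id}"        # opaque reservation handle
        self._next_id += 1
        self._reservations[rid] = {"cpus": cpus,
                                   "start": start_epoch_s,
                                   "end": start_epoch_s + duration_s}
        return rid

    def is_active(self, rid, now=None):
        r = self._reservations[rid]
        now = time.time() if now is None else now
        return r["start"] <= now < r["end"]
```

A real implementation would, of course, negotiate the window with the LRMS scheduler rather than record it locally.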
3.1.2 QosCosGrid parallel cross-domain execution environments
The QosCosGrid Open MPI (QCG-OMPI) is an implementation of the Message Passing Interface that enables users to transparently deploy and use MPI applications in the QosCosGrid testbed, and to take advantage of local interconnect technologies. QCG-OMPI supports all the standard high-speed network technologies that Open MPI supports, including TCP/Ethernet, shared memory, Myrinet/GM, Myrinet/MX, Infiniband/OpenIB, and Infiniband/mVAPI. In addition, it supports inter-cluster communication using relay techniques in a manner transparent to users, and it can be integrated with the LSF cluster. QCG-OMPI incorporates a checkpointing interface that provides a coordinated checkpointing mechanism on demand. To the best of our knowledge, no other MPI solution provides a fault-tolerant mechanism in a transparent grid deployment environment. Our intention is that the QCG-OMPI implementation will be fully compliant with the MPI 1.2 specification from the MPI Forum. QosCosGrid ProActive (QCG-ProActive) is a Java-based grid middleware for parallel, distributed and multi-threaded computing, integrated with OpenDSP. It is based on ProActive, which provides a comprehensive framework and programming model to simplify the programming and execution of parallel applications. ProActive uses by default the Java RMI standard library as a portable communication layer, supporting the following communication protocols: RMI, HTTP, Jini, RMI/SSH, and Ibis [5].
3.2 Administrative domain- and grid-level services
Grid fabric software components, in particular OpenDSP, QCG-OMPI and QCG-ProActive, must be deployed on physical computing resources at each administrative domain and be integrated with the AD-level services. AD-level services, in turn, are connected to the Grid-level services in order to share and receive information about the entire grid as well as for tasks that cannot be performed within a single administrative domain. We distinguish five main high-level types of services:
3.2.1 Grid Resource Management System
The Grid Resource Management System (GRMS) is a grid meta-scheduling framework which allows developers to build and deploy resource management systems for large-scale distributed computing infrastructures at both the administrative domain and grid levels. The core GRMS broker module has been improved in QosCosGrid to provide dynamic resource selection and mapping, along with advance resource reservation mechanisms. As a core service for all resource management processes in the QosCosGrid system, the GRMS supports load balancing among LRMSs, workflow execution, remote job control, file staging and advance resource reservation. At the administrative domain level, the GRMS communicates with OpenDSP services to expose remote access to the underlying computing resources controlled by the LRMS. The administrative domain-level GRMS is synchronized with the grid-level GRMS during the job submission, job scheduling and execution processes. At the grid level, the GRMS offers much more advanced co-allocation and topology-aware scheduling mechanisms. From the user's perspective, all parallel applications and their requirements (including complex resource topologies) are formally expressed in an XML-based job specification language called the QCG Job Profile. These job requirements, together with resource offers, are provided to the GRMS during the scheduling and decision-making processes.
3.2.2 Accounting and economic services
Accounting services support the needs of users and organizations with regard to allocated budgets, credit transactions, auditing, etc. These services are responsible for: (a) Monitoring: capturing resource usage records across the administrative domains, according to predefined metrics; (b) Usage record storage: aggregation and storage of the usage records gathered by the monitoring system; (c) Billing: assigning a cost to operations and charging the user, taking into account the quality of service actually received by the user; (d) Credit transactions: the ability to transfer credits from one administrative domain to another as means of payment for received services and resources; (e) VO management: definition of user groups, authorized users, policies and priorities; (f) Accounting: management of user groups' credit accounts, tracking budgets, economic obligations, etc.; (g) Budget planning: cost estimation for aggregations of resources according to the pricing model; and (h) Usage analysis: analysis of the provided quality of service using information from usage records, and comparison of this information with the guaranteed quality of service.
³ http://www.open-mpi.org/
⁴ http://www.mpi-forum.org/
3.2.3 Resource Topology Information Service (RTIS)
The RTIS provides information on the resource topology and availability. Information is provided by means
of the Resource Topology Graph (RTG) schema, instances of which depict the properties of the resources and their interconnections. For a simpler description process, the RTG does not contain a "point-to-point" representation of the desired connections but is instead based on the concept of communication groups, which is quite similar to the MPI communicator definition. The main goals of the RTIS are to let topology-aware services discover a picture of the grid resources and to disclose information about those resources on a "need-to-know" basis.
The SLA describes the service time interval, and the provided QoS – resources, topology, communication, and mapping of user processes to provider's resources. SLAs are represented using the RTG model, and are stored in RTIS. The SLA Compliance Monitor analyzes the provided quality of service for each time slot, and calculates a weighted compliance factor for the whole execution. The compliance factor is used by the pricing service (which is a part of accounting services) to calculate the costs associated with the service if it is provided successfully, or the penalties that arise when a guarantee is violated.
3.2.4 Grid Authorization System
Currently, the most common solution for mutual authentication and authorization of grid users and services is the Grid Security Infrastructure (GSI). The GSI is part of the Globus Toolkit and provides the fundamental security services needed to support grids [9]. In many GSI-based grid environments, the user's identity is mapped to a corresponding local user identity, and further authorization depends on the internal LRMS mechanisms. The authorization process is relatively simple and static. Moreover, it requires that the administrator manually modify the appropriate user mappings in the gridmap file every time a new user appears or has to be removed. If there are many users in many administrative domains whose access must be controlled dynamically, the maintenance and synchronization of the various gridmap files becomes an important administrative issue. We believe that more advanced mechanisms for authorization control are required to support dynamic changes in security policy definition and enforcement over a large number of middleware services. Therefore, in QosCosGrid we have adopted the Grid Authorization Service (GAS), an authorization system integrated with various grid middleware such as the Globus Toolkit, GRMS and OpenDSP. GAS offers dynamic fine-grained access control and enforcement for shared computing services and resources. Taking advantage of the strong authentication mechanisms implemented in PKI and GSI, it provides crucial security mechanisms in the QosCosGrid system. From the QosCosGrid architecture perspective, the GAS can also be treated as a trusted single logical point for defining security policies.
3.2.5 Service-Level Agreement Management System
In order to enforce the rules of the economic system, we employ a service-level agreement (SLA) protocol [18]. An SLA defines a dynamically established and managed relationship between resource providers and resource consumers. Both parties are committed to the negotiated terms. These commitments are backed up by organization-wide policies, incentives, and penalties to encourage each party to fulfill its obligations. For each scheduled task, a set of SLAs is signed by the administrative domain of the task owner and by each of the provider administrative domains.
4 Related Work
One of the largest European grid projects, Enabling Grids for E-SciencE (EGEE) [11], has developed a grid system that facilitates the execution of scientific applications within a production-level grid environment. To the best of our knowledge, EGEE does not support advance reservation or checkpoint and restart protocols, and cannot guarantee the desired level of quality of service for long executions. One of the major drawbacks of the EGEE system stems from the presence of a large number of small misconfigured sites. This results in considerable delays. To some extent, this problem is caused by the sheer scale of the system, but also by the lack of an appropriate incentive for the participating administrative domain administrators. The HPC4U [14] project is arguably closest to the objectives of QosCosGrid. Its goal is to expand the potential of the grid approach to complex problem solving. This is envisaged to be done through the development of software components for dependable and reliable grid environments, combined with service-level agreements and commodity-based clusters providing quality of service. The QosCosGrid project differs from HPC4U mainly in its "grid orientation". QosCosGrid assumes multi-domain, parallel executions (in contrast to within-cluster parallel execution in HPC4U) and applies different MPI and checkpoint/restart protocols that are grid-oriented and highly scalable. VIOLA (Vertically Integrated Optical Testbed for Large Applications in DFN) [20] is a German national project intended for the execution of large-scale scientific applications. The project emphasizes the provision of high quality of service for execution node interconnection. VIOLA uses the UNICORE grid middleware as an implementation of an operational environment and offers a newly developed meta-scheduler component which supports co-allocation on optical networks.
The main goals of the VIOLA project include the testing of advanced network equipment and architectures, the development and testing of software tools for user-driven dynamic provision of bandwidth, the interworking of network equipment from different manufacturers, and the enhancement and testing of new advanced applications. The VIOLA project is mainly targeted at supporting a new generation of networks, which provide many advanced features that are not present in the majority of current clusters, let alone at the Internet level.
5 Conclusions
The main objective of the QosCosGrid project is to address some of the key technical challenges in enabling the development and execution of large-scale parallel experiments across multiple administrative, resource and security domains. In this paper we presented the main requirements and the important software components that make up a consistent software architecture. This high-level architectural perspective is intended to give readers the opportunity to understand the design concept without the need to know the many intricate technical details related in particular to Web services, WSRF technologies, remote protocol design, and security in distributed systems. The QosCosGrid architecture is designed to address key quality-of-service, negotiation, synchronization, advance reservation and access control issues by providing well-integrated grid fabric, administrative domain-level and grid-level services. In contrast to existing grid middleware architectures, all QosCosGrid system APIs have been created on top of carefully selected third-party services that needed to meet the following requirements: open standards, open source, high performance, and security and trust. Moreover, the QosCosGrid pluggable architecture has been designed to enable the easy integration of parallel development tools, and it supports fault-tolerant cluster-to-cluster message passing communicators and libraries that are well known in high-performance computing domains, such as Open MPI, ProActive and Java RMI. Finally, we believe that without extra support from administrative tools, it would be difficult to deploy, control, and maintain such a large system. Therefore, as a proof of concept, we have developed various client tools to help administrators connect sites from Europe, Australia and the USA. After the initial design and deployment stage, we have begun carrying out many performance tests of the QosCosGrid system and cross-domain application experiments.
Collected results and analysis will be taken into account during the next phase of system re-design and deployment.
Acknowledgments. The work described in this paper was supported by the EC grant QosCosGrid IST FP6 033883.
6 References
[1] N. Andrade, F. Brasileiro, W. Cirne, and M. Mowbray. "Discouraging free riding in a peer-to-peer CPU-sharing grid". In Proceedings of the 13th IEEE International Symposium on High Performance Distributed Computing (HPDC '04), 2004.
[2] Y. Aridor, T. Domany, O. Goldshmidt, J. E. Moreira, and E. Shmueli. "Resource allocation and utilization in the Blue Gene/L supercomputer". IBM Journal of Research and Development, 49(2-3):425-436, 2005.
[3] F. Azzedin, M. Maheswaran, and N. Arnason. "A synchronous co-allocation mechanism for grid computing systems". Cluster Computing, 7(1):39-49, 2004.
[4] Complex systems. Science, 284(5411), 1999.
[5] J. Cunha and O. Rana. Grid Computing: Software Environments and Tools, chapter 9, "Programming, Composing, Deploying for the Grid". Springer Verlag, 2006.
[6] K. Czajkowski, I. Foster, and C. Kesselman. "Resource co-allocation in computational grids". In Proceedings of the 8th IEEE International Symposium on High Performance Distributed Computing (HPDC '99), page 37, 1999.
[7] R. Bagrodia et al. "Parsec: A parallel simulation environment for complex systems". Computer, 31(10):77-85, 1998.
[8] I. Foster and N. T. Karonis. "A grid-enabled MPI: Message passing in heterogeneous distributed computing systems". IEEE/ACM Conference on Supercomputing, pages 46-46, Nov. 1998.
[9] I. Foster and C. Kesselman. The Grid: Blueprint for a New Computing Infrastructure, chapter 2, "The Globus Toolkit". Morgan Kaufmann, 1999.
[10] E. Frachtenberg, F. Petrini, J. Fernandez, S. Pakin, and S. Coll. "STORM: Lightning-fast resource management". In Proceedings of the 2002 ACM/IEEE Conference on Supercomputing, pages 1-26, USA, 2002. IEEE Computer Society Press.
[11] F. Gagliardi, B. Jones, F. Grey, M.-E. Bégin, and M. Heikkurinen. "Building an infrastructure for scientific grid computing: Status and goals of the EGEE project". Philosophical Transactions A of the Royal Society: Mathematical, Physical and Engineering Sciences, 363(833):1729-1742, 2005.
[12] S. L. Graham, C. A. Patterson, and M. Snir. Getting Up to Speed: The Future of Supercomputing. National Academies Press, USA, 2005.
[13] G. A. Grell, J. Dudhia, and D. R. Stauffer. "A Description of the Fifth Generation Penn State/NCAR Mesoscale Model (MM5)". NCAR, USA, 1995.
[14] M. Hovestadt. "Operation of an SLA-aware grid fabric". IEEE Trans. Neural Networks, 2(6):550-557, 2006.
[15] Y. S. Kee, K. Yocum, A. A. Chien, and H. Casanova. "Improving grid resource allocation via integrated selection and binding". In Proceedings of the 2006 ACM/IEEE Conference on Supercomputing (SC '06), page 99, New York, USA, 2006. ACM.
[16] K. Krauter, R. Buyya, and M. Maheswaran. "A taxonomy and survey of grid resource management systems for distributed computing". Software - Practice and Experience, 32:135-164, 2002.
[17] D. Kuo and M. Mckeown. "Advance reservation and co-allocation protocol for grid computing". In Proceedings of the 1st Intl. Conf. on e-Science and Grid Computing, pages 164-171, Washington, DC, USA, 2005. IEEE Computer Society.
[18] H. Ludwig, A. Keller, A. Dan, and R. King. "A service level agreement language for dynamic electronic services". WECWIS, page 25, 2002.
[19] P. Troeger, H. Rajic, A. Haas, and P. Domagalski. "Standardization of an API for distributed resource management systems". In Seventh IEEE International Symposium on Cluster Computing and the Grid, pages 619-626, 2007.
[20] VIOLA - Vertically Integrated Optical Testbed for Large Applications in DFN, 2005. http://www.viola-testbed.de/.
The Scientific Byte Code Virtual Machine
Rasmus Andersen and Brian Vinter
University of Copenhagen, eScience Centre, 2100 Copenhagen, Denmark
Email: [email protected], [email protected]
Abstract—Virtual machines constitute an appealing technology for Grid Computing and have proved to be a promising mechanism that greatly simplifies and secures the use of grid computing resources. While existing sandbox technologies to some extent provide secure execution environments for applications deployed on a heterogeneous platform such as the Grid, they suffer from a number of problems, including performance drawbacks and specific hardware requirements. This project introduces a virtual machine capable of executing platform-independent byte codes specifically designed for scientific applications. Native libraries for the most prevalent application domains mitigate the performance penalty. As such, grid users can view this machine as a basic grid computing element and thereby abstract away the diversity of the underlying real compute elements. Regarding security, which is of great concern to resource owners, important aspects include stack isolation by using a Harvard memory architecture, and no support for either I/O or system calls to the host system. Keywords: Grid Computing, virtual machines, scientific applications.
I. INTRODUCTION
Although virtualization was first introduced several decades ago, the concept is now more popular than ever and has been revived in a multitude of computer system aspects that benefit from properties such as platform independence and increased security. One of those applications is Grid Computing [5], which seeks to combine and utilize distributed, heterogeneous resources as one big virtual supercomputer. Regarding utilization of the public's computer resources for grid computing, virtualization, in the sense of virtual machines, is a necessity for fully leveraging the true potential of grid computing. Without virtual machines, experience shows that people are, with good reason, reluctant to put their resources on a grid where they have to not only install and manage a software code base, but also allow native execution of unknown and untrusted programs. All these issues can be eliminated by introducing virtual machines. As mentioned, virtualization is by no means a new concept. Many virtual machines exist, and many of them have been combined with grid computing. However, most of these were designed for other purposes and suffer from a few problems when it comes to running high performance scientific applications on a heterogeneous computing platform. Grid computing is tightly bound to eScience, and while standard jobs may run perfectly satisfactorily in existing virtual
Fig. 1. Relationship between VMs, the Grid, and eScience
machines, ’gridified’ eScience jobs are better suited for a dedicated virtual machine in terms of performance. Hence, our approach addresses these problems by developing a portable virtual machine specifically designed for scientific applications: The Scientific Byte Code Virtual Machine (SciBy VM). The machine implements a virtual CPU capable of executing platform independent byte codes corresponding to a very large instruction set. An important feature to achieve performance is the use of optimized native libraries for the most prevalent algorithms in scientific applications. Security is obviously very important for resource owners. To this end, virtualization provides the necessary isolation from the host system, and several aspects that have made other virtual machines vulnerable have been left out. For instance, the SciBy VM supports neither system calls nor I/O. The following section (II) motivates the usage of virtual machines in a grid computing context and why they are beneficial for scientific applications. Next, we describe the architecture of the SciBy VM in Section III, the compiler in Section IV, related work in Section VI and conclusions in Section VII. II. M OTIVATION The main building blocks in this project arise from properties from virtual machines, eScience, and a grid environment in a combined effort, as shown in figure 1. The individual interactions impose several effects from the viewpoint of each end, described next. A. eScience in a Grid Computing Context eScience, modelling computationally intensive scientific problems using distributed computer networks, has driven the development of grid technology and as the simulations get
more and more accurate, the amount of data and the needed compute power increase correspondingly. Many research projects have already made the transition to grid platforms to accommodate the immense requirements for data and computational processing. Using this technology, researchers gain access to many networked computers at the cost of a highly heterogeneous computing platform. Obviously, maintaining application versions for each resource type is tedious and troublesome, and results in a deploy-port-redeploy cycle. Further, different hardware and software setups on computational resources complicate application development drastically. One never knows to which resource a job is submitted in a grid, and while it is possible to assist each job with a detailed list of hardware and software requirements, researchers are better off with a virtual workspace environment that abstracts a real execution environment. Hence, a virtual execution environment spanning the heterogeneous resource platform is essential in order to fully leverage the grid potential. From the view of applications, this would render resource access uniform and thus enable the much easier "compile once run anywhere" strategy; researchers can write their applications, compile them for the virtual machine and have them executed anywhere in the Grid.
Virtual machines bridge the architectural boundaries of computational elements in a grid by raising the level of abstraction of a computer system, thus providing a uniform way for applications to interact with the system. Given a common virtual workspace environment, grid users are provided with a compile-once-run-anywhere solution. Furthermore, a running virtual machine is not tied to a specific physical resource; it can be suspended, migrated to another resource and resumed from where it was suspended. 2) Host Security: To fully leverage the computational power of a grid platform, security is just as important as application portability. Today, most grid systems enforce security by means of user and resource authentication, a secure communication channel between them, and authorization in various forms. However, once access and authorization is granted, securing the host system from the application is left to the operating system. Ideally, rather than handling the problems after system damage has occurred, harmful - intentional or not - grid applications should not be able to compromise a grid resource in the first place.
Virtual machines provide stronger security mechanisms than conventional operating systems, in that a malicious process running in an instance of a virtual machine is only capable of destroying the environment in which it runs, i.e. the virtual machine. 3) Application Security: Conversely, other processes, whether local or running in other virtualized environments, should not be able to compromise the integrity of the processes in the virtual machine. System resources, for instance the CPU and memory, of a virtual machine are always mapped to underlying physical resources by the virtualization software. The real resources are then multiplexed between any number of virtualized systems, giving the impression to each of the systems that they have exclusive access to a dedicated physical resource. Thus, grid jobs running in a virtual machine are isolated from other grid jobs running simultaneously in other virtual machines on the same host, as well as from local users of the resources. 4) Resource Management and Control: Virtual machines enable increased flexibility for resource management and control in terms of resource usage and site administration. First of all, the middleware code necessary for interacting with the Grid can be incorporated in the virtual machine, thus relieving the resource owner from installing and managing the grid software. Secondly, a process's usage of physical resources such as memory, disk, and CPU is easily controlled with a virtual machine. 5) Performance: As a virtual machine architecture interposes a software layer between the traditional hardware and software layers, in which a possibly different instruction set is implemented and translated to the underlying native instruction set, performance is typically lost during the translation phase.
Despite recent advances in new virtualization and translation techniques, and the introduction of hardware-assisted capabilities, virtual machines usually introduce performance overhead, and the goal remains merely to approach near-native performance. The impact depends on system characteristics and the applications intended to run in the machine. To summarize, virtual machines are an appealing technology for Grid Computing because they resolve the conflict between the grid users at one end of the system and the resource providers at the other end. Grid users want exclusive access to as many resources as possible, as much control as possible, secure execution of their applications, and the freedom to use certain software and hardware setups. At the other end, introducing virtual machines on resources enables resource owners to service several users at once, to isolate each application execution from other users of the system and from the host system itself, to provide a uniform execution environment, and to incorporate managed code easily in the virtual machine.
C. A Scientific Virtual Machine for Grid Computing
Virtualization can occur at many levels of a computer system and take numerous forms. Generally, as shown in Figure 2, virtual machines are divided into two main categories:
System virtual machines and process virtual machines, each branched into finer divisions based on whether the host and guest instruction sets are the same or different. Virtual machines with the same instruction set as the hardware they virtualize do exist in multiple grid projects, as mentioned in Section VI. However, since full cross-platform portability is of major importance, we only consider emulating virtual machines, i.e. machines that execute another instruction set than the one executed by the underlying hardware.

Fig. 2. Virtual machine taxonomy. Similar to Figure 13 in [9]
System virtual machines allow a host hardware platform to support multiple complete guest operating systems, all controlled by a virtual machine monitor that acts as a layer between the hardware and the operating systems. Process virtual machines operate at a higher level in that they virtualize a given platform for user applications. A detailed description of virtual machines can be found in [9]. The overall problem with system virtual machines that emulate the hardware for an entire system, including applications as well as an operating system, is the performance loss incurred by converting all guest system operations to equivalent host system operations, and the implementation complexity of developing a machine for every platform type, each capable of emulating an entire hardware environment for essentially all types of software. Since the application domain in focus is scientific applications only, there is really no need for full-featured operating systems. As shown in Figure 3, process level virtual machines are simpler because they only execute individual processes, each interfaced to the hardware resources through a virtual instruction set and an Application Binary Interface.

Fig. 3. System VMs (left) and Process VMs (right)
Using the process level virtual machine approach, the virtual machine is designed in accordance with a software development framework. Developing a virtual machine for which there is no corresponding underlying real machine may sound counterintuitive, but this approach has proved successful in several cases, best demonstrated by the power and usefulness of the Java Virtual Machine. Tailored to the Java programming language, it has provided a platform independent computing environment for many application domains, yet there is no
commonly used real Java machine¹. Similar to Java, applications for the SciBy VM are compiled into a platform independent byte code which can be executed on any device equipped with the virtual machine. However, applications are not tied to a specific programming language. As noted earlier, researchers should not be forced to rewrite their applications in order to use the virtual machine. Hence, we produce a compiler based on a standard ANSI C compiler.
D. Enabling Limitations
While the outlined work at hand may seem comprehensive, especially the implementation burden of virtual machines for different architectures, there are some important limitations that greatly simplify the project. Firstly, the implementation burden is lessened drastically by only supporting the execution of a single sequential application. Supporting entire operating systems is much more complex in that one must support multiple users in a multi-process environment, and hardware resources such as networking, I/O, the graphics processor, and 'multimedia' components of currently used standard CPUs are also typically virtualized. Secondly, a virtual machine allows fine-grained control over the actions taken by the code running in the machine. As mentioned in Section VI, many projects use sandbox mechanisms in which they by various means check all system instructions. The much simpler approach taken in this project is to simply disallow system calls. The rationale for this decision is that:
• scientific applications perform basic calculations only
• using a remote file access library, only files from the grid can be accessed
• all other kinds of I/O are not necessary for scientific applications and thus prohibited
• indispensable system calls must be routed to the grid
III. ARCHITECTURAL OVERVIEW
The SciBy Virtual Machine is an abstract machine executing platform independent byte codes on a virtual CPU, either by translation to native machine code or by interpretation.
However, in many aspects it is designed similarly to conventional architectures; it includes an Application Binary Interface, an Instruction Set Architecture, and is able to manipulate memory components. The only thing missing in defining the architecture is the hardware. As the VM is supposed to be run on a variety of grid resources, it must be designed to be as portable as possible, thereby supporting many different physical hardware architectures. Based on the previous sections, the SciBy VM is designed to have 3 fundamental properties:
• Security
• Portability
• Performance

¹ The Java VM has been implemented in hardware in the Sun PicoJava chips.

That said, all architectural decisions presented in the following sections rely solely on providing portability. Security is
obtained by isolation through virtualization, and performance is obtained solely by the use of optimized native libraries for the intended applications and by taking advantage of the fact that scientific applications spend most of their time in these libraries. The byte code is as such not designed for performance. Therefore, the architectural decisions do not necessarily seek to minimize code density, minimize code size, reduce memory traffic, reduce the average number of clock cycles per instruction, or other architectural evaluation measurements, but aim more for simplicity and portability.
A. Application Binary Interface
The SciBy VM ABI defines how compiled applications interface with the virtual machine, thus enabling platform independent byte codes to be executed without modification on the virtual CPU. At the lowest level, the architecture defines the following machine types, arranged in big endian order:
• 8-bit byte
• 16-, 32-, or 64-bit halfword
• 32-, 64-, or 128-bit word
• 64-, 128-, or 256-bit doubleword
In order to support many different architectures, the machine exists in multiple variations with different word sizes. Currently, most desktop computers are either 32- or 64-bit architectures, and it probably won't be long before we see desktop computers with 128-bit architectures. By letting the word size be user-defined, we capture most existing and near-future computers. Fundamental primitive data types include, all in signed two's complement representation:
• 8-bit character
• integers (1 word)
• single-precision floating point (1 word)
• double-precision floating point (2 words)
• pointer (1 word)
The machine contains a register file of 16384 registers, all 1 word long. This number serves only to model a potentially unlimited number of registers. The reasons for this are twofold.
First, for forward compatibility: virtual register usage has to be translated to native register usage, and one cannot know the upper limit on native register numbers in advance, so a virtual CPU should be sure to have more registers than the host system CPU. Currently, 16384 registers should be more than enough, but new architectures tend to have more and more registers. Secondly, for the intended applications, the authors believe that a register-based architecture will outperform a stack-based one [8]. Generally, registers have proved more successful than other types of internal storage, and virtually every architecture designed in the last few decades uses a register architecture. Register computers exist in 3 classes depending on where ALU instructions can access their operands: register-register architectures, register-memory architectures and memory-memory architectures. The majority of the computers shipped
nowadays implement one of those classes in a 2- or 3-operand format. In order to capture as many computers as possible, the SciBy VM supports all of these variants in a 3-operand format, thereby including 2-operand format architectures in that the destination address is the same as one of the sources.
B. Instruction Set Architecture
One key element that separates the SciBy VM from conventional machines is the memory model: the machine defines a Harvard memory architecture with separate memory banks for data and instructions. The majority of conventional modern computers use a von Neumann architecture with a single memory segment for both instructions and data. These machines are generally more vulnerable to the well-known buffer overflow exploits and similar exploits derived from 'illegal' pointer arithmetic to executable memory segments. Furthermore, the machine will support hardware setups that have separate memory pathways, thus enabling simultaneous data and instruction fetches. All instructions are fetched from the instruction memory bank, which is inaccessible to applications: all memory accesses from applications are directed to the data segment. The data memory segment is partitioned into a global memory section, a heap section for dynamically allocated structures, and a stack for storing local variables and function parameters.
1) Instruction Format: The instruction format is based on byte codes to simplify the instruction stream. The format is as follows: each instruction starts with a one-byte operation code (opcode), followed by possibly more opcodes, and ends with zero or more operands, see Figure 4. In this sense, the machine is a multi-opcode multi-address machine. Having only a single one-byte opcode limits the instruction set to only 256 different instructions, whereas multiple opcodes allow for nested instructions, thus increasing the number of instructions exponentially. A multi-address design is chosen to support more types of hardware.
Fig. 4. Examples of various instruction formats on register operands (one, two, or three opcode bytes followed by register operands R1, R2, R3).
2) Addressing Modes: Based on the popularity of addressing modes found in recent computers, we have selected four addressing modes for the SciBy VM, all listed below.
• Immediate addressing: The operand is an immediate, for instance MOV R1 4, which moves the number 4 to register 1.
• Displacement addressing: The operand is an offset and a register pointing to a base address, for instance ADD
R1 R1 4(R2), which adds to R1 the value found 4 words from the address pointed to by R2.
• Register addressing: The operand is a register, for instance MOV R1 R2.
• Register indirect addressing: The address part is a register containing the address of an operand, for instance ADD R1, R1, (R2), which adds to R1 the value found at the address pointed to by R2.

3) Instruction Types: Since the machine defines a Harvard architecture, it is important to note that data movement is carried out by LOAD and STORE operations which operate on words in the data memory bank. PUSH and POP operations are available for accessing the stack. Table I summarizes the most basic instructions available in the SciBy VM. Almost all operations are simple 3-address operations with operands, and they are chosen to be simple enough to be directly matched by native hardware operations.
Instruction group | Mnemonics
Moves | load, store
Stack | push, pop
Arithmetic | add, sub, mul, div, mod
Boolean | and, or, xor, not
Bitwise | and, or, shl, shr, ror, rol
Compare | tst, cmp
Control | halt, nop, jmp, jsr, ret, br, br_eq, br_lt, etc.

TABLE I. Basic instruction set of the SciBy VM.

While these instructions are found in virtually every computer, they exist in many different variations using various addressing modes for each operand. To accommodate this and assist the compiler as much as possible, the SciBy VM provides regularity by making the instruction set orthogonal across operations, data types, and addressing modes. For instance, the 'add' operation exists in all 16 combinations of the 4 addressing modes on the two source registers, for both integers and floating points. Thus, the encoding of an 'add' instruction on two immediate source operands takes up 1 byte for choosing arithmetic, 1 byte to select the 'add' on two immediates, 2 bytes to address one of the 16384 registers as destination register, and then 16 bytes for each of the immediates, yielding a total instruction length of 36 bytes.

C. Libraries

In addition to the basic instruction set, the machine implements a number of basic libraries for standard operations like floating-point arithmetic and string manipulation. These are extensions to the virtual machine and are provided on a per-architecture basis as statically linked native libraries optimized for specific hardware. As explained above, virtual machines introduce a performance overhead in the translation phase from virtual machine object code to the native hardware instructions of the underlying real machine. The all-important observation here is that scientific applications spend most of their running time executing 'scientific instructions' such as string operations, linear algebra, fast Fourier transforms, or other library functions. Hence, by providing optimized native libraries, we can take advantage of the synergy between algorithms, the compiler translating them, and the hardware executing them. Equipping the machine with native libraries for the most prevalent scientific algorithms and enabling future support for new libraries increases the number of potential instructions drastically. To address this problem, multiple opcodes allow for nested instructions as shown in Figure 5. The basic instructions are accessible using only one opcode, whereas a floating-point operation is accessed using two opcodes, i.e. FP_lib FP_sub R1 R2 R3, and finally, if one wishes to use the WFTA instruction from the FFT_2 library, 3 opcodes are necessary: FFT_lib FFT_2 WFTA args.

Fig. 5. Native libraries as extension to the instruction set (basic instructions such as halt, push, pop, load, store, string_move and string_cmp alongside the library groups Str_lib, FP_lib — FP_add, FP_sub — and FFT_lib, the latter containing FFT_1, FFT_2 and FFT_3 with algorithms such as WFTA and PFA).
A special library is provided to enable file access. While most grid middlewares use a staging strategy that downloads all input files prior to the job execution and uploads output files afterwards, the MiG-RFA [1] library accesses files directly on the file server on an on-demand basis. Using this strategy, an application can start immediately, and only the needed fragments of the files it accesses are transferred. Ending the discussion of the architecture, it is important to re-emphasize that all focus in this part of the machine is on portability. For instance, when evaluating the architecture, one might find that:
• Having a 3-operand instruction format may give unnecessarily large code size in some circumstances
• Studies may show that the displacement addressing mode is typically used with nearby addresses, suggesting that these instructions only need a few bits for the operand
• Using register-register instructions may give unnecessarily high instruction count in some circumstances
• Using byte codes increases the code density
• Variable instruction encoding decreases performance
Designing an architecture involves many trade-offs, and even though many of these issues are rendered moot by the interpreter or translator, the proposed byte code is far from optimal by normal architecture metrics. However, the key point is that we target only a special class of applications on a very broad range of hardware platforms.
IV. COMPILATION AND TRANSLATION

While researchers do not need to rewrite their scientific applications for the SciBy VM, they do need to compile them with a SciBy VM compiler that translates the high-level language code into SciBy VM code. While developing a new compiler from scratch is of course a possibility, it is also a significant amount of work which may prove unprofitable, since many compilers designed to be retargetable to new architectures already exist. Generally, retargetable compilers are constructed using the same classical modular structure: a front end that parses the source file and builds an intermediate representation, typically in the shape of a parse tree used for machine-independent optimizations, and a back end that translates this parse tree to assembly code for the target machine. When choosing between open-source retargetable compilers, the set of possibilities quickly narrows down to only a few candidates: GCC and LCC. Despite the advantages of being the most popular and widely used compiler, with many supported source languages in the front end, GCC was primarily designed for 32-bit architectures, which greatly complicates the retargeting effort. LCC, however, is a lightweight compiler specifically designed to be easily retargetable to a new architecture. Once compiled, a byte code file containing assembly instruction mnemonics is ready for execution in the virtual machine, either by interpretation or by translation, where instructions are mapped to the instruction set of the host machine using either a load-time or run-time translation strategy.
Results remain to be seen, yet the authors believe that, in case a translator is preferable to an interpreter, the best solution would be a load-time translator, based on two observations about scientific applications:
• their total running time is fairly long, which means that the load-time penalty is easily amortized
• they contain a large number of tight loops, where run-time translation is guaranteed to be inferior to load-time translation

V. EXPERIMENTS

To test the proposed ideas, a prototype of the virtual machine has been developed, in the first stage as a simple interpreter implemented in C. There is no compiler yet, so all sample programs are hand-written in assembly code, with the only goal of producing preliminary results that show whether development of the complete machine is justified. The first test is a typical example of the scientific applications the machine targets: a Fast Fourier Transform (FFT). The program first computes 10 transforms on a vector of varying sizes, then checksums the transformed vector to verify the result. In order to test the performance of the virtual machine, the program is also implemented in C to get the native baseline performance, and in Java to compare the results of the SciBy VM with an existing, widely used virtual machine. The C and SciBy VM programs make use of the fftw library [6], while the Java version uses an FFT algorithm from
Vector size | Native | SciBy VM | Java
524288 | 1.535 | 1.483 | 7.444
1048576 | 3.284 | 3.273 | 19.174
2097152 | 6.561 | 6.656 | 41.757
4194304 | 14.249 | 14.398 | 93.960
8388608 | 29.209 | 29.309 | 204.589

TABLE II. Comparison of the performance of an FFT application on a 1.86 GHz Intel Pentium M processor, 2 MB cache, 512 MB RAM.
Vector size | Native | SciBy VM | Java
524288 | 0.879 | 0.874 | 4.867
1048576 | 1.857 | 1.884 | 10.739
2097152 | 3.307 | 3.253 | 23.520
4194304 | 6.318 | 6.354 | 50.751
8388608 | 13.045 | 12.837 | 110.323

TABLE III. Comparison of the performance of an FFT application on a dual-core 2.2 GHz AMD Athlon 4200 64-bit, 512 KB cache per core, 4 GB RAM.
the SciMark suite [7]. Admittedly, this test is biased against the Java version for several reasons. Firstly, the fftw library is well known to give the best performance, and comparing hand-coded assembly with compiler-generated high-level language performance is a common pitfall. However, even though Java wrappers for the fftw library exist, it is essential to put these comparisons in a grid context. If the grid resources were to run the scientific applications in a Java Virtual Machine, the programmers — the grid users — would not be able to take advantage of the native libraries, since allowing external library calls breaks the security of the JVM. Thereby, the isolation between the executing grid job and the host system is lost². In the proposed virtual machine, these libraries are an integrated part of the machine, and using them is perfectly safe. As shown in Table II, the FFT application is run on the 3 machines using different vector sizes, 2^19, ..., 2^23. The results show that the SciBy VM is on par with native execution, and that the Java version is clearly outperformed. Since the fftw library is multithreaded, we repeat the experiment on a dual-core machine and on a quad dual-core machine. The results are shown in Table III and Table IV.

² In fact there is a US Patent (#6862683) on a method to protect native libraries.

From these results it is clear that for this application there
Vector size | Native | SciBy VM | Java
524288 | 0.650 | 0.640 | 4.955
1048576 | 1.106 | 1.118 | 12.099
2097152 | 1.917 | 1.944 | 27.878
4194304 | 3.989 | 3.963 | 61.423
8388608 | 7.796 | 7.799 | 134.399

TABLE IV. Comparison of the performance of an FFT application on a quad dual-core Intel Xeon CPU, 1.60 GHz, 4 MB cache per core, 8 GB RAM.
is no overhead in running it in the virtual machine. The machine has immediate support for multi-threaded libraries, and therefore the single-threaded Java version is outperformed even further on multi-core architectures.

VI. RELATED WORK

GridBox [3] aims at providing a secure execution environment for grid applications by means of a sandbox environment and Access Control Lists. The execution environment is restricted by the chroot command, which isolates each application in a separate file system space. In this space, all system calls are intercepted and checked against pre-defined Access Control Lists which specify a set of allowed and disallowed actions. In order to intercept all system calls transparently, the system is implemented as a shared library that is preloaded into memory before the application executes. The drawbacks of the GridBox library are the requirement of a UNIX host system and application, and that it does not work with statically linked applications. Further, this kind of isolation can be broken if an intruder gains system privileges, leaving the host system unprotected. Secure Virtual Grid (SVGrid) [10] isolates each grid application in its own instance of a Xen virtual machine, whose file system and network access requests are forced to go through the privileged virtual machine monitor, where the restrictions are checked. Since each grid virtual machine is securely isolated from the virtual machine monitor from which it is controlled, many levels of security have to be breached in order to compromise the host system, and the system has proved its effectiveness against several malicious software tests. The performance of the system is also acceptable, with very low overhead. The only drawback is that, while the model can be applied to operating systems other than Linux, it still makes use of platform-dependent virtualization software.
The MiG-SSS system [2] seeks to combine Public Resource Computing and Grid Computing by using sandbox technology in the form of a virtual machine. The project uses a generic Linux image customized to act as a grid resource, and a screen saver that can start any type of virtual machine capable of booting an ISO image, for instance VMware Player and VirtualBox. The virtual machine boots the Linux image, which in turn retrieves a job from the Grid and executes it in the isolated sandbox environment. Java and the Microsoft Common Language Infrastructure are similar solutions that enable applications written in the Java programming language or for the Microsoft .Net framework, respectively, to be used on different computer architectures without being rewritten. They both introduce an intermediate, platform-independent code format (Java byte code and the Common Intermediate Language, respectively) executable by hardware-specific execution environments (the Java Virtual Machine and the Virtual Execution System, respectively). While these solutions have proved suitable for many application domains, performance problems and their requirement of a specific class of programming languages rarely used for scientific
applications disqualify the use of these virtual machines for this project.

VII. CONCLUSIONS AND FUTURE WORK

Virtual machines can solve many problems related to using desktop computers for Grid Computing. Most importantly, for resource owners they enforce security by means of isolation, and for researchers using the Grid they provide a level of homogeneity that greatly simplifies application deployment in an extremely heterogeneous execution environment. This paper presented the basic ideas behind the Scientific Byte Code Virtual Machine and proposed a virtual machine architecture specifically designed for executing scientific applications on any type of real hardware architecture. To this end, efficient native libraries for the most prevalent scientific software packages are an important component that the authors believe will greatly reduce the performance penalty normally incurred by virtual machines. An interpreter has been developed to produce preliminary results that justify the ideas behind the machine: it is on par with native execution and, on the intended application types, it outperforms the Java virtual machines deployed in a grid context. After the initial virtual machine has been implemented, several extensions are planned, including threading support, debugging, profiling, an advanced library for a distributed shared memory model, and support for remote memory swapping.

REFERENCES

[1] Rasmus Andersen and Brian Vinter, Transparent remote file access in the minimum intrusion grid, WETICE '05: Proceedings of the 14th IEEE International Workshops on Enabling Technologies: Infrastructure for Collaborative Enterprise (Washington, DC, USA), IEEE Computer Society, 2005, pp. 311–318.
[2] Rasmus Andersen and Brian Vinter, Harvesting idle windows cpu cycles for grid computing, GCA (Hamid R. Arabnia, ed.), CSREA Press, 2006, pp. 121–126.
[3] Evgueni Dodonov, Joelle Quaini Sousa, and Hélio Crestana Guardia, Gridbox: securing hosts from malicious and greedy applications, MGC '04: Proceedings of the 2nd workshop on Middleware for grid computing (New York, NY, USA), ACM Press, 2004, pp. 17–22.
[4] Renato J. Figueiredo, Peter A. Dinda, and José A. B. Fortes, A case for grid computing on virtual machines, ICDCS '03: Proceedings of the 23rd International Conference on Distributed Computing Systems (Washington, DC, USA), IEEE Computer Society, 2003.
[5] Ian Foster, The grid: A new infrastructure for 21st century science, Physics Today 55(2) (2002), 42–47.
[6] Matteo Frigo and Steven G. Johnson, The design and implementation of FFTW3, Proceedings of the IEEE 93 (2005), no. 2, 216–231, special issue on "Program Generation, Optimization, and Platform Adaptation".
[7] Roldan Pozo and Bruce Miller, Scimark 2.0, http://math.nist.gov/scimark2/.
[8] Yunhe Shi, David Gregg, Andrew Beatty, and M. Anton Ertl, Virtual machine showdown: stack versus registers, VEE '05: Proceedings of the 1st ACM/USENIX international conference on Virtual execution environments (New York, NY, USA), ACM, 2005, pp. 153–163.
[9] J. E. Smith and R. Nair, Virtual machines: Versatile platforms for systems and processes, Morgan Kaufmann, 2005.
[10] Xin Zhao, Kevin Borders, and Atul Prakash, Svgrid: a secure virtual environment for untrusted grid applications, MGC '05: Proceedings of the 3rd international workshop on Middleware for grid computing (New York, NY, USA), ACM Press, 2005, pp. 1–6.
Interest-oriented File Replication in P2P File Sharing Networks Haiying Shen Department of Computer Science and Computer Engineering University of Arkansas, Fayetteville, AR 72701
Abstract - In peer-to-peer (P2P) file sharing networks, file replication avoids overloading the file owner and improves file query efficiency. Most current methods replicate a file along the query path from a client to a server. These methods lead to a large number of replicas and low replica utilization. Aiming to achieve high replica utilization and efficient file query, this paper presents an interest-oriented file replication mechanism. It clusters nodes based on node interest. Replicas are shared by nodes with common interest, leading to fewer replicas, lower overhead, and enhanced file query efficiency. Simulation results demonstrate the effectiveness of the proposed mechanism in comparison with another file replication method: it dramatically reduces the overhead of file replication and improves replica utilization.

Keywords: File replication, Peer-to-peer, Distributed hash table
1. Introduction

Over the past years, the immense popularity of the Internet has produced a significant stimulus to peer-to-peer (P2P) file sharing networks [1, 2, 3, 4, 5]. A popular file with a very frequent visit rate will overload the node holding it, leading to slow responses to file requests and low query efficiency. File replication is an effective method to deal with the overload caused by hot files. Most current file replication methods [6, 7, 8, 9, 10, 11] replicate a file along the query path between a requester and a file owner to increase the possibility that a query encounters a replica node during routing. We use Path to denote this class of methods. In Path, a file query still needs to be routed until it encounters a replica node or the file owner; however, these methods cannot guarantee that a request meets a replica node. To enhance the effectiveness of file replication on query efficiency, this paper presents an interest-oriented file replication mechanism, namely
Cluster. It groups nodes with a common interest into a cluster. Cluster is novel in that replicas are shared by nodes with common interest, leading to fewer replicas, lower overhead, and enhanced file query efficiency. The rest of this paper is structured as follows. Section 2 presents a concise review of representative file replication approaches for P2P file sharing networks. Section 3 presents the Cluster file replication mechanism, including its structure and algorithms. Section 4 evaluates the performance of Cluster in comparison to other approaches with a variety of metrics, and analyzes the factors affecting file replication performance. Section 5 concludes this paper.
2. Related Work

File replication in P2P systems is designed to relieve the load on hot spots and meanwhile decrease file query time. PAST [12] replicates each file on a set number of nodes whose IDs match most closely to the file owner's ID. It has a load balancing algorithm for non-uniform storage node capacities and file sizes, and uses caching along the lookup path for files of non-uniform popularity, to minimize fetch distance and to balance the query load. Similarly, CFS [6] replicates blocks of a file on nodes immediately after the block's successor on the Chord ring [1]. Stading et al. [7] proposed replicating a file in locality-close nodes near the file owner. LAR [13] and Gnutella [14] replicate a file in overloaded nodes at the file requester. Backslash [7] pushes caches one hop closer to requester nodes as soon as nodes are overloaded. Freenet [8] replicates objects both on insertion and on retrieval along the path from the requester to the target. CFS [6], PAST [12], LAR [13], CUP [9] and DUP [10] perform caching along the query path. Cox et al. [11] studied providing DNS service over a P2P network; they cache index entries, which are DNS mappings, along search query paths. Ghodsi et al. [15] proposed a symmetric replication
scheme in which a number of IDs are associated with each other, and any item with an ID can be replicated in the nodes responsible for the IDs in this group. HotRoD [16] is a DHT-based architecture with a replication scheme. An arc of nodes (i.e., successive nodes on the ring) is "hot" when at least one of these nodes is hot. In the scheme, "hot" arcs of nodes are replicated and rotated over the ID space. By tweaking the degree of replication, it can trade off replication cost for load balance. Path methods still require file request routing, and they cannot ensure that a request meets a replica node during routing; thus, they cannot significantly improve file query efficiency. Rather than replicating a file at a single requester, Cluster replicates a file for nodes with a common interest. Consequently, the file replication overhead is reduced, and replica utilization and query efficiency are increased.
3. Interest-oriented File Replication

3.1. Overview

We use the Chord Distributed Hash Table (DHT) P2P system [1] as an example to explain file replication in P2P file sharing networks. Without loss of generality, we assume that nodes have interests and that these interests can be uniquely identified. A node's interests are described by a set of attributes with globally known string descriptions such as "image", "music" and "book". The interest attributes are fixed in advance for all participating nodes. Each interest corresponds to a category of files, and a node frequently requests the files it is interested in. Strategies that allow the content in a node to be described with metadata [17, 18, 19, 20, 21] can be used to derive the interests of each node. Due to space limits, we do not explain the details of these strategies.
3.2. Cluster Structure Construction

Consistent hash functions such as SHA-1 are widely used in DHT networks for node and file IDs due to their collision-resistant nature. Using such a hash function, it is computationally infeasible to find two different messages that produce the same message digest. Therefore, a consistent hash function is effective for clustering interest attributes based on their differences: identical interest attributes have the same consistent hash value, while different interest attributes have different hash values. Next, we introduce how to use the consistent hash function to cluster nodes based on interest. To facilitate such clustering, the information of nodes with a
common interest should be marshaled in one node in the DHT network, so that these nodes can locate each other in order to constitute a cluster. Although logically close nodes do not necessarily have a common interest, Cluster makes common-interest nodes report their information to the same node. In a DHT overlay [1], an object with a DHT key is allocated to a node by the interface Insert(key, object), and the object can be found by Lookup(key). In Chord, the object is assigned to the first node whose ID is equal to or follows the key in the ID space; if two objects have similar keys, they are stored in the same node. We use H to denote the consistent hash value of a node's interest, and Info to denote the information of the node, including its IP address and ID. Because H distinguishes node interests, if nodes report their information to the DHT overlay with H as the key by Insert(H, Info), the information of common-interest nodes with similar H will reach the same node, which is called the repository node. As a result, a group of information in a repository node is the information of nodes with a common interest. The repository node can further classify the information of the nodes based on their locality, which can be included in the reported information. The work in [22] introduced a method to obtain the locality information of a node; please refer to [23] for the details of methods to measure a node's locality. Therefore, a node can find other nodes with the same interest in its repository node by Lookup(H). In each cluster, the highest-capacity node is elected as the server of the other nodes in the cluster (i.e., the clients). Thus, each client has a link connecting it to its server, and a server connects to a group of clients in its cluster. The server has an index of all files and file replicas in its clients. Every time a client accepts or deletes a file replica, it reports to the server.
A server uses broadcasting to communicate with its clients. A P2P overlay is characterized by dynamism, in which nodes join, leave and fail continually. The structure maintenance mechanism in my previous work [24] is adopted for the Cluster structure maintenance. These techniques are orthogonal to our study in this paper.
3.3. File Replication and Query Algorithms

Cluster reduces the number of replicas, increases replica utilization and significantly improves file query efficiency. Rather than replicating a file to all nodes in a query path, it considers the request frequency of a group of nodes and makes replicas for nodes with a common interest. Since common-interest nodes are grouped in a cluster, a node can get its frequently-requested files in its
Figure 1. Average path length (plotted against the number of replicating operations when overloaded, for Cluster and Path).
own cluster without request routing in the entire system, which significantly improves file query efficiency. In addition to file owners, we assume other nodes can also replicate files; therefore, file owners and servers are responsible for file replication. For simplicity, in the following we use server to represent both. When overloaded, a server replicates the file in a node in a cluster. Recall that the information of common-interest nodes is further classified into sub-groups based on node locality or query visit rate. The server chooses the sub-group with the highest file query frequency, and then the node with the highest file query frequency within it. Unlike Path methods, which replicate a file in all nodes in a query path, Cluster avoids unnecessary file replications by replicating a file only for a group of frequent requesters. This guarantees that file replicas are fully utilized: requesters that frequently query for a file f can get f from themselves or their cluster without query routing. Thus, the replication algorithm improves file query efficiency and saves file replication overhead. Considering that file popularity is non-uniform and time-varying and that node interest varies over time, some file replicas become unnecessary when there are few queries for them. To cope with this, Cluster lets each server keep track of the file visit rates of replica nodes and periodically remove under-utilized replicas. If a file is no longer requested frequently, there will be few replicas of it. This adaptation to cluster query rate ensures that all file replicas are worthwhile and that no overhead is wasted maintaining unnecessary replicas. When node i requests a file, if the file is not among the requester's interests, the node uses the DHT Lookup(key) function to query the file. Otherwise, node i first queries the file in its cluster, among nodes interested in the file.
Specifically, in the first step, node i sends a request to its server in the cluster of this interest. The server searches its index for the requested file in the cluster. Searching among interested nodes has a high probability of finding a replica of the file. If the query still fails after these steps, node i resorts to the Lookup(key) function.
4. Performance Evaluation
This section presents the performance evaluation of Cluster in the average case, in comparison with Path. We use replica hit rate to denote the percentage of queries that are resolved by replica nodes among all queries. The experimental results demonstrate the superiority of Cluster over Path in terms of average lookup path length, replica hit rate, and the number of replicas. In the experiments, when overloaded, a file owner or a client's server conducts a file replicating operation: Cluster replicates a file to a single node, while Path replicates a file to a number of nodes along a query path. The number of nodes was set to 2048. We assumed there were 200 interest attributes, and each attribute had 500 files. We assumed a bounded Pareto distribution for the capacity of nodes: the shape of the distribution was set to 2, the lower bound of a node's capacity to 500, and the upper bound to 50000. This distribution reflects real-world situations where machine capacities vary by different orders of magnitude. The number of queried files was set to 50, and the number of queries per file to 1000. The file requesters and the queried files were randomly chosen in the experiments.
4.1. Effectiveness of File Replication
Figure 2. Replica hit rate (plotted against the number of replicating operations when overloaded, for Cluster and Path).

Figure 1 plots the average path length of Cluster and Path. We can see that Cluster leads to a shorter path length than Path. Unlike Cluster, which replicates a file in only one node per file replicating operation, Path replicates a file in the nodes along a query path; more replica nodes raise the chance that a request meets a replica, but Path cannot guarantee that every query encounters one. Cluster achieves much higher lookup efficiency with far fewer replicas, which illustrates the effectiveness of replicating files for a group of nodes: a node can get a file directly from a node in its own cluster, and a replica is shared within the group, which increases replica utilization and reduces the lookup path length.

Figure 2 shows the replica hit rate of the two approaches. We can observe that Cluster leads to a higher hit rate than Path. As in Figure 1, Path replicates files at the nodes along the routing path, and more replica nodes give a higher probability that a file request meets a replica node; however, this gain is outweighed by the much larger number of replicas, and Path still cannot ensure that each request is resolved by a replica node. Cluster replicates a file for a group of common-interest nodes, which improves the probability that a file query is resolved by a replica node, leading to a higher hit rate.
4.2 Efficiency of File Replication
Figure 3 illustrates the total number of replicas in Cluster and Path. The number of replicas increases with the number of replicating operations, since more replicating operations for a file produce more replicas in total. Path generates far more replicas than Cluster: in each file replicating operation, Path replicates a file at multiple nodes along a routing path, while Cluster replicates it at a single node. In Cluster, a replica is fully utilized by being shared among a group of nodes, yielding a high replica
hit rate and reducing the possibility that the file owner is overloaded. Thus, the file owner performs fewer replicating operations, which leads to fewer replicas and less overhead for replica maintenance. (Figure 3. Number of replicas.)
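As a back-of-the-envelope model of why Path creates more replicas (an illustration of our own, not from the paper), suppose each overload event triggers one replicating operation; Path then adds roughly one replica per node on the query path, while Cluster adds one per operation:

```python
def total_replicas(strategy, operations, avg_path_length=6):
    """Replicas created after `operations` overload-triggered replications.

    Path replicates at every node along the query path (approximated here
    by an assumed fixed average path length); Cluster replicates at one node.
    """
    per_op = avg_path_length if strategy == "path" else 1
    return operations * per_op

# With 25 operations and an assumed 6-hop average path,
# Path creates 150 replicas versus Cluster's 25.
```

The 6-hop average is purely an assumption for illustration; the experiments behind Figure 3 measure the actual counts.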
5 Conclusions
Most current file replication methods for P2P file-sharing networks incur prohibitively high overhead by replicating a file at all nodes along a query path from a client to a server. This paper proposes an interest-oriented file replication mechanism that generates a replica for a group of nodes with the same interest. The mechanism reduces the number of file replicas while guaranteeing high query efficiency and high replica utilization. Simulation results demonstrate the superiority of the proposed mechanism over another file replication approach: it dramatically reduces the overhead of file replication and yields significant improvements in lookup efficiency and replica hit rate.

Acknowledgements

This research was supported in part by the Acxiom Corporation.
CRAB: an Application for Distributed Scientific Analysis in Grid Projects

D. Spiga1,3,6, S. Lacaprara2, M. Cinquilli3, G. Codispoti4, M. Corvo5, A. Fanfani4, F. Fanzago5,6, F. Farina6,7, C. Kavka8, V. Miccio6, and E. Vaandering9

1 University of Perugia, Perugia, Italy; 2 INFN Legnaro, Padova, Italy; 3 INFN Perugia, Perugia, Italy; 4 INFN and University of Bologna, Bologna, Italy; 5 CNAF, Bologna, Italy; 6 CERN, Geneva, Switzerland; 7 INFN Milano-Bicocca, Milan, Italy; 8 INFN Trieste, Trieste, Italy; 9 FNAL, Batavia, Illinois, USA
Abstract - Starting from 2008, the CMS experiment will produce several petabytes of data each year, to be distributed over many computing centers located in many different countries. The CMS computing model defines how the data are distributed so that CMS physicists can efficiently run their analyses over them. CRAB (CMS Remote Analysis Builder) is the tool, designed and developed by the CMS collaboration, that provides access to the distributed data in a fully transparent way. Its main feature is the ability to distribute and parallelize local CMS batch data-analysis processes over different Grid environments. CRAB interacts with the local user environment, the CMS Data Management services, and the Grid middleware. Keywords: Grid Computing, Distributed Computing, Grid Application, High Energy Physics Computing.
Introduction

The Compact Muon Solenoid (CMS)[1] is one of the two large general-purpose particle physics detectors installed at the proton-proton Large Hadron Collider (LHC)[2] at CERN (European Organization for Nuclear Research) in Switzerland. The CMS detector has 15 million channels, through which data will be taken at a rate of the order of TB/s and filtered by an on-line selection system (trigger) that reduces the rate from 40 MHz (the LHC frequency) to about 100 Hz written to storage, corresponding to roughly 100 MB/s and 2 PB of data per year. This challenging experiment is a collaboration of about 2600 physicists from 180 scientific institutes all over the world. The quantity of data to analyze (and to simulate) requires plenty of resources to satisfy the experiment's computational requirements: large disk space to store the data, and many CPUs on which to run the physicists' algorithms. A way is also needed to make all data and shared resources accessible to everyone in the collaboration. This environment has led the CMS collaboration to define an ad-hoc computing model that addresses these problems. It relies on Grid computing resources, services, and toolkits as basic building blocks, making realistic requirements on Grid services. CMS decided to use a combination of tools provided by the LCG (LHC Computing Grid)[3] and OSG (Open Science Grid)[4] projects, together with specialized CMS tools. In this environment, the computing system is arranged in tiers (Figure 1, "Multi-tier architecture based on distributed resources and Grid services"), with the majority of computing resources located away from the host lab. The system is geographically distributed, consistent with the nature of the CMS collaboration itself:
• The Tier-0 computing centre is located at CERN and is directly connected to the experiment for the initial processing,
reconstruction and data archiving.
• A set of large Tier-1 centers to which the Tier-0 distributes data (processed and not); these centers also provide considerable services for different kinds of data reprocessing.
• A typical Tier-1 site distributes the processed data to smaller Tier-2 centers, which have powerful CPU resources to run analysis tasks and Monte Carlo simulations.
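As a quick sanity check of the rates quoted in the introduction (assuming an average event size of about 1 MB, a figure not stated explicitly in the text):

```python
event_rate_hz = 100        # trigger output rate after selection
event_size_mb = 1.0        # assumed average event size (our assumption)

rate_mb_s = event_rate_hz * event_size_mb      # 100 MB/s, as quoted

# 2 PB/year at 100 MB/s implies about 2e9 MB / (100 MB/s) = 2e7 seconds
seconds_of_running = 2e9 / rate_mb_s           # roughly 230 days of data taking
```

So 2 PB per year corresponds to about 230 days of running at full rate, consistent with an accelerator that does not deliver collisions year-round.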
Distributed analysis model in CMS

For the CMS experiment, the Grid computing environment poses many difficulties due to the large amount of widely distributed resources. The CMS Workflow Management System (CMS WM) manages the large-scale data processing and reduction that is the principal focus of experimental HEP computing. The CMS WM is the main path for managing and accessing the data, giving the user a single interface to both the generic Grid services and the experiment-specific services as one common environment. The Grid services mainly consist of the Grid WMS, which accepts jobs, performs match-making, and distributes the jobs to Computing Elements (CEs); a CE manages local queues that point to the resources of a specific site, such as the Worker Nodes where the jobs run; finally, Storage Elements (SEs) are logical entities that guarantee uniform access to the storage area where the data reside. As part of its computing model, CMS has chosen a baseline in which the bulk of the experiment-wide data is pre-located at sites, so that the Workload Management (WM) system submits jobs to the correct CE. The Dataset Bookkeeping System (DBS) allows various forms of event data to be discovered and accessed in a distributed computing environment. The analysis model[5] is batch-like and consists of a few main steps: the user runs interactively on small samples of the data to develop and test his code; once the code is ready, the user selects a larger dataset and submits the very same code to analyze it; the results are then made available to the user for interactive analysis. The analysis can be done in steps, saving the intermediate results and iterating over the latest ones. The distributed analysis workflow over the Grid relies on the Workload Management System, which is not directly user-oriented.
Indeed, the analysis flow in this distributed environment is a more complex computing task, because it presumes knowing which data are available, where the data are stored and how to access them, and which resources are available and able to satisfy the analysis requirements, on top of the Grid and CMS infrastructure details described above.
The CMS Remote Analysis Builder

Users do not want to deal with the issues described above; they want to analyze data in a simple way. The CMS Remote Analysis Builder (CRAB)[6] is the application designed
and deployed by the CMS collaboration that, following the CMS WM, gives end physicists transparent access to distributed resources over the Grid. CRAB performs three main operations:
• interaction with the CMS analysis framework (CMSSW), used by the users to develop the applications that run over the data;
• the data discovery step, interacting with the CMS data management infrastructure, in which the required data are found and located;
• the Grid-specific steps, from submission to output retrieval, which are fully handled.
The typical workflow (Figure 2, "CRAB in the CMS WM") involves the concepts of task and job. A task corresponds to the high-level objective of a user (running an analysis over a defined dataset). A job is the traditional queue-system concept: a single instance of an application started on a worker node with a specific configuration and output. A task is generally composed of many jobs. A typical analysis workflow in this context consists of:
• data discovery, to determine the Storage Elements of the sites storing the data (using DBS);
• preparation of the input sandbox, a package with the user application and related libraries;
• job preparation, which creates a wrapper around the user executable, prepares the environment in which the user application runs (at the WN level), and handles the produced output;
• job splitting, which takes into account the specific data information, the data distribution, and the coarseness requested by the user;
• Grid job configuration, a file written in the Job Description Language (JDL), interpreted by the WMS, that contains the job requirements;
• task (jobs) submission to the Grid;
• monitoring of the submitted jobs, with the WMS checking their progress;
• once a job is finished from the Grid point of view, output retrieval, which handles the job output (possibly including a copy of the output to a Storage Element) through the
output-sandbox. CRAB is used on the User Interface (UI), which is the access point to the Grid and where the middleware client is available. The user interacts with CRAB via a simple configuration file, divided into a few main sections, and then via the CLI. The configuration file holds all the task- and job-specific parameters; after the user has developed and tested
locally his own analysis code, he specifies which application to run, the dataset to analyze, the general requirements on the input dataset such as the job-splitting parameters, and how to treat the output, which can be retrieved back to the UI or copied directly to an existing Storage Element. There are also post-output-retrieval operations that users can execute, including data publication, which registers user data in a local DBS instance to allow easy access to the user-registered data.
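The job-splitting step mentioned above can be sketched as follows (a hypothetical helper of our own devising, not CRAB's actual code): given the total number of events in the chosen dataset and the per-job coarseness requested by the user, it yields the event range each job will process:

```python
def split_jobs(total_events, events_per_job):
    """Return a (first_event, n_events) pair for each job so that the
    jobs together cover the whole dataset without overlap."""
    jobs = []
    first = 0
    while first < total_events:
        n = min(events_per_job, total_events - first)
        jobs.append((first, n))
        first += n
    return jobs

# e.g. 2500 events at 1000 events/job -> three jobs: two full, one partial
```

The real splitting also accounts for how the data files are distributed across sites, which this sketch ignores.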
CRAB Architecture and Implementation

CRAB is developed in Python, which allows short development times and easier maintenance, in addition to not requiring compilation. CRAB can perform three main kinds of job submission, each fully transparent to the user:
• direct submission to the Grid, interacting directly with the middleware;
• direct submission to local resources and their queues, using the batch system, in an LSF (Load Sharing Facility) environment;
• submission to the Grid through the CRAB server, a layer to which the interaction with the middleware and the task/job management is delegated.
Currently the major development effort is devoted to the client-server implementation. The client runs on the User Interface, while the server can be located anywhere on the Grid. The CRAB client is used directly by the users and performs the operations involved in task/job preparation and creation: data discovery, input-sandbox preparation, job preparation (including job splitting), and requirement definition. The client then makes a request, completely transparent to the user, to the CRAB server. The server fulfills each request, handling the task and performing the related flow for every kind of Grid interaction: job submission to the Grid; automatic job monitoring; automatic job output retrieval; resubmission of failed jobs, following rules specific to the different kinds of job failure; any specific command requested by the user, such as job killing; and notifying the user by email when the task reaches a specified level of completion and when it is fully ended (the output of every job is ready).
This partitioning of operations between client and server automates the interaction with the Grid as much as possible, reducing unnecessary human load: all possible actions live on the server side (with a minimum on the client side), centralizing the Grid interaction so that any kind of trouble is handled in a single place. It also improves the scalability of the whole system[7]. The communication between client and server uses SOAP[8], chosen for some obvious reasons: it is a de facto standard in the Grid service development community, it runs over HTTP, and it provides interoperability across institutions and implementation languages. The client assumes nothing about the implementation details of the server, and vice versa. In this case
the SOAP-based communication is developed using gSOAP[9]. gSOAP provides a cross-platform toolkit for developing server and client applications, avoiding the need to maintain any custom protocol. It does not require a pre-installed runtime environment; starting from a WSDL (Web Services Description Language) description, it generates code in ANSI C/C++. The internal CRAB server architecture (Figure 3, "CRAB Server") is based on components implemented as independent agents communicating through an asynchronous, persistent message service (a publish-subscribe model) backed by a MySQL[10] database. Each agent takes charge of specific operations, allowing a logically modular approach. The current server implementation provides the following components:
• CommandManager: endpoint of the SOAP service that handles commands sent by the client;
• CrabWorker: performs direct job submission to the Grid;
• TaskTracking: updates information about tasks under execution by polling the database;
• Notification: notifies the user by e-mail when his task has ended and the output has been retrieved; it also notifies the server administrator of special warning situations;
• TaskLifeManager: manages the task lifecycle on the server, cleaning up ended tasks;
• JobTracking: tracks the status of every job;
• GetOutput: retrieves the output of ended jobs;
• JobKiller: kills single or multiple jobs on request;
• ErrorHandler: performs basic error handling that allows jobs to be resubmitted;
• RSSFeeder: provides RSS channels to forward information about the server status;
• AdminComponent: executes specific server maintenance operations.
Many of the components listed above are implemented with a multithreaded approach, using safe connections to the database. This allows many tasks to be managed at the same time, shortening and often completely removing the delay for a single operation that must be performed on many tasks.
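A minimal sketch of such a database-backed publish-subscribe message service (using SQLite in place of MySQL; the table schema and function names are our own invention, not CRAB's):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE messages (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    dest TEXT, payload TEXT, delivered INTEGER DEFAULT 0)""")

def publish(dest, payload):
    """Persist a message addressed to the named component (agent)."""
    db.execute("INSERT INTO messages (dest, payload) VALUES (?, ?)",
               (dest, payload))
    db.commit()

def consume(dest):
    """Fetch all pending messages for a component and mark them delivered."""
    rows = db.execute(
        "SELECT id, payload FROM messages WHERE dest = ? AND delivered = 0",
        (dest,)).fetchall()
    db.executemany("UPDATE messages SET delivered = 1 WHERE id = ?",
                   [(r[0],) for r in rows])
    db.commit()
    return [r[1] for r in rows]

publish("JobTracking", "task 42 submitted")
```

Because messages persist until consumed, an agent can crash and restart without losing work, which is the property the asynchronous, persistent design is after.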
The use of threaded components is particularly important when interacting with the Grid middleware, where some operations (e.g., acting on a bulk of jobs at the same moment) take a non-negligible time. Two important entities in the CRAB server architecture are the WS-Delegation and a dedicated area on an existing Storage Element. The WS-Delegation is a
compliant service for delegating the user proxy from the client to the server; this allows the server to perform each Grid operation for a given task with the corresponding user proxy. The SE is used to transfer the input/output sandboxes between the User Interface and the Grid, working as a sort of drop-box area. The server has a dedicated interface, made up of a set of APIs and a core of hierarchical classes implementing different protocols, which allows it to interact transparently with the associated remote area regardless of the transfer protocol. The ability of the server to interact with any associated storage server, independently of the protocol, makes it a portable and scalable component: the Storage Element hosting the job sandboxes is completely independent of the server, so the CRAB server is readily adaptable to different environments and configurations. It is also envisaged to have a local disk area mounted on the local CRAB server instance, with an associated GridFTP server, to be used for the sandbox transfers instead of a remote Storage Element. The interaction with the Grid is performed through the BossLite framework, included in the server core. This framework can be considered a thin layer between the CRAB server and the Grid, used to interact with the middleware and to maintain specific information about tasks and jobs. BossLite consists of a set of APIs that act as an interface to its core; the core maps the database objects (e.g., task, job) and executes specific Grid operations on database-loaded objects.
Conclusions

CRAB has been in production for three years and is the only computing tool in CMS used by generic physicists. It is widely used inside the collaboration, with more than 600 distinct users during 2007 and about 50 distinct Tier-2 sites involved in Grid analysis activities. The CRAB tool is continuously evolving, and the current architecture makes it simple to add new components to the structure to support new use cases as they come up.
References
1. The CMS experiment, http://cmsdoc.cern.ch
2. The Large Hadron Collider Conceptual Design Report, CERN/AC/95-05
3. LCG Project: LCG Technical Design Report, CERN TDR-01, CERN-LHCC-2005-024, June 2005
4. The Open Science Grid project, http://www.opensciencegrid.org
5. The CMS Technical Design Report, http://cmsdoc.cern.ch/cms/cpt/tdr
6. D. Spiga, S. Lacaprara, W. Bacchi, M. Cinquilli, G. Codispoti, M. Corvo, A. Dorigo, A. Fanfani, F. Fanzago, F. Farina, M. Merlo, O. Gutsche, L. Servoli, C. Kavka (2007). "The CMS Remote Analysis Builder (CRAB)". High Performance Computing - HiPC 2007, 14th International Conference, Goa, India, vol. 4873, pp. 580-586
7. D. Spiga, S. Lacaprara, W. Bacchi, M. Cinquilli, G. Codispoti, M. Corvo, A. Dorigo, F. Fanzago, F. Farina, O. Gutsche, C. Kavka, M. Merlo, L. Servoli, A. Fanfani (2007). "CRAB: the CMS distributed analysis tool development and design". Hadron Collider Physics Symposium 2007, La Biodola, Isola d'Elba, Italy, vol. 177-178C, pp. 267-268
8. SOAP Messaging Framework, http://www.w3.org/TR/soap12-part1
9. The gSOAP Project, http://www.cs.fsu.edu/~engelen/soap.html
10. MySQL Open Source Database, http://www.mysql.com
Fuzzy-based Adaptive Replication Mechanism in Desktop Grid Systems

HongSoo Kim, EunJoung Byun, Dept. of Computer Science & Engineering, Korea University, {hera, vision}@disys.korea.ac.kr
JoonMin Gil, Dept. of Computer Education, Catholic University of Daegu, [email protected]
JaeHwa Chung, SoonYoung Joung, Dept. of Computer Science Education, Korea University, {bigbearian, jsy}@comedu.korea.ac.kr

Abstract

In this paper, we discuss the design of a replication mechanism that guarantees correctness and supports deadline tasks in desktop grid systems. Both correctness and performance are important issues in the design of such systems. To guarantee the correctness of results, voting-based and trust-based sabotage-tolerance mechanisms are generally used. However, these mechanisms suffer from two potential shortcomings: waste of resources due to running redundant replicas of a task, and increased turnaround time due to the inability to deal with dynamic and heterogeneous environments. In this paper, we propose a Fuzzy-based Adaptive Replication Mechanism (FARM) for sabotage tolerance with deadline tasks, based on a fuzzy inference process over each volunteer's trust and result-return probability. Using these two parameters, our desktop grid system provides both sabotage tolerance and a reduction in turnaround time. In addition, simulation results show that, compared with existing mechanisms, FARM reduces resource waste in replication without increasing the turnaround time.
1. Introduction

Desktop grid computing is a means of carrying out high-throughput scientific applications using the idle time of desktop computers (PCs) connected to the Internet [5]. It has been used for massively parallel applications composed of numerous instances of the same computation. These applications usually involve scientific problems that require
large amounts of sustained processing capacity over long periods. In recent years, interest in desktop grid computing has increased because of the success of its most popular examples, such as SETI@Home [10] and distributed.net [16]. A number of studies have produced desktop grid systems that provide an underlying platform, such as BOINC [15], Entropia [17], Bayanihan [18], XtremWeb [19], and Korea@Home [6]. One of the main characteristics of desktop grid computing is that computing resources, referred to as volunteers, are free to leave or join, which results in a great deal of node volatility. Desktop grid systems (DGSs) therefore lack reliability, because their computing resources are uncontrolled and unspecified, and they are inevitably exposed to sabotage through erroneous results from malicious resources. When a malicious volunteer submits bad results to a server, this may invalidate all other results. For example, it has been reported that SETI@Home suffered from malicious behavior by some of its volunteers, who faked the number of tasks completed; other volunteers faked their results by using different or modified client software [2, 10]. Consequently, DGSs should be equipped with a sabotage-tolerance mechanism to protect them from intentional attacks by malicious volunteers [22, 23]. In previous studies, the verification of work results was accomplished by voting- and trust-based mechanisms. In voting-based sabotage-tolerance (VST) mechanisms [14], when a task is distributed in parallel to n volunteers, k or more of them (k ≤ n) must return the same result to guarantee result verification for the task. This mechanism is often called the k-out-of-n system. In DGSs, all n volunteers can be assumed to be stochastically identical and functionally independent. This mechanism is simple and
straightforward, but it is inefficient because it wastes resources. Trust-based sabotage-tolerance (TST) mechanisms [3, 4, 7, 9, 21], on the other hand, require lower redundancy for result verification than voting mechanisms. Instead, a lightweight task whose correct result is already known is distributed periodically to volunteers; the server obtains a trust value for each volunteer by counting how many of these lightweight tasks are returned correctly. This trust value is used as a key factor in the scheduling phase. However, these mechanisms are based on first-come first-served (FCFS) scheduling, which typically allocates tasks to resources as the resources become available, without considering applications whose tasks must be completed before a certain deadline. From the viewpoint of result verification, FCFS scheduling yields a high turnaround time because it cannot cope effectively with dynamic environments, in which volunteers leave or join the system due to interference from other priorities or hardware failures. A task allocated to a highly dynamic volunteer is susceptible to failure and must then be reallocated to other volunteers, which increases the task's turnaround time; if the task must also be completed within a specific time (i.e., a deadline), its turnaround time suffers further. To provide DGSs with result verification that supports task deadlines, this paper proposes the Fuzzy-based Adaptive Replication Mechanism (FARM) for sabotage tolerance, based on the trust and result-return probabilities of volunteers. FARM combines the benefits of the voting-based mechanism with those of the trust-based mechanism. First, we devised autonomous sampling with mobile agents to evaluate the trust and result-return probability of each volunteer. In this scheme, volunteers receive a sample whose result is already known.
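The k-out-of-n acceptance rule described above can be sketched as follows (a minimal illustration of our own, not the paper's code):

```python
from collections import Counter

def k_out_of_n_accept(results, k):
    """Accept a task's result only if at least k of the n returned
    results agree; otherwise return None (verification failed)."""
    value, count = Counter(results).most_common(1)[0]
    return value if count >= k else None

# Three volunteers with majority threshold k = 2:
# ["A", "A", "B"] is accepted as "A"; ["A", "B", "C"] is rejected.
```

The cost of this scheme is visible directly in the signature: every task consumes n volunteer executions to produce one verified result.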
The result computed by a volunteer is compared to the original result of the sample to estimate the volunteer's trust. In addition, the volunteer's result-return probability is calculated from the volunteer's availability and performance. These trust and result-return probability values are mapped to fuzzy sets through a fuzzy inference process. The characteristic function of a fuzzy set may take any value between 0 and 1, denoting the degree of membership of an element in the set. For the transformation from an application's requirements (i.e., correctness and deadline) to fuzzy sets, we provide empirical membership functions. In this paper, the fuzzy inference process determines the replication number from the trust probability and result-return probability of each volunteer. In addition, simulation results show that our mechanism reduces the turnaround time compared with the voting-based and trust-based mechanisms. FARM is also superior to the other two mechanisms in terms of the efficient use of resources. (Figure 1. Desktop grid environments.)

The rest of this paper is organized as follows. In Section 2, we present the desktop grid environment. Section 3 describes the fuzzy-based adaptive replication mechanism for sabotage tolerance with deadline tasks. In Section 4, we present the implementation and performance evaluation. Finally, our conclusions are given in Section 5.
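As an illustration of such membership functions and their use in inference (the linear shoulder shape, the 0.4/0.8 thresholds, and the defuzzification rule are our own assumptions; the paper's empirical functions are not reproduced here):

```python
def low(p):
    """Degree to which probability p belongs to the 'low' set:
    1 below 0.4, falling linearly to 0 at 0.8 (assumed thresholds)."""
    if p <= 0.4:
        return 1.0
    if p >= 0.8:
        return 0.0
    return (0.8 - p) / 0.4

def replication_number(tp, rp, max_replicas=5):
    """Infer a replication count: the lower the trust probability (TP)
    or result-return probability (RP), the more replicas we run."""
    risk = max(low(tp), low(rp))          # fuzzy OR of the two 'low' degrees
    return 1 + round(risk * (max_replicas - 1))
```

A fully trusted, highly available volunteer (TP = RP = 0.9) gets a single copy of the task, while an untrusted one (TP = 0.2) triggers maximal replication.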
2. Desktop Grid Environment

Figure 1 shows the overall architecture of our DGS model. As shown in Fig. 1, the model consists of clients, an application management server, a storage server, coordinators, and volunteers. A client submits its own application to the application management server. A coordinator looks after scheduling, computation-group management, and agent management. A volunteer is a resource provider that contributes its own resources to process large-scale applications during CPU idle time. In this model, volunteers and a coordinator are organized into a computation group, the unit of scheduling. Within this group, the coordinator reorganizes work groups, the unit of task execution, according to the properties of each volunteer when the coordinator's scheduler allocates tasks to the volunteers. Our DGS includes several phases:
1. Registration Phase: Volunteers register their static information (e.g., CPU speed, memory capacity, OS type) with the application management server, which then sends this information to the coordinators.
2. Job Submission Phase: A client submits a large-scale application to the application management server.
3. Task Allocation Phase: The application management server splits the application into a set of tasks and allocates them to coordinators.
4. Load Balancing Phase: Each coordinator inspects the number of tasks in its task pool and balances the load, either periodically or on demand, by transferring some tasks to other coordinators.
5. Adaptive Scheduling Phase: Each coordinator assigns the tasks in its task pool according to the properties of available resources using the fuzzy inference process.
6. Result Collection Phase: Each coordinator collects results from volunteers and performs result verification.
7. Job Completion Phase: Each coordinator returns a set of correct results to the application management server.

The application submitted by a client is divided into a sequence of batches, each of which consists of mutually independent tasks in the set W = {w1, w2, ..., wn}, where n is the total number of tasks. This type of application is known as the single program multiple data (SPMD) model, which uses the same code for different data. It has been used as the typical application model in most DGSs, and thus we used an SPMD-type application in the present study.
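The SPMD batch structure above can be sketched as a simple partitioning step. The function below is illustrative only: the paper gives no code, and the even-slicing scheme and all names are assumptions.

```python
# Hypothetical sketch: the application management server splits an SPMD
# application into a batch of mutually independent tasks W = {w_1, ..., w_n},
# each running the same code over a different slice of the input data.

def split_into_tasks(data, n_tasks):
    """Partition `data` into n_tasks roughly equal, independent slices."""
    size, rem = divmod(len(data), n_tasks)
    tasks, start = [], 0
    for i in range(n_tasks):
        end = start + size + (1 if i < rem else 0)  # spread the remainder
        tasks.append({"id": i, "slice": data[start:end]})
        start = end
    return tasks

# Each task carries the same program but different data (SPMD).
batch = split_into_tasks(list(range(10)), 3)
```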
3. Fuzzy-based Adaptive Replication

This section describes the fuzzy-based adaptive replication mechanism (FARM) for sabotage tolerance in applications with specific deadlines.
3.1. System Model

We assume that the application has been divided into tasks, that each task is independent, and that the tasks have to be returned within a deadline. Our FARM determines correctness based on a volunteer's trust probability (TP). It also uses a result-return probability (RP), based on availability and dynamic information such as current CPU performance, to estimate a volunteer's ability to complete a task within a certain deadline. In the FARM, the scheduler gives high priority to executing tasks on volunteers with high TP and high RP. For volunteers with low TP and low RP, the scheduler applies a replication policy to the tasks according to the fuzzy inference process. In a desktop grid computing environment, overall system performance is influenced by the dynamic nature of volunteers [12]. In order to classify volunteers into fuzzy sets according to this dynamic nature, we present a fuzzy inference process based on TP and RP. Both TP and RP are defined in Table 1.

Trust Probability (TP). The TP is a factor determined
by the correctness of the computation results executed by a volunteer. The trust value TP_i of the ith volunteer is

    TP_i = 1 - f/n,  if n > 0
    TP_i = 1 - f,    if n = 0        (1)

In Eq. (1), TP_i represents the trust value of a volunteer v_i, n is the number of correct results returned in our sampling scheme, and f is the probability that a volunteer chosen at random returns incorrect results.

Result-return Probability (RP). The RP is the probability that a volunteer will complete a task within a given time under computation failure. In a desktop grid environment, the task completion rate depends on both the availability and the performance (e.g., the number of floating point operations per second) of individual volunteers [12]. Therefore, we have defined RP as the probability that a volunteer will return a result before a specified deadline under computation failure. The average completion time ACT_i of each volunteer is calculated by

    ACT_i = (Σ_k γ_k) / K        (2)

where γ_k represents the completion time of the kth sample and K is the number of samples completed by volunteer i. To determine the time taken by a volunteer to perform a task through sampling by an agent, we define the estimation time ET_i of each volunteer as follows:

    ET_i = μ · t        (3)

where μ represents the number of floating point operations for a sample by a dedicated volunteer, and t is the time required by that dedicated volunteer to execute one floating point operation. Using Eq. (3), we can estimate the completion time without computation failure for volunteer v_i. Then, the average failure ratio AFR_i of volunteer i can be calculated using Eqs. (2) and (3):

    AFR_i = 1 - ET_i / ACT_i        (4)

If the time taken to complete a task by volunteer i follows an exponential distribution with rate AFR_i, then the probability of volunteer i completing task j before the deadline d_j is calculated as follows:

    RP_i(D ≤ d_j) = ∫_0^{d_j} AFR_i e^(-AFR_i x) dx = 1 - e^(-AFR_i d_j)        (5)
Table 1. Parameters.

TP (Trust Probability): a factor determining the correctness of the computation results executed by a volunteer.
ACT (Average Completion Time): the ratio of the total completion time of samples to the number of samples completed by the volunteer.
ET (Estimation Time): the number of floating point operations multiplied by the computing time of one floating point operation.
AFR (Average Failure Ratio): the average failure ratio, calculated from ET and ACT as in Eq. (4).
RP (Result-return Probability): the probability that a volunteer will complete a task within a given time under computation failure.
RD (Replication Degree): the number of replicas assigned to a volunteer through the fuzzy inference process.
where D represents the actual computation time of the volunteer.
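Eqs. (1)-(5) can be combined into a short computational sketch. This is our reconstruction, not the authors' code; in particular, the garbled Eq. (1) is read here as TP_i = 1 - f/n for n > 0, and the function and argument names merely follow the symbols in the text.

```python
import math

def trust_probability(n_correct, f):
    """Eq. (1): TP_i from n correct sample results and error probability f.
    (Our reading of the paper's formula: 1 - f/n for n > 0, else 1 - f.)"""
    return 1.0 - f / n_correct if n_correct > 0 else 1.0 - f

def result_return_probability(sample_times, mu, t, deadline):
    """Eqs. (2)-(5): probability that volunteer i finishes before `deadline`."""
    act = sum(sample_times) / len(sample_times)   # Eq. (2): ACT_i
    et = mu * t                                   # Eq. (3): ET_i
    afr = 1.0 - et / act                          # Eq. (4): AFR_i
    return 1.0 - math.exp(-afr * deadline)        # Eq. (5): RP_i
```

A volunteer whose samples run much slower than the failure-free estimate gets a high AFR and hence a low RP for tight deadlines.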
3.2. Fuzzy Inference Process

A fuzzy set expresses the degree to which an element belongs to a set. The characteristic function of a fuzzy set is allowed to have values between 0 and 1, which denote the degree of membership of an element in a given set. For the transformation from the requirements (i.e., correctness and deadline) of an application to fuzzy sets, we provide the empirical membership functions in Figure 2. In Fig. 2(a), the fuzzy sets for trust probability are determined by grouping the trust range [0, 1] of a volunteer. If the TP value of a volunteer approaches 0, the volunteer behaves almost maliciously, whereas a value of 1 denotes a fully trusted resource. Fig. 2(b) shows the membership function for the five levels of RP_i. If RP is nearly 0, the volunteer is extremely unlikely to return a result within a given time; conversely, if RP is almost 1, it will return a result within the deadline. In this paper, the fuzzy inference process determines the replication number of each volunteer from TP_i and RP_i. These two parameters are combined by fuzzy rules with the replication degree in order to return correct results within a given time. We set up RD_i in the range [1, 5] according to our previous experience. We chose 5 as the maximum number of replicas because the number of replicas never exceeded 5 in the experiments we conducted over the past month. As shown in Figure 3, we can infer the fuzzy set from each volunteer's replication degree (RD), which represents the number of redundant copies. Here, each set represents computational redundancy. For example, a task assigned to volunteers belonging to the medium set is executed with a redundancy of three. The set is determined according to fuzzy inference rules such as the following:

RULE 1: IF TP_i is very high and RP_i is very high, THEN RD_i is very good.
RULE 2: IF TP_i is high and RP_i is high, THEN RD_i is good.
...

The fuzzy inference rules determine the degree of redundancy, making use of the TP and the RP, in order to guarantee correctness and completion within a given deadline. In a Grid system, the membership functions and rules should be chosen according to the application and user requirements.
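A minimal sketch of this inference step follows. The paper's membership functions are empirical (Fig. 2) and only two rules are listed, so the uniform level boundaries, the "take the weaker of the two levels" rule completion, and the exact replica count per fuzzy set (only medium -> 3 is stated) are our assumptions.

```python
LEVELS = ["very low", "low", "medium", "high", "very high"]

def level(p):
    """Crisp stand-in for the empirical membership functions of Fig. 2:
    map a probability in [0, 1] to one of five levels. Uniform bins are an
    assumption; the paper uses empirical fuzzy membership functions."""
    return LEVELS[min(int(p * 5), 4)]

# Replication degree RD in [1, 5]: trustworthy, responsive volunteers
# ("very good") need few replicas; unreliable ones ("very bad") need many.
# The paper states only that the medium set runs with redundancy three.
RD_BY_SET = {"very good": 1, "good": 2, "medium": 3, "bad": 4, "very bad": 5}

def replication_degree(tp, rp):
    """Combine the TP and RP levels into a fuzzy set and replica count.
    Taking the weaker of the two levels is our guess at how the elided
    rules ("...") continue; RULE 1 and RULE 2 are reproduced exactly."""
    rank = min(LEVELS.index(level(tp)), LEVELS.index(level(rp)))
    fuzzy_set = ["very bad", "bad", "medium", "good", "very good"][rank]
    return fuzzy_set, RD_BY_SET[fuzzy_set]
```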
4. Simulation and Performance Evaluation

In this section, we evaluate the performance of our FARM. To show the efficiency of the mechanism, we analyze it in terms of correctness and turnaround time, and compare its performance with that of both the VST and TST mechanisms.
4.1. Distribution of Volunteers

In the performance evaluation, we obtained the distribution shown in Fig. 4 from the log data of each volunteer over one week in the Korea@Home DGS. In this figure, the trust probability on the x-axis is the probability of correct results being returned by each volunteer to the coordinator through autonomous sampling. The y-axis represents the result-return
probability, which is the probability of returning the result before the deadline. The boxes in Fig. 4 represent the fuzzy sets, i.e., the redundancy groups of volunteers determined by the fuzzy rules (refer to the fuzzy inference process in Section 3). As shown in this figure, the volunteers were classified into five fuzzy sets: very good, good, medium, bad, and very bad.

Figure 2. Membership functions for five levels of TP_i (a) and RP_i (b); each panel partitions the range [0, 1] into the fuzzy sets very low, low, medium, high, and very high.

Figure 4. Volunteer distribution according to result-return probability and trust probability. The boxes show each fuzzy set (very good, good, medium, bad, very bad) obtained through the fuzzy inference process.
4.2. Comparison of Result Verification Mechanisms

We have compared our FARM with the other result verification mechanisms (VST and TST). For the VST mechanism, we used five replicas per task in a work group. In the TST mechanism, a volunteer is assigned a randomly selected task on the basis of its trust set. For this comparison, turnaround time and resource utilization were measured for different numbers of tasks. Figure 5(a) shows the resource utilization of the three result verification mechanisms. In this figure, we can observe that
Figure 5. Comparison of result verification mechanisms: (a) resource utilization and (b) turnaround time, plotted for 10, 100, and 1000 tasks.
FARM allocates a task to resources that can complete it within a given deadline. As described previously, the other mechanisms do not consider the deadline in the resource selection phase. Accordingly, the results of some tasks may not be returned within the deadline, and such tasks must be reallocated to other resources. That results in an increase in turnaround time. From the results of Fig. 5, we can see that our FARM has a faster turnaround time than the other result verification mechanisms, with relatively low resource redundancy.
Figure 3. Membership functions for different levels of RD_i: the fuzzy sets very bad through very good over replication degrees 1 to 5.
our FARM consumes slightly fewer resources than the TST, and is far more efficient than the VST. This means that our mechanism reduces the reallocation cost because it selects resources with high RP and TP within the computation group. As a result, the FARM makes remarkably better use of resources than the VST and TST mechanisms. Meanwhile, Figure 5(b) shows the turnaround time of the three result verification mechanisms. In this figure, we observe that as the number of tasks increases, the turnaround time grows. This is not surprising, because the VST and TST mechanisms do not use a list scheduling method based on factors such as result-return probability and trust probability. Nevertheless, our FARM has the fastest turnaround time of the three mechanisms. This was expected, as the scheduler in our
5. Conclusion and Future Work

We have proposed a fuzzy-based adaptive replication mechanism that supports sabotage tolerance for deadline tasks in desktop grid systems. In the mechanism, the concept of replication groups was introduced to deal with the dynamic nature of volunteers in the result verification phase. Result-return probability and trust probability were used as the criteria for organizing replication groups. Based on the fuzzy inference process, five fuzzy sets were presented, which are applied differently according to the volatility and trustworthiness of the volunteers. Using these concepts, our result verification mechanism can guarantee that tasks return correct results within a deadline. Performance was evaluated through simulation of our FARM and the VST and TST mechanisms, from the viewpoints of turnaround time and resource utilization. The results showed that our FARM is superior to the other two mechanisms in terms of turnaround time, with relatively low resource redundancy.
Acknowledgment

This work was supported by the Korea Research Foundation Grant funded by the Korean Government (MOEHRD, Basic Research Promotion Fund).
References

[1] M. O. Neary and P. Cappello, "Advanced Eager Scheduling for Java Based Adaptively Parallel Computing," Concurrency and Computation: Practice and Experience, Vol. 17, Iss. 7-8, pp. 797-819, Feb. 2005.
[2] D. Molnar, "The SETI@home Problem," http://turing.acm.org/crossroads/columns/onpatrol/september2000.html
[3] L. Sarmenta, "Sabotage-Tolerance Mechanisms for Volunteer Computing Systems," Future Generation Computer Systems, Vol. 18, No. 4, pp. 561-572, Mar. 2002.
[4] C. Germain-Renaud and N. Playez, "Result Checking in Global Computing Systems," Proc. of the 17th Annual Int. Conf. on Supercomputing, pp. 226-233, June 2003.
[5] C. Germain, G. Fedak, V. Neri, and F. Cappello, "Global Computing Systems," Lecture Notes in Computer Science, Vol. 2179, pp. 218-227, 2001.
[6] Korea@Home homepage, http://www.koreaathome.org/eng
[7] F. Azzedin and M. Maheswaran, "A Trust Brokering System and Its Application to Resource Management in Public-Resource Grids," Proc. of the 18th Int. Parallel and Distributed Processing Symposium, pp. 22-31, April 2004.
[8] W. Du, J. Jia, M. Mangal, and M. Murugesan, "Uncheatable Grid Computing," Proc. of the 24th Int. Conf. on Distributed Computing Systems, pp. 4-11, 2004.
[9] S. Zhao, V. Lo, and C. G. Dickey, "Result Verification and Trust-Based Scheduling in Peer-to-Peer Grids," Proc. of the 5th IEEE Int. Conf. on Peer-to-Peer Computing, pp. 31-38, Sept. 2005.
[10] SETI@home homepage, http://setiathome.ssl.berkeley.edu
[11] S. Choi, M. Baik, H. Kim, E. Byun, and C. Hwang, "Reliable Asynchronous Message Delivery for Mobile Agent," IEEE Internet Computing, Vol. 10, Iss. 6, pp. 16-25, Dec. 2006.
[12] D. Kondo, M. Taufer, C. L. Brooks, H. Casanova, and A. Chien, "Characterizing and Evaluating Desktop Grids: An Empirical Study," Proc. of the 18th Int. Parallel and Distributed Processing Symposium, pp. 26-35, April 2004.
[13] J. Dongarra, "Performance of Various Computers Using Standard Linear Equations Software," ACM SIGARCH Computer Architecture News, Vol. 20, pp. 22-44, June 1992.
[14] M. Castro and B. Liskov, "Practical Byzantine Fault Tolerance," Proc. of the Symposium on Operating Systems Design and Implementation, pp. 173-186, Feb. 1999.
[15] D. P. Anderson, "BOINC: A System for Public-Resource Computing and Storage," Proc. of the 5th IEEE/ACM Int. Workshop on Grid Computing, pp. 4-10, Nov. 2004.
[16] distributed.net homepage, http://www.distributed.net
[17] Entropia homepage, http://www.entropia.com
[18] L. F. G. Sarmenta and S. Hirano, "Bayanihan: Building and Studying Volunteer Computing Systems Using Java," Future Generation Computer Systems, Vol. 15, Iss. 5-6, pp. 675-686, Oct. 1999.
[19] O. Lodygensky, G. Fedak, F. Cappello, V. Neri, M. Livny, and D. Thain, "XtremWeb & Condor: Sharing Resources Between Internet Connected Condor Pools," Proc. of the 3rd IEEE/ACM Int. Symposium on Cluster Computing and the Grid: Workshop on Global and Peer-to-Peer Computing on Large Scale Distributed Systems, pp. 382-389, May 2003.
[20] A. Baratloo, M. Karaul, Z. Kedem, and P. Wijckoff, "Charlotte: Metacomputing on the Web," Future Generation Computer Systems, Vol. 15, Iss. 5-6, pp. 559-570, Oct. 1999.
[21] J. Sonnek, M. Nathan, A. Chandra, and J. Weissman, "Reputation-Based Scheduling on Unreliable Distributed Infrastructures," Proc. of the 26th Int. Conf. on Distributed Computing Systems (ICDCS'06), p. 30, July 2006.
[22] D. Kondo, F. Araujo, P. Malecot, P. Domingues, L. M. Silva, G. Fedak, and F. Cappello, "Characterizing Result Errors in Internet Desktop Grids," Lecture Notes in Computer Science, Vol. 4641, pp. 361-371, Aug. 2007.
[23] M. Taufer, D. Anderson, P. Cicotti, and C. L. Brooks III, "Homogeneous Redundancy: A Technique to Ensure Integrity of Molecular Simulation Results Using Public Computing," Proc. of the 19th IEEE Int. Parallel and Distributed Processing Symposium, Workshop 1, p. 119a, April 2005.
Implementation of a Grid Performance Analysis and Tuning Framework using Jade Technology Ajanta De Sarkar, Dibyajyoti Ghosh, Rupam Mukhopadhyay and Nandini Mukherjee Department of Computer Science and Engineering, Jadavpur University, Kolkata 700 032, West Bengal, India. Abstract – The primary objective in a Computational Grid environment is to maintain performance-related Quality of Service (QoS) at the time of executing the jobs. For this purpose, it is required to adapt to the changing resource usage scenario based on analysis of real-time performance data. The objectives of this paper are (i) to present a framework for performance analysis and adaptive execution of jobs, (ii) to focus on the object-oriented implementation of a part of the framework and (iii) to demonstrate the effectiveness of the framework by carrying out some experiments. The framework deploys autonomous agents, each of which can carry out its responsibilities independently.
done using the Jade framework. Jade agents can be active at the same time on different nodes in the Grid, and they can interact with each other without incurring much overhead in real time. The current implementation of the agent framework deals with parallel Java programs written in JOMP 1.
Keywords: Grid, hierarchical agent framework, performance properties, Jade technology.
Organization of this paper is as follows. Related work is discussed in Section 2. The hierarchical organization of analysis agents is briefed in Section 3. The concepts related to performance properties and their relevance in the current work are discussed in Section 4. Severity computation for each property is introduced in Section 5. Section 6 presents Jade framework-based implementation of a part of the agentbased system. Section 7 and Section 8 give details of the experimental setup and results. Section 9 concludes with a direction of the future work.
1 Introduction
Performance monitoring of any application in a Grid is always complex and challenging. In the absence of precise knowledge about the availability of resources at any point of time, and due to the dynamic resource usage scenario, prediction-based analysis is not possible. Thus, performance analysis in a Grid must be characterized by dynamic data collection (as performance problems must be identified during run-time), data reduction (as the amount of monitoring data is large), low-cost data capturing (as overhead due to instrumentation and profiling may degrade application performance), and adaptability to a heterogeneous environment. In order to address the above issues, a hierarchical agent framework has already been proposed in [3] and [4]. The novelty of the framework is that it considers the execution performance of multiple jobs (possibly components of an application, or multiple applications) running concurrently in a Grid and aims at maintaining the overall performance of these jobs at a predefined QoS level. Moreover, unlike other traditional Grid performance monitoring systems, this framework supports adaptation of the jobs to changing resource usage scenarios, either by enabling local tuning actions or by migrating them onto a different resource provider (host). This paper refines the design of the framework presented in [3] and [4] and uses the concept of performance properties introduced in [5]. The paper also presents an object-oriented implementation of a part of the agent framework. Implementation has been
2 Related Work
Grid performance tools, such as SCALEA-G [15] and ASKALON [16], are usually based on the Grid Monitoring Architecture (GMA). GMA provides an infrastructure based on OGSA and supports performance analysis of a variety of Grid services, including computational resources, networks, and applications. Performance tools associated with ICENI [7] and GrADS [9] also focus on performance monitoring of applications running in a Grid environment. The ICENI project highlights a component framework in which performance models are used to improve scheduling decisions. The GrADS project focuses on building a framework for both preparing and executing applications in a Grid environment [10]. Each application has an application manager, which monitors the performance of that application for QoS achievement. GrADS monitors resources using NWS [17] and uses Autopilot for performance prediction [12]. Unlike the previous systems, our work takes into account a situation where multiple jobs are executing concurrently on different resource providers. The overall performance of all these jobs needs to be considered and monitored in order to maintain a predefined QoS level. Thus, we use a hierarchical agent structure, which is comparable to that of the Peridot project [6]. However, unlike the framework in Peridot, here 1
JOMP implements an OpenMP-like set of directives and library routines for shared memory parallel programming in Java.
different categories of Analysis Agents (with various sub-goals) are used. Moreover, if performance degrades, an affected job is locally tuned (or migrated) in order to achieve the QoS level.
3 Hierarchical Organization of Analysis Agents
The hierarchical agent framework is part of a multi-agent system [14], which supports performance-based resource management for multiple concurrent jobs executing in a Grid environment. Within the system, a group of interacting, autonomous agents work together towards a common goal. Altogether six types of agents are used: Broker Agent, ResourceProvider Agent, JobController Agent, JobExecutionManager Agent, Analysis Agent, and Tuning Agent. The functions of these agents and the interactions among them have been thoroughly discussed in [14]. This work deals with the last three agents, namely the JobExecutionManager Agent, the Analysis Agent, and the Tuning Agent.
4 Performance Properties and Agents
The main focus of our work is to capture the runtime behaviour of a job and modify its behaviour during execution so that the QoS requirement of the job, as laid down in the SLA, is met. The design of the NA and TA therefore depends on the concepts related to performance properties, which have been thoroughly discussed in [5]. A performance property characterizes a specific performance behaviour of a program and can be checked by a set of conditions. Every performance property is associated with a severity figure, which shows how important the performance property is in the context of the performance of an application. When the severity figure crosses a pre-defined threshold value, the property becomes a performance problem, and the performance property with the highest severity value is considered to be a performance bottleneck.
As envisaged in [11], a Grid may be considered to have a hierarchical structure, with different types of Grid resources, such as clusters, SMPs and even workstations of dissimilar configurations, positioned at its lowest level. All of these are tied together through a middleware layer. A Grid site comprises a collection of all these local resources, which are geographically located in a single site. All the Grid sites that mutually agree to share resources located in several sites form an enterprise Grid. An enterprise Grid provides support for multiple Grid resource registries and Grid security services across mutually distrustful administrative domains. In order to monitor the Grid at all these different levels (which is necessary for overall monitoring and execution planning of multiple concurrent jobs), a hierarchical organization of the Analysis Agents is proposed. Existing architectures (including GMA) do not extend support for such a hierarchical organization. However, in our work, we make use of multiple Analysis Agents and organize them in a hierarchy. Each Analysis Agent, at its own level of deployment, resembles the consumer in GMA, while the JobExecutionManager Agent (JEM Agent) is the producer of the events in which the Analysis Agents are interested.
In our system, the resource provider publishes the policies and defines the specific performance properties to be checked for detecting any kind of performance bottleneck at the time of execution of a job. For each performance property, the evaluation process is also defined and stored for later use by the NA. In addition to the OpenMP properties described in [5], we have defined a new performance property, namely Inadequate Resources. This property identifies the problem of executing a portion of a job sequentially (or on fewer processors) when additional processors are required in order to maintain the QoS.
In accordance with the above understanding, the Analysis Agents are divided into the following four logical levels of deployment in descending order: (1) Grid Agent (GA), (2) Grid Site Agent (GSA), (3) Resource Agent (RA) and (4) Node Agent (NA). At the four levels of agent hierarchy each agent has some specific responsibilities [3]. A block diagram presenting all the agents is shown in Figure 1. Among all the agents at different levels, the current work concentrates on the lowest level agents of the hierarchy, i.e. on Node Agents (NAs) and the Tuning Agents (TAs).
An NA monitors all the jobs running on a particular resource provider (an SMP or workstation). When execution of a job starts, the NA begins collecting monitoring data. On the basis of this data and the performance property specifications (which define the condition, confidence and severity for each property), it evaluates the severities of the specified performance properties (using the process specifications) and generates a Performance Details Report (Figure 2). In order to evaluate the severity of each performance property, the NA consults the SLA and the JobMetaData (discussed later) sent by the client.
Figure 1: Block diagram of Agents
A Tuning Agent (TA), which is invoked by the NA, is responsible for performing any local tuning action so that any performance bottleneck or performance problem can be removed. The TA accepts the Performance Details Report from the NA and decides what actions are to be taken after consulting the Performance Property Specification and the Performance Tuning Process Specification. The Performance Tuning Process Specification stores recommendations regarding the actions that may be taken for every performance problem, depending upon its severity. The process specifications are basically an expert's knowledge base, which may be created and stored on a particular resource provider. The TA generates a Performance Tuning Report (Figure 3) and sends it to the GSA for future reference.
the BSLA and the JobMetaData, which also comes as a part of the BSLA. The BSLA provides detailed information about the requirements of the job and the availability of resources, while the JobMetaData contains information regarding significant parts (such as loops) of the job [3].
Figure 3: Design for an Integrated Tuning Agent

When a job is submitted onto a host, a JEM Agent, which is a mobile agent, is associated with it and is deployed on the same host [14]. For each performance property, the JEM Agent appropriately instruments the job, gathers performance data at the time of its execution, and sends the data to the NA for computation of the severity figures.

Figure 2: Design for an Integrated Node Agent

We create subclasses of the process specifications for every subclass of performance property. Thus, in our current implementation, InadequateResourcesProcessSpecification is a subclass of the class PerformancePropertyProcessSpecification, which specifies the analysis process of the Inadequate Resources property, and LoadImbalanceProcessSpecification is another subclass, which specifies the analysis process of the Load Imbalance property. Similarly, InadequateResourcesTuningProcessSpecification is a subclass of the class PerformanceTuningProcessSpecification that specifies the tuning process for the property Inadequate Resources, and LoadImbalanceTuningProcessSpecification is another subclass of the same class that specifies the tuning process for the property Load Imbalance. A class diagram containing the agent classes and the specification classes is shown in Figure 4.
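The subclass structure described above can be sketched as follows. These are Python stand-ins for what is presumably Java in the actual implementation; only the class names come from the text, and the `property_name` attribute is an illustrative assumption.

```python
# Sketch of the specification class hierarchy: one analysis-process and one
# tuning-process subclass per performance property.

class PerformancePropertyProcessSpecification:
    """Base class: defines how the severity of a property is analyzed."""
    property_name = None

class InadequateResourcesProcessSpecification(PerformancePropertyProcessSpecification):
    property_name = "Inadequate Resources"

class LoadImbalanceProcessSpecification(PerformancePropertyProcessSpecification):
    property_name = "Load Imbalance"

class PerformanceTuningProcessSpecification:
    """Base class: recommends tuning actions for a detected problem."""
    property_name = None

class InadequateResourcesTuningProcessSpecification(PerformanceTuningProcessSpecification):
    property_name = "Inadequate Resources"

class LoadImbalanceTuningProcessSpecification(PerformanceTuningProcessSpecification):
    property_name = "Load Imbalance"
```

Pairing each property with its own analysis and tuning subclass lets the NA and TA dispatch on the property type without knowing the evaluation details.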
5 Severity Computation
At the time of allocating a job to a specific resource provider, a service level agreement (SLA) is established between the resource provider and the client. During this process, the client first sends a Task Service Level Agreement (TSLA) to the Broker Agent, and the Resource Provider Agent sends a Resource Service Level Agreement (RSLA) to the Broker Agent. When a specific resource provider is selected for a job [13], the SLA is finalized and the client sends a Binding Service Level Agreement (BSLA) to the Resource Provider Agent. The NA in this environment uses
A BSLA contains the expected completion time (Tect) of a job. The JobMetaData contains information about each loop, such as the start line and end line of a loop, its proportionate execution time (Lfrac) with respect to the total execution time, etc. These are all based on some pre-execution analysis or historical information about the execution of a job. These data are later used for computing the severity of each performance property and deciding which one of these is a problem. In the case of the Inadequate Resources property, the severity figure for a specific loop Li is given by

    sev_resr(Li) = Tfact_f(Li) / Tfect_f(Li)        (1)

where the expected completion time of the f portion of the specific loop Li is given by Tfect_f(Li), which is computed using the information in the BSLA and JobMetaData. Thus,

    Tfect_f(Li) = Lfrac_i * Tect * f        (2)
The actual completion time of the f portion of the specific loop Li (as measured by the JEM Agent during execution) is given by Tfact_f(Li). In the case of the Load Imbalance property, execution times on each processor are measured. Thus, the severity figure for a specific loop Li is given by

    sev_load(Li) = [(Tmax_f(Li) - Tavg_f(Li)) * 100] / Tmax_f(Li)        (3)
where Tmax_f(Li) is the maximum time spent by a processor while executing the f portion of the loop, and Tavg_f(Li) is the average taken over all the processors executing the f portion of the loop.
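Eqs. (1)-(3) of this section translate directly into code. The sketch below follows the formulas term by term; only the argument names (stand-ins for Tfact_f(Li), Lfrac_i, Tect, f and the per-processor times) are ours.

```python
def sev_resr(t_fact, l_frac, t_ect, f):
    """Eq. (1): severity of the Inadequate Resources property for loop Li,
    with Tfect_f(Li) = Lfrac_i * Tect * f from Eq. (2)."""
    t_fect = l_frac * t_ect * f
    return t_fact / t_fect

def sev_load(per_processor_times):
    """Eq. (3): severity of the Load Imbalance property for loop Li,
    from the measured time on each processor for the f portion."""
    t_max = max(per_processor_times)
    t_avg = sum(per_processor_times) / len(per_processor_times)
    return (t_max - t_avg) * 100.0 / t_max
```

A sev_resr value above 1 means the f portion ran slower than the BSLA-derived expectation; sev_load is 0 when all processors finish together.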
6 Jade Implementation of the Agents
The entire multi-agent framework [14] has been implemented using Java RMI technology. A part of our hierarchical analysis agent framework has been implemented using the Jade framework [1]. Jade is a Java agent development framework, built with Java RMI registry facilities. It provides standard agent technologies. The agent platform can be distributed over several hosts, each of which executes one Java Virtual Machine. Jade follows the FIPA standard, and thus its agent communication platform is FIPA-compliant. It uses the Agent Communication Language (ACL) for efficient communication in a distributed environment. The Jade framework supports agent mobility, and agents can execute independently and in parallel on different network hosts. Jade uses one thread per agent. Agent tasks or agent interactions are implemented through logical execution threads. These threads, or behaviours of the agents, can be initialized, suspended and spawned at any given time. In Jade, multiple agents can interact with each other using their own containers [1]. Containers are the actual environments for each agent. Typically, multiple agents can be active at the same time on different nodes with various containers, but there is only one central agent, and it needs to start first. In our implementation, the NA is first initiated as the central agent, which coordinates with the other agents. In a Grid environment, it is important that performance analysis be done at run-time and that tuning action be taken at run-time without incurring much overhead. Thus, the active Jade agents on a Grid resource cooperate with each other and interact in order to detect performance problems in real time. Initially, the performance properties that will be checked by the NA, and the priority of checking these properties, are decided on the basis of the nature of a job and the kind of resource provider. A job is instrumented depending on the types of data the Analysis Agent requires to collect.
As soon as the job starts its execution, JEM Agent communicates with the corresponding NA sitting at that particular node. NA receives the BSLA and the JobMetaData from the JEM Agent along with a ‘Ready’ message. It then sends a ‘Query’ message to the JEM Agent with a fraction value (e.g. 0.05) indicating the portion of a significant block of the job (in the current implementation, a significant loop) to be executed before collecting any performance data. The job starts and continues its execution up to the specified percentage or fraction of the significant block. After completing this portion, JEM Agent sends execution performance data to the NA and suspends the job.
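The 'Ready'/'Query' handshake just described can be sketched as a toy simulation. The message names follow the text; the job model, class names and BSLA contents are hypothetical stand-ins for the real instrumented Java jobs (a minimal sketch, not the paper's Jade implementation).

```python
# Toy simulation of the JEM Agent / Node Agent handshake described above.

class NodeAgent:
    def __init__(self, fraction=0.05):
        self.fraction = fraction  # portion of the significant loop per step

    def on_ready(self, bsla, job_metadata):
        # NA stores the agreement and asks for the first measured portion.
        self.bsla = bsla
        self.meta = job_metadata
        return ("Query", self.fraction)

class JEMAgent:
    def __init__(self, job_iterations):
        self.total = job_iterations
        self.done = 0
        self.suspended = False

    def run_portion(self, fraction):
        # Execute the requested fraction of the significant loop, then
        # suspend the job and hand performance data to the NA.
        step = int(round(self.total * fraction))
        self.done += step
        self.suspended = True
        return {"iterations_done": self.done, "suspended": self.suspended}

na = NodeAgent()
jem = JEMAgent(job_iterations=1000)
msg, frac = na.on_ready(bsla={"deadline": 300}, job_metadata={"loop": "L1"})
perf = jem.run_portion(frac)
print(msg, frac, perf["iterations_done"])  # Query 0.05 50
```

After this exchange the job sits suspended until the NA's analysis (next paragraph) tells it to resume.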
Figure 4: Class Diagram for Node Agent and Tuning Agent

The NA immediately starts analyzing the data on the basis of the commitment included in the BSLA. While doing so, the NA computes the severity of a specific performance property. If the computed severity is greater than some threshold value, the NA sends a 'Warning' message to the JEM Agent and invokes the TA. The NA also sends a 'Performance_Details' message (an XML form of the Performance Details Report) to the TA, identifying the performance problem and its severity along with other details. The TA decides on a tuning action and directs the JEM Agent to resume the job after applying it. If no performance problem is detected by the NA, the job resumes and continues its execution. The next performance property (according to the priority list) is checked after the job continues execution for another fraction of the significant block and data is collected as before.

It is possible that multiple jobs are executing on the same host (particularly on an SMP system). Consequently multiple JEM Agents are associated with these jobs, although there is only one NA. Communication between all the JEM Agents and the sole NA may continue until the NA starts analyzing data for a particular job. Since a job remains suspended during the analysis, it is desirable that the analysis time be as short as possible, so the NA does not entertain any communication during this time; communication starts again when the analysis is over. Interactions among the agents during the analysis and local tuning process are depicted in the sequence diagram in Figure 5.
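The NA's decision step after each measurement can be sketched as below. The threshold value and the report fields are illustrative assumptions; only the message names ('Warning', 'Performance_Details') come from the text.

```python
# Decision step the NA applies after each measurement: compute the
# property's severity and, above a threshold, warn the JEM Agent and
# hand a performance-details report to the TA.

def analyze(severity, threshold=0.2):
    if severity > threshold:
        return {"to_jem": "Warning",
                "to_ta": {"msg": "Performance_Details", "severity": severity}}
    # No problem detected: the job simply resumes.
    return {"to_jem": "Resume", "to_ta": None}

print(analyze(0.45))  # warns the JEM Agent and invokes the TA
print(analyze(0.05))  # {'to_jem': 'Resume', 'to_ta': None}
```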
Int'l Conf. Grid Computing and Applications | GCA'08 |
The current implementation covers the specifications, computations and tuning related to the Inadequate Resources Property and the Load Imbalance Property. The next sections provide experimental results demonstrating the analysis and tuning of these properties.
7 Experimental Setup
A local Grid test bed has been set up that consists of heterogeneous nodes running Linux. The computational nodes of the test bed include an HP NetServer LH 6000 with 2 processors, an HP ProLiant ML570 G4 with 4 Intel Core2 Duo processors, Intel Core2 Duo PCs and Intel Pentium 4 PCs. The nodes communicate through a fast local area network. The Grid test bed is built with Globus Toolkit 4.0 (GT4) [8], and the multi-agent system (which includes the hierarchical analysis agent framework) is implemented on top of GT4. The agents (NA, JEM Agent and TA) are deployed on every node.

Figure 5: Agent Interaction

This paper demonstrates the results of local performance tuning of multiple applications running on the same node; a single NA is responsible for analyzing the performance of multiple jobs submitted to that node and running simultaneously. Experiments have been carried out on the HP ProLiant ML570 G4 with 4 Intel Core2 Duo processors (referred to as the HPserver) in the Grid environment. We have used Java codes with JOMP [2] directives as test codes, and executed them in the Jade framework. When a particular performance problem is identified by the NA, the TA decides to tune the code locally (if possible, or otherwise to migrate it). Because we have considered only two performance properties, the TA either provides additional processors to execute the job (to overcome the Inadequate Resources performance problem) or changes the scheduling strategy of the parallel regions of the job (in the case of the Load Imbalance performance problem). For example, if the job initially executes on p (>= 1) processors, performance enhancement may be achieved by providing more processors at run-time: the TA decides whether the job will continue with p processors or with q (> p) processors, and resumes the job after allocating the additional resources.
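The TA's two possible tuning actions described above reduce to a simple decision rule. The function name, the four-processor cap and the return structure are our own illustrative assumptions (a sketch of the rule, not the paper's TA implementation).

```python
# Sketch of the Tuning Agent's decision rule as described in the text:
# Inadequate Resources -> grant more processors; Load Imbalance ->
# switch the parallel regions to a dynamic schedule.

def tuning_action(problem, current_procs, max_procs=4):
    if problem == "InadequateResources" and current_procs < max_procs:
        return {"procs": max_procs, "schedule": None}
    if problem == "LoadImbalance":
        return {"procs": current_procs, "schedule": "dynamic"}
    return {"procs": current_procs, "schedule": None}  # resume unchanged

print(tuning_action("InadequateResources", 1))  # {'procs': 4, 'schedule': None}
print(tuning_action("LoadImbalance", 4))
```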
8 Results
Two sets of experiments have been carried out: the first detects the Inadequate Resources performance property, and the second detects both the Inadequate Resources and Load Imbalance performance properties according to a predefined order. The following two subsections present the results of these experiments.
8.1 Detecting Inadequate Resources Property
In this case, we have experimented with two different test codes. However, in order to demonstrate the effectiveness of our system, we submitted multiple copies of the same code as separate jobs, initiated at different times. Two experiments have been carried out; in each, two Matrix Multiplication jobs and two Gauss Elimination jobs were executed. These jobs started at different times, but major portions of them executed concurrently. Data sizes differ between jobs (1000 and 2000 in Case I, and 3000 and 4000 in Case II). In both cases, experimental data were collected in the following three scenarios: Scenario 1 occurs if the system continues with the job as submitted by the client, Scenario 2 occurs if tuning actions are taken at run-time (based on our algorithm), and Scenario 3 is the result of running the job in the situation which best fits a specific job and a specific resource provider.

Scenario 1: Test codes execute on the HPserver with one processor, without any interaction with the NA and TA.

Scenario 2: Test codes execute initially on one processor and interact with the NA. After computing a certain fraction of the significant loop (here 0.05), the NA detects the Inadequate Resources performance problem and the TA tunes each job to run its remaining part on four processors of the HPserver.

Scenario 3: Test codes execute entirely on four processors of the HPserver, without interaction with the NA and TA.

Figures 6(a) and 6(b) compare the execution times of the Matrix Multiplication test codes in the above three scenarios. The results demonstrate that the performances obtained in Scenario 2 and Scenario 3 are almost the same, which signifies that the overhead for run-time analysis and tuning is nominal. In the case of Gauss Elimination, we obtained similar results.

[Figure 6(a): Performance Improvement for Test Case I. Execution times (ms, up to about 300000) of jobs jMM1 (data size 1000) and jMM2 (data size 2000) under the three scenarios.]

[Figure 6(b): Performance Improvement for Test Case II. Execution times (ms, up to about 3500000) of jobs jMM3 (data size 3000) and jMM4 (data size 4000) under the three scenarios.]

Figures 7(a) and 7(b) show the overheads associated with the execution of all four jobs (two Matrix Multiplications and two Gauss Eliminations) in Scenario 2. The times required for run-time analysis, tuning and communication among the agents were measured. These overheads are negligible compared to the performance improvement of the jobs, even though, when multiple jobs run concurrently on the same resource, only one NA is responsible for their performance analysis.

[Figure 7(a): Overhead Calculation for Test Case I. For jobs jMM1, jMM2, jGE1 and jGE2, the breakdown of execution time into the 5% portion run on 1 processor, the 95% portion run on 4 processors, and the total overhead.]

[Figure 7(b): Overhead Calculation for Test Case II. The same breakdown for jobs jMM3, jMM4, jGE3 and jGE4.]

8.2 Periodic Measurements for Detection of Properties

In this experiment, we demonstrate periodic measurement of performance data and the detection of more than one property according to a given priority. When the job starts executing, the first f portion of its significant block is executed and performance data related to a specific performance property is gathered. If a performance problem occurs, the TA takes a tuning action and resumes the job. After executing the next f portion of the same block, performance data is collected again and the second property is checked; if a performance problem is detected, the TA takes another tuning action. Thus, the job continues with periodic measurements of performance data and tuning of the job based on the analysis of these data.

An LU factorization job has been used for this experiment. Here the NA decides to check the Inadequate Resources Property first and then the Load Imbalance Property. The experiment compares the following three scenarios. As before, Scenario 1 occurs if the system continues with the job as submitted by the client, Scenario 2 occurs if tuning actions are taken at run-time (based on our algorithm), and Scenario 3 is the result of running the job in the situation which best fits a specific job and a specific resource provider.

Scenario 1: Test code executes entirely on the HPserver with one processor and a static schedule strategy, without any interaction with the NA and TA.

Scenario 2: Test code executes initially on one processor with a static schedule. After computing a certain fraction of the significant loop (here 0.05), the NA detects the Inadequate Resources problem and the TA tunes the job to run its remaining portion on four processors of the HPserver. After computing the next 0.05 portion of the significant loop, the NA detects the Load Imbalance problem and the TA tunes the job to run its remaining portion on four processors of the HPserver with a dynamic scheduling strategy.

Scenario 3: Test code executes entirely on the HPserver with four processors and a dynamic schedule, without any interaction with the NA and TA.

The results of running the job in the above three scenarios are depicted in Figure 8. It is clear from the figure that there is a significant improvement in Scenario 2 compared to Scenario 1, although not much overhead has been incurred (compared to Scenario 3) for the performance analysis and tuning of the job.
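The periodic measure-analyze-tune cycle used in this experiment can be sketched as a minimal simulation. The severity values, threshold and fraction below are illustrative inputs, not measured data, and the function is our own sketch rather than the paper's implementation.

```python
def run_with_periodic_tuning(properties, measure, fraction=0.05, threshold=0.2):
    """Execute a job fraction-by-fraction; after each fraction, check the
    next property in priority order and record a tuning action if its
    severity exceeds the threshold."""
    executed, tuned = 0.0, []
    pending = list(properties)            # priority order set by the NA
    while executed < 1.0:
        executed = min(1.0, executed + fraction)  # run one f portion
        if pending:
            prop = pending.pop(0)
            if measure(prop) > threshold:
                tuned.append(prop)        # TA takes an action, job resumes
    return executed, tuned

# Synthetic severities: both properties trip their checks in turn.
sev = {"InadequateResources": 0.6, "LoadImbalance": 0.35}
print(run_with_periodic_tuning(["InadequateResources", "LoadImbalance"], sev.get))
# (1.0, ['InadequateResources', 'LoadImbalance'])
```

This mirrors Scenario 2: first the processor count is tuned, then the schedule, while the rest of the job runs untouched.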
[Figure 8: Performance Improvement of LU. Execution time (min, 0 to 50) against data size (4000 to 10000) for the three scenarios.]
9 Conclusion
In this paper, we have presented the design and object-oriented specification for the implementation of the Node Agent in a hierarchical analysis agent framework for Grid environments. This framework is used for performance analysis of applications executing concurrently on distributed systems (such as Grids) and for dynamically improving their execution performance. The paper highlights the interaction and exchange of information among the agents for collecting data, analyzing it and improving performance through local tuning actions, and discusses the implementation of some of these agents. The results presented here demonstrate the effectiveness of the framework by showing the performance improvement obtained through tuning, and by showing that the agent control overheads are negligible even when multiple jobs are submitted concurrently to the same resource. In future work we shall experiment with more complex systems; we shall also incorporate more categories of performance properties and implement the other analysis agents in the hierarchy.
10 References

[1] Bellifemine, F., Poggi, A. and Rimassa, G., "Developing multi-agent systems with a FIPA-compliant agent framework", Software - Practice & Experience, 31: 103-128.
[2] Bull, J.M. and Kambites, M.E., "JOMP - an OpenMP-like interface for Java", Proceedings of the ACM 2000 Java Grande Conference, pp. 44-53, June 2000.
[3] De Sarkar, A., Kundu, S. and Mukherjee, N., "A Hierarchical Agent Framework for Tuning Application Performance in Grid Environment", Proceedings of the 2nd IEEE APSCC 2007, Tsukuba, Japan, December 11-14, 2007, pp. 296-303.
[4] De Sarkar, A., Roy, S., Biswas, S. and Mukherjee, N., "An Integrated Framework for Performance Analysis and Tuning in Grid Environment", Web Proceedings of the International Conference on High Performance Computing (HiPC '06), December 2006.
[5] Fahringer, T., Gerndt, M., Riley, G.D. and Traiff, J.L., "Formalizing OpenMP Performance Properties with ASL", Proceedings of the Third International Symposium on High Performance Computing (ISHPC), October 16-18, 2000, pp. 428-439.
[6] Furlinger, K., "Scalable Automated Online Performance Analysis of Applications using Performance Properties", Ph.D. Thesis, Technical University of Munich, Germany, 2006.
[7] Furmento, N., Mayer, A., McGough, S., Newhouse, S., Field, T. and Darlington, J., "ICENI: Optimization of Component Applications within a Grid Environment", Parallel Computing, 28(12): 1753-1772, 2002.
[8] Globus Toolkit 4.0, available at www.globus.org/toolkit.
[9] GrADS: Grid Application Development Software Project, http://www.hipersoft.rice.edu/grads/.
[10] Kennedy, K., et al., "Toward a Framework for Preparing and Executing Adaptive Grid Programs", Proceedings of the International Parallel and Distributed Processing Symposium Workshop (IPDPS NGS), IEEE Computer Society Press, April 2002.
[11] Kesler, J. Charles, "Overview of Grid Computing", MCNC, April 2003.
[12] Ribler, R.L., Simitci, H. and Reed, D.A., "The Autopilot Performance-Directed Adaptive Control System", Future Generation Computer Systems, 18(1), pp. 175-187, September 2001.
[13] Roy, S., Sarkar, M. and Mukherjee, N., "Optimizing Resource Allocation for Multiple Concurrent Jobs in Grid Environment", accepted for publication, Proceedings of the Third International Workshop on Scheduling and Resource Management for Parallel and Distributed Systems (SRMPDS '07), Hsinchu, Taiwan, December 5-7, 2007.
[14] Roy, S. and Mukherjee, N., "Utilizing Jini Features to Implement a Multi-agent Framework for Performance-based Resource Allocation in Grid Environment", Proceedings of GCA '06 - The 2006 International Conference on Grid Computing and Applications, pp. 52-58.
[15] Truong, H.-L. and Fahringer, T., "SCALEA-G: A Unified Monitoring and Performance Analysis System for the Grid", Scientific Programming, 12(4): 225-237, IOS Press, 2004.
[16] Wieczorek, M., Prodan, R. and Fahringer, T., "Scheduling of Scientific Workflows in the ASKALON Grid Environment", SIGMOD Record, 34(3), September 2005.
[17] Wolski, R., Spring, N.T. and Hayes, J., "The Network Weather Service: A Distributed Performance Forecasting Service for Metacomputing", Future Generation Computer Systems, 15(5), pp. 757-768, October 1999.
Optimizing Database Resources on the Grid

Prof. MSc. Celso Henrique Poderoso de Oliveira1 and Prof. Dr. Maurício Almeida Amaral2
1 FIAP – Faculdade de Informática e Administração Paulista, Av. Lins de Vasconcelos, 1264 – Aclimação – São Paulo – SP
2 CEETEPS – Centro Paula Souza, Rua dos Bandeirantes, 169 – Bom Retiro – São Paulo – SP
[email protected],
[email protected]
Abstract. Research and development activities relating to Grid Computing are growing in the academic field and have reached corporate organizations. It is now possible to join desktop computers in a grid environment, increasing the processing and storage capability of application systems. Although this has allowed rapid progress in building various aspects of Grid infrastructure, the integration of different resources, including databases, is fundamental. The use of relational databases, and the distribution of queries among them, is vital for developing consistent grid applications. This paper shows the planning, distribution and parallelization of database queries on grid computing, and how a single complex query can optimize the use of database resources.

Keywords: Database, Distribution, Grid Computing, Query, OGSA-DAI

1 Introduction
One of the main obstacles to using grid environments outside academic systems is that flat files are used more often than relational databases, and little middleware exists that integrates database resources into the grid. Most corporate organizations use relational or object-relational databases to store and manage data. This paper presents a research result that used an algorithm to plan, distribute and parallelize a SELECT statement over database management systems (DBMSs) on the grid. To achieve this goal, we used the main middleware that provides access to databases: Open Grid Services Architecture - Data Access and Integration (OGSA-DAI). OGSA-DAI receives the user request and submits it to the available database resources on the grid. The module we created intercepts the user request and, before OGSA-DAI submits it, parses the complex query into simple ones. It then asks OGSA-DAI for the available resources and finally submits the simple queries to the available databases. After the execution of each query, the module joins all the results into a single response file, which is sent back to the user. The process is transparent to the end user.
This paper is divided into four sections. The second presents the fundamentals of grids, planning and databases. The third section presents the results, and the last section lists the conclusions.
2 Grid, Plan and Database Fundamentals
Distributed processing is an important research area because the evolution of hardware components will not keep pace with the needs of data processing [3]. Distributed storage is important as well, for security reasons and because of the amount of data to be stored. Grid computing is a form of distributed processing that uses distributed resources, i.e. computers, storage and databases, which can be integrated and shared as if they were a single environment [4]. It uses middleware that balances data and processing load, provides access security and handles failures. The main aspects of a grid computing environment include decentralization and the heterogeneity of resources and services; the services and resources are used to provide data management and processing through the network. As resources are shared in a network, it is important to keep access under control and management. A group of organizations or people that share the same interests in a controlled way is called a Virtual Organization (VO); a VO generally uses the resources to achieve a specific goal [4].

Databases are important resources to use and share in a VO, since companies typically store their data in databases. Grid middleware can identify, locate and submit queries to databases, whether local or remote. One of the most important middleware systems used to manage databases on the grid is OGSA-DAI. Watson [11] establishes a proposal to integrate databases on the grid; some accepted parameters are grid and database independence, and the use of existing databases instead of creating new ones. Databases use some of the existing grid services, and it is important to use the same services [12]: security, database programming, web services integration, scheduling, monitoring and metadata.
2.1 Grid Database Services
Database services are grid services that use a database interface [5]. An Open Grid Services Architecture (OGSA) grid data service is a grid service that implements one or more access or management interfaces to distributed resources [5]. A grid data service uses one of four basic interfaces to control different behaviors of a database. These interfaces use specific Web Services Description Language (WSDL) ports, and when a service implements one of these interfaces it is known as an Open Grid Services Infrastructure (OGSI) Web Service. The Data Access and Integration Services (DAIS) group works to define the key database services that should be used on the grid; Data Access and Integration (DAI) implements these definitions on OGSA. The main goal of OGSA-DAI is to create an infrastructure to deploy high-quality grid database services [1].
2.2 Plan

A planning problem should have an initial state description, a partial goal description and the tasks that map the transition of elements from one state to another [6]. Algorithms, simple or complex, can be used to achieve the goal of a plan, and it is possible to take into account temporal aspects, uncertainties and optimization properties. In Artificial Intelligence, one planning problem is to apply tasks in a workflow and allocate resources to each task [2]. Each task component is modeled as a planning operator whose effects and preconditions express the input and output data relations. A partial-order plan is created by at least two tasks in a plan [9]; the order in which the tasks execute is not important. Sometimes there can be an ordering restriction if some data must be used in the next step of the plan, but this does not affect task parallelization because the link between data input and output is preserved. The great advantage of this planning technique is that it is possible to apply basic tasks that are independent of each other; when the independent tasks have finished, their results are joined to return the final result. Heuristics can be used to determine the best execution plan. In a VO there are various heterogeneous and limited resources that can be used, and some important decisive factors in scheduling database resources are resource utilization, response time, global and local access policy, and scalability [8].

In this paper, the planning operator is the SQL SELECT command. The planning determines the submission sequence of this command to different databases using a grid database service. As the complex SQL SELECT command is parsed into simple queries, these can be submitted in parallel.

3 Results

The goal was to create a service that builds an execution plan to select the available relational databases on the grid and submit queries based on the SELECT command. The OGSA-DAI middleware was chosen because it is the most widely used one; it is installed on the Globus Toolkit. The proposal was to develop a grid data service to be used in a virtual organization. The developed service uses recent extraction techniques and heuristics to solve query planning problems. The metrics used to define the heuristics for a database management system were: CPU, available memory, network bandwidth, and I/O volume of stored data. These data were extracted from the databases' metadata.
Most databases can parallelize and distribute queries, but this is done within the same environment and using the same product; when different database products are involved, many limitations apply. Cluster databases can do the same job, but they are centralized and supervised by a single control center. Grid computing aims to bypass this limitation: small databases can be installed on heterogeneous resources and ordinary desktop computers.
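The partial-order plan idea from Section 2.2, where tasks with no ordering constraints between them run concurrently and are joined at the end, can be sketched as a small simulation; the thread pool and toy tasks are illustrative assumptions, not the service's actual execution engine.

```python
# Toy partial-order execution: independent tasks run concurrently and
# their results are joined afterwards, preserving submission order.

from concurrent.futures import ThreadPoolExecutor

def run_partial_order_plan(independent_tasks):
    with ThreadPoolExecutor() as pool:
        # map() runs the tasks concurrently but yields results in order,
        # which serves here as the final "join" step of the plan.
        results = list(pool.map(lambda task: task(), independent_tasks))
    return results

tasks = [lambda: 1 + 1, lambda: 2 * 3, lambda: 10 - 4]
print(run_partial_order_plan(tasks))  # [2, 6, 6]
```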
3.1 Planning Phases
The planning system was divided into five main parts: (1) the user request is intercepted by the service; (2) the middleware is asked for the available databases; (3) the complex SQL query is parsed into simple queries; (4) each simple query is distributed to the middleware; and (5) the results are joined and sent back to the user. Figure 1 shows the planning inputs and outputs. OGSA-DAI identifies and communicates with the databases. The first information needed is the available resources and tables for executing each query command. After the parsing, the planning system establishes the execution plan, which is then submitted to the middleware; the middleware executes each query on the databases. The service keeps track of the submission, and in the final phase it receives all the rows, joins them and sends them to the user.
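The five phases above can be sketched end to end. The parser below, which splits a multi-table SELECT into one simple per-table query, is a deliberate toy simplification of the service's real SQL planning, and the round-robin assignment, function names and fake executor are our own assumptions.

```python
# Sketch of the five planning phases with a toy query parser.

def parse_complex_query(sql):
    # Phase 3: extract table names from the FROM clause and emit one
    # simple query per table (alias resolution, projections and join
    # handling are omitted in this sketch).
    from_clause = sql.lower().split(" from ")[1].split(" where ")[0]
    tables = [t.strip().split()[0] for t in from_clause.split(",")]
    return [f"select * from {t}" for t in tables]

def plan_and_execute(sql, available_dbs, execute):
    simple = parse_complex_query(sql)                 # phase 3
    assignment = [(q, available_dbs[i % len(available_dbs)])
                  for i, q in enumerate(simple)]      # phase 4: round-robin
    rows = [execute(db, q) for q, db in assignment]   # middleware executes
    return [r for part in rows for r in part]         # phase 5: join results

dbs = ["oracle10gR2", "mysql5", "oracle10gR1"]
fake_exec = lambda db, q: [(db, q)]  # stand-in for real query execution
sql = "select a.gene_id from lymph_map a, lymph_exp b where a.gene_id = b.gene_id"
print(plan_and_execute(sql, dbs, fake_exec))
```

With a two-table query, the sketch assigns one simple query to each of the first two databases, mirroring the behavior reported in the results section.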
[Figure 1 – Service Planning Inputs and Outputs: the Planning System sits between the user (needs in, XML result file out) and OGSA-DAI (metrics and resources in, XML result file out), which in turn communicates with the DBMS resources.]
The tests were run over the Shipp [10] genetics database. It is a complex model that stores a large amount of DNA data; each table has many attributes, and analyzing the data is a good challenge for our service. On the other hand, this model is easily understood and can be used in non-academic institutions. Figure 2 shows the entity-relationship model of this database.
Figure 2 – Entity-Relationship Model
There are four tables in this model: LYMPH_OUTCOME, LYMPH_EXP, LYMPH_MAP and LYMPH_TEMP. The first stores clinical result data and contains 58 rows. The second stores the genetic values for each different model; DNA sequences are represented in this table, which contains 7,129 rows. The third also has 7,129 rows and stores the gene identity. The fourth stores 413,482 rows and is the pivoted data of the other tables, used to store the original genetic expression. The queries used to test the algorithms are:
i)  select a.gene_id, b.gene_id, a.accession, b.sample_id
    from lymph_map a, lymph_exp b
    where a.gene_id = b.gene_id

ii) select a.gene_id, b.gene_id, a.accession, b.sample_id, c.sample_id, c.status
    from lymph_map a, lymph_exp b, lymph_outcome c
    where a.gene_id = b.gene_id and b.sample_id = c.sample_id

v)  select a.gene_id, a.accession, b.gene_id, b.sample_id, d.dlbc1
    from lymph_map a, lymph_exp b, lymph_temp d
    where a.gene_id = b.gene_id and a.gene_id = d.gene_id

vi) select a.gene_id, b.gene_id, a.accession, b.sample_id, c.sample_id, c.status, d.dlbc1
    from lymph_map a, lymph_exp b, lymph_outcome c, lymph_temp d
    where a.gene_id = b.gene_id and b.sample_id = c.sample_id and a.gene_id = d.gene_id

The main goal of these queries was to observe the service's planning behavior. With them it was possible to test the parsing phase of a complex SQL SELECT command, the parallel distribution of the resulting simple queries over the resources, and the response time. The queries use regular joins to test the ability to distribute queries over the available databases: query (i) uses two tables, queries (ii) and (v) use three tables, and query (vi) uses four tables.

There were only three available databases on the grid. Two servers were connected by a local network; an Oracle 10g R2 database was installed on one of them, and the other server hosted one MySQL 5.0 and one Oracle 10g R1 database. OGSA-DAI and Globus were installed on the same server. The best metrics were obtained for the OGSA-DAI server, because there was no need to use the network, and this server had the most available memory and the fastest processor. All databases were populated with the same tables and rows; we used replicated data because our goal was to test only the ability to distribute queries on the grid. Figure 3 shows the test results.
[Figure 3: Query Distribution on the available resources and processing time. For queries i (2 tables), ii (3 tables), v (3 tables) and vi (4 tables), the chart plots the number of database resources used (0 to 3.5) against the processing time (roughly 0:08:21 to 0:10:22).]
Figure 3 shows that resource usage in the virtual organization was optimized when the service was used. Based on this result, the more complex the query and the more databases available on the grid, the better the query distribution. As shown, the planning system used all the databases appropriate to each query: when the SQL SELECT command used only two tables, the planning service used two databases; when the query used three tables, the service used the three available databases; and when the query used four tables, the service used the maximum number of available databases (three) and waited for the first one to become available before submitting the last query.
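The scheduling behavior just described can be sketched as below; the wave-based assignment and names are our own illustrative assumptions about how one simple query per table is spread over the available databases, with surplus queries waiting for a database to free up.

```python
# Sketch of the observed scheduling behaviour: one simple query per table,
# spread over the available databases; with more queries than databases,
# the extra query waits for the first database to become free (wave 1).

def assign_queries(num_tables, databases):
    schedule = []  # (query_index, database, wave)
    for i in range(num_tables):
        schedule.append((i, databases[i % len(databases)],
                         i // len(databases)))
    return schedule

dbs = ["db1", "db2", "db3"]
print(assign_queries(2, dbs))  # two queries run in wave 0 on db1 and db2
print(assign_queries(4, dbs))  # the fourth query waits: wave 1 on db1
```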
Response times were best when all the available databases were used, i.e. for queries (ii) and (v). When there were more available databases than tables in the query, the response time did not improve. When there were more tables in the query than available databases (vi), the response time was almost the same as for (ii). This shows that, under the virtual organization concept, it is important to use as many resources as possible, which improves the response time of complex queries.
3.2 Related Work

There is similar work that distributes queries on OGSA-DAI. The goal of Distributed Query Processing (OGSA-DQP) is to implement a declarative, high-level language to access and analyze data, and it tries to simplify resource management on the grid using a set of services. The main problem with this solution is that it does not use all of the OGSA-DAI services. There are other problems as well: some databases cannot be used with DQP, and it relies on the Polar* partitioning service [7], which executes outside OGSA-DAI and uses the Object Query Language (OQL) instead of SQL, the standard language of relational databases.
4 Conclusion

It is possible to use a simple service to plan, parallelize and distribute complex SQL queries on the grid. The use of OGSA-DAI is an important feature, because many researchers are establishing the services that will be developed to integrate databases on the grid; thus it will be possible to aggregate new services into the middleware. Some important issues were addressed, such as the use of heterogeneous servers, operating systems and databases. In our tests we used standard SQL queries with joined tables and row restrictions. The main contribution of this paper is to show the utilization of different, remote databases in a virtual organization: when the service is used, resources are optimized and queries return faster than when using only one database. Ordinary desktop computers can be used to carry out tasks that would otherwise overload the database server.

5 References

[1] Alpdemir, M.N.; Mukherjee, A.; Paton, N.W.; Watson, P.; Fernandes, A.A.A.; Gounaris, A. and Smith, J. "Service-Based Distributed Querying on the Grid", 2003.
[2] Blythe, J.; Deelman, E. and Gil, Y. "Planning and Metadata on the Computational Grid". In: AAAI Spring Symposium on Semantic Web Services, 2004.
[3] Dantas, M. "Computação Distribuída de Alto Desempenho". Ed. Axcel, Rio de Janeiro, 2005.
[4] Foster, I.; Kesselman, C. and Tuecke, S. "The Anatomy of the Grid: Enabling Scalable Virtual Organizations". In: International J. Supercomputer Applications, 15(3), 2001.
[5] Foster, I.; Tuecke, S. and Unger, J. "OGSA Data Services". In: http://www.ggf.org, 2003. Accessed July 2005.
[6] Nareyek, A.N.; Freuder, E.C.; Fourer, R.; Giunchiglia, E.; Goldman, R.P.; Kautz, H.; Rintanen, J. and Tate, A. "Constraints and AI Planning". IEEE Intelligent Systems, 20(2): 62-72, 2005.
[7] OGSA-DAI. http://www.ogsadai.org.uk/about/ogsadqp/. Accessed February 2006.
[8] Ranganathan, K. and Foster, I. "Simulation Studies of Computation and Data Scheduling Algorithms for Data Grids". In: Journal of Grid Computing, 1, pp. 53-62, 2003.
[9] Russell, S. and Norvig, P. "Inteligência Artificial". Ed. Campus, Rio de Janeiro, 2004.
[10] Shipp, M.A. et al. "Diffuse Large B-Cell Lymphoma Outcome Prediction by Gene Expression Profiling and Supervised Machine Learning". In: Nature Medicine, 8(1), pp. 68-74, 2002.
[11] Watson, P.; Paton, N.; Atkinson, M.; Dialani, V.; Pearson, D. and Storey, T. "Database Access and Integration Services on the Grid". In: Fourth Global Grid Forum (GGF4), 2002. http://www.nesc.ac.uk/technical_papers/dbtf.pdf. Accessed July 2005.
[12] Watson, P. "Databases and the Grid". In: Grid Computing: Making the Global Infrastructure a Reality. John Wiley & Sons, 2003.
An Efficient Load Balancing Architecture for Peer-to-Peer Systems Brajesh Kumar Shrivastava Symantec Corporation brajeshkumar
[email protected]
Sandeep Kumar Samudrala,Guruprasad Khataniar,Diganta Goswami Department of Computer Science and Engineering Indian Institute of Technology Guwahati Guwahati - 781039, Assam, India {samudrala,gpkh,dgoswami}@iitg.ernet.in
Abstract - Structured peer-to-peer systems like Chord are popular because of their deterministic routing, which bounds the number of messages needed to find the desired data. In these systems, the DHT abstraction, heterogeneity in node capabilities, and the distribution of queries over the data space lead to load imbalance. We present a 2-tier hierarchical architecture that deals with churn and load imbalance, thereby reducing the cost of load balancing and improving response time. The proposed load balancing approach is decentralized and exploits the physical proximity of nodes in the overlay network.
Keywords: P2P network, overlay network, load balancing, DHT, Super node, Normal node.
1 Introduction
Peer-to-Peer (P2P) systems offer a different paradigm for resource sharing. In a P2P system every node has equal functionality: each node both provides services to and takes services from the network. The main objective of P2P systems is to utilize resources efficiently without overloading any node. Structured P2P systems such as Chord [1] and CAN [2], also called DHT-based systems, map data and node ids to a number space, with each node responsible for a part of the data space. Since each node is mapped to a part of the data space, finding data in these systems is straightforward; this deterministic behavior and the bounded number of search messages have made them popular. The DHT abstraction assumes homogeneous node capabilities, which does not hold in practice, since each node possesses different capabilities. DHTs use consistent hashing to map data items to nodes; however, consistent hashing produces a bound of O(log N) imbalance of keys among the nodes, where N is the number of nodes in the system [3]. Moreover, neither the data shared to the network nor the queries issued over it are uniformly distributed. Thus the DHT abstraction's homogeneity assumption, the data shared to the network, and the query distribution together lead to a considerable load imbalance in the system. We present a 2-tier hierarchical architecture that achieves proximity-aware load balancing to improve response time, handles churn to reduce message traffic, and deals with free-riding by ensuring that each node provides sufficient resources, countering greedy nodes that take service from the network without sharing anything with it. The rest of this paper is organized as follows: Section 2 presents related work, Section 3 describes the proposed overlay system, and Section 4 concludes the paper.
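The O(log N) key imbalance of plain consistent hashing, mentioned in the introduction, can be observed with a small simulation. The sketch below is our own illustration (not from the paper): it hashes node ids onto the unit ring and compares the largest key share any node owns against the average share of 1/N.

```python
import hashlib

def ring_position(name: str) -> float:
    """Map a node id onto the unit ring [0, 1) via SHA-1."""
    digest = hashlib.sha1(name.encode()).hexdigest()
    return int(digest, 16) / 16**40

def arc_imbalance(num_nodes: int) -> float:
    """Largest arc (key share) owned by any node, relative to the 1/N average."""
    points = sorted(ring_position(f"node-{i}") for i in range(num_nodes))
    arcs = [b - a for a, b in zip(points, points[1:])]
    arcs.append(1.0 - points[-1] + points[0])  # wrap-around arc
    return max(arcs) * num_nodes  # average arc is 1/num_nodes

for n in (64, 256, 1024):
    print(n, round(arc_imbalance(n), 2))  # ratio grows roughly like log N
```

The printed ratio stays well above 1 and tends to grow with N, which is exactly the imbalance the load balancing schemes surveyed next try to correct.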
2 Related work
DHT-based P2P systems [1][2][4][5] address the load balancing issue in a simple way, by relying on the uniformity of the hash function used to generate object ids. This random choice of object ids does not produce perfect load balance, and these systems ignore the node heterogeneity observed in [6]. Chord [1] was the first to propose the concept of virtual servers to address load balancing, by having each node simulate a logarithmic number of virtual servers; virtual servers, however, do not completely solve the load balancing issue. Rao et al. [7] proposed three simple load balancing schemes for DHT-based P2P systems: One-to-One, One-to-Many and Many-to-Many. The basic idea behind their schemes is that virtual servers are moved from heavy nodes to light nodes. Byers et al. [8] addressed load balancing from a different perspective: they proposed using the "power of two choices" paradigm, in which each data item is hashed to a small number of different ids and then stored on the least loaded node among those responsible for these ids. Karger et al. [9] proposed two load balancing protocols, namely address-space balancing and item balancing. Proximity information has been used both in topologically-aware DHT construction [10] and in proximity neighbor selection in P2P routing tables [11][12]; in both cases its primary purpose is to improve the performance of DHT overlays, whereas we use proximity for load balancing. Haiying Shen et al. [13] presented a new structure called Cycloid that deals with load balancing by considering proximity.
Zhenyu et al. [14] proposed a distributed load balancing algorithm based on virtual servers, in which each node aggregates the load information of its connected neighbors and moves virtual servers from heavily loaded nodes to lightly loaded nodes. Our architecture exploits node capabilities and the proximity of nodes in the network for load balancing. It uses Chord in the upper level of the hierarchy and deals with free-riding by constraining a minimum amount of sharing on the nodes in the lower level of the hierarchy.
3 System Model
In this model we broadly categorize nodes into two types, Super nodes and Normal nodes, based on capabilities such as bandwidth, processing power, storage, and the number of nodes they can connect to.
i. Super node: A Super node is exposed to the upper-level DHT-based network (Chord). It manages the group of Normal nodes connected to it. The data mapped onto a Super node is stored on the Normal nodes of its group. It represents a group of Normal nodes in the DHT-based network and routes the queries of the Normal nodes in its group.
ii. Normal node: A Normal node joins the network in the second level of the hierarchy. It maintains the data mapped onto the Super node to which it is connected, and is not exposed to the upper-level DHT-based network. Normal nodes are in turn classified into two types: Stable-Normal nodes and Unstable-Normal nodes.
Fig. 1. Categorization of Normal nodes into Unstable and Stable Normal nodes (threshold Tstab).
i. Stable-Normal node: a Normal node that has been in the overlay network for Tstab or more time units. Such nodes are expected to remain in the network with acceptable probability, and they are the ones considered for load balancing by a Super node. In this paper, "Normal node" refers to a Stable-Normal node unless explicitly stated otherwise.
ii. Unstable-Normal node: a recently joined node that may leave the network at any time. These are not considered by the Super node for load balancing.
3.1 Bootstrapping
We assume that at network startup, joining nodes are Super nodes and are added directly in the upper level (Chord). If N is the size of the complete data space, then as soon as the number of nodes in the network reaches log N, node addition in the upper level is stopped; initially, the number of groups is G = log N. Thereafter, nodes join the second level of the network by connecting to the nearest Super node. Each Super node manages a group of Normal nodes, with at most Gmax nodes per group. The moment a Super node detects that the number of nodes in its group exceeds Gmax, it splits the group into two, dividing both nodes and data space equally; splitting a group triggers a node addition in the upper level of the hierarchy. Conversely, when a Super node detects that its group has fewer than Gmin nodes, the group's nodes and data space are divided equally between the predecessor and successor groups, which triggers a node deletion in the upper level.

Fig. 2. 2-Tier Hierarchical Overlay Structure (Super nodes on the Chord ring, Normal nodes beneath them).
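The group-split rule just described can be sketched as follows. The paper gives no code, so the class and the threshold name are ours; only the splitting logic (divide members and data space equally once Gmax is exceeded) comes from the text.

```python
GMAX = 8  # assumed maximum group size Gmax

class Group:
    """A Super node's group: a slice [lo, hi) of the data space plus its members."""
    def __init__(self, lo, hi, members):
        self.lo, self.hi = lo, hi
        self.members = list(members)

def maybe_split(group):
    """If the group exceeds GMAX, divide nodes and data space equally.
    Returns the new sibling group, whose creation corresponds to a node
    addition in the upper-level Chord ring; returns None otherwise."""
    if len(group.members) <= GMAX:
        return None
    mid = (group.lo + group.hi) / 2
    half = len(group.members) // 2
    sibling = Group(mid, group.hi, group.members[half:])
    group.hi = mid
    group.members = group.members[:half]
    return sibling
```

Merging an undersized group (below Gmin) into its predecessor and successor would be the symmetric operation, triggering a node deletion in the Chord ring.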
3.2 Data Distribution
Every Super node divides its data space into K sets and distributes each set to a group of Normal nodes. At startup, nodes are assigned to the data regions in round-robin fashion. For instance, consider a Super node S1 whose data space runs from X1 to X2, divided into K sets, with 2K nodes in the group. Data is assigned as follows:

TABLE I
Data distribution in a group

Data region   | Node ids
X1 - P1       | N1, N(K+1)
P1 - P2       | N2, N(K+2)
...           | ...
P(K-1) - X2   | NK, N2K

At group formation, nodes are assigned to the data sub-regions in round-robin fashion. As more nodes join the group, each sub-range receives more nodes. Since the nodes in a group are physically near each other, this facilitates load balancing without any additional cost. As the load pertaining to a data sub-range changes dynamically, more nodes are added to the corresponding sub-range.
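The round-robin assignment of Table I can be written directly; with 2K nodes, sub-region i receives nodes N_i and N_{K+i}. A minimal sketch (ours, for illustration):

```python
def assign_round_robin(node_ids, k):
    """Distribute nodes over K data sub-regions in round-robin order,
    as in Table I: with 2K nodes, region i holds N_i and N_{K+i}."""
    regions = [[] for _ in range(k)]
    for idx, node in enumerate(node_ids):
        regions[idx % k].append(node)
    return regions

# Six nodes over K = 3 sub-regions.
print(assign_round_robin(["N1", "N2", "N3", "N4", "N5", "N6"], 3))
# [['N1', 'N4'], ['N2', 'N5'], ['N3', 'N6']]
```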
3.3 Query Processing
In the proposed architecture only Super nodes are exposed in the DHT-based network; Normal nodes search for data through the Super node to which they are connected. For instance, consider two Super nodes S1 and S2, with Normal nodes S11, S12 under S1 and S21, S22 under S2, and suppose the data sought by S11 resides on a node of S2's group, say S21. S11 sends the query to S1, which runs the regular data-search algorithm in the upper-level Chord network and finds that the data is in group S2. S2 returns the id of S21 to S1, which forwards it to S11, and S11 then downloads the data directly from S21. In this way, Super nodes mediate the data search for Normal nodes.
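The mediated lookup can be sketched with a stub Super node; this is our own illustration, and the flat `peers` list stands in for the real Chord finger-table routing, which the paper delegates to the upper-level ring.

```python
class SuperNode:
    """Stub Super node: maps keys to the Normal node in its group that
    stores them, and can ask other Super nodes (a flat peer list here,
    standing in for the Chord lookup) on behalf of its Normal nodes."""
    def __init__(self, name):
        self.name = name
        self.key_to_normal = {}  # key -> Normal-node id within this group
        self.peers = []

    def resolve(self, key):
        """Return the Normal node holding `key`, or None if unknown."""
        if key in self.key_to_normal:
            return self.key_to_normal[key]
        for peer in self.peers:
            if key in peer.key_to_normal:
                return peer.key_to_normal[key]
        return None

# S11 asks its Super node S1 for a key stored in S2's group; S1 answers
# "S21", and S11 would then download the data directly from S21.
s1, s2 = SuperNode("S1"), SuperNode("S2")
s1.peers, s2.peers = [s2], [s1]
s2.key_to_normal["song.mp3"] = "S21"
print(s1.resolve("song.mp3"))  # S21
```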
3.4 Grading of nodes
Each Super node periodically collects the load of the Normal nodes in its group and grades them according to the following definition. The load of a node, LN, is the ratio of the amount of data requests it receives to the amount of data it can serve. Let Ar be the amount of data requests received and As the amount of data the node can serve; then

LN = Ar / As.

The main objective is to keep the load of each node between 0 and 1, i.e. 0 < LN < 1. The different types of nodes in a group are:
i. Lightly loaded nodes: 0 <= LN < 0.5.
ii. Normally loaded nodes: 0.5 <= LN < 1.
iii. Heavily loaded nodes: LN >= 1.
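The grading rule is a direct translation of the definition LN = Ar/As and the thresholds above:

```python
def node_load(requests_received: float, capacity: float) -> float:
    """L_N = A_r / A_s: data requests received over data the node can serve."""
    return requests_received / capacity

def grade(load: float) -> str:
    """Classify a node per Section 3.4 as light, normal, or heavy."""
    if load < 0.5:
        return "light"
    if load < 1.0:
        return "normal"
    return "heavy"

print(grade(node_load(2, 8)))   # 0.25 -> light
print(grade(node_load(6, 8)))   # 0.75 -> normal
print(grade(node_load(12, 8)))  # 1.5  -> heavy
```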
3.5 Load Balancing
In this model load balancing is applied at two levels: (i) intra-group and (ii) inter-group.
3.5.1 Intra-group load balancing: This is applied only when a Super node has enough lightly loaded nodes. The data space of a Super node is divided into K intervals and distributed uniformly among the nodes of the group at group-creation time. As the load on a particular interval increases, lightly loaded nodes associated with other intervals are reassigned to the loaded interval. For instance, consider a Super node S with 5 nodes in its group, and let its data space (X1 - X2) be divided at point a.

TABLE II
Data distribution of Super node S

Data region   | Node ids
X1 - a        | N1, N3, N5
a - X2        | N2, N4
Now if the region (a - X2) is queried more, the load on nodes N2 and N4 increases relative to the nodes in region (X1 - a), so nodes from that region are moved to region (a - X2). Since the nodes in a group are physically near, the cost of moving load from one node to another is negligible.
3.5.2 Inter-group load balancing: This is applied when a Super node does not have enough lightly loaded nodes. There are two alternatives:
i. When a Super node detects that there are not enough lightly loaded nodes in its group, it sends a "GET-NODE" message to all the nodes in the group. "GET-NODE" is a simple message asking for more nodes for the group. Nodes in the group search for lightly loaded nodes in their proximity, as shown in Fig. 3, and bring them to the group in need of lightly loaded nodes. This creates a flow of ping messages in the proximity of the group. Since the nodes are searched for in the nearby proximity, the cost of getting nodes from other groups is also negligible.
ii. a. If none of the nodes can find lightly loaded nodes in the nearby proximity, the Super node sends a multicast message to the log G Super nodes whose information is maintained in its finger table.
b. If none of the log G nodes respond to the message, the logically connected neighbors send the message along the DHT ring to get
new lightly loaded nodes. A Super node always chooses the physically nearest node when selecting a candidate for load balancing.

Fig. 3. Inter-group load balancing (GET-NODE messages are spread in the group's proximity).

3.6 Cost of Load Balancing
i. The cost of intra-group load balancing is negligible, because lightly loaded nodes are available within the group and the only cost is moving the load.
ii. The cost of inter-group load balancing is:
a. In the first alternative, the cost is negligible, since lightly loaded nodes are searched for in the nearby proximity.
b. In the second alternative: the load of a node depends on the data it maintains, the rate at which its data space is queried, and the availability of node resources. Hence, when a Super node asks its connected Super nodes for lightly loaded nodes, each of them may or may not have one; the probability that a given Super node can provide a lightly loaded node is taken to be 1/2. When a Super node sends the message to the log G Super nodes in its finger table, the probability of getting at least one response is (2^(log G) - 1) / 2^(log G), which is approximately 1. Hence the cost of load balancing is O(log G). In the worst case, if none of the nodes respond to the message, the message is passed along the Chord ring in search of lightly loaded nodes. Here the cost is O(G); but, as described in Section 3.1, the number of groups is G = log N, so the cost of load balancing is O(log N).

3.7 Load of a Group
Every Super node ensures, through the mechanisms described in Section 3.5, that at least 10% of the nodes in its group are lightly loaded, in order to handle sudden changes in load. Each lightly loaded node shares the load of a heavily loaded node, after which both become normally loaded; hence the expected proportion of normally loaded nodes is 50%. Heavily loaded nodes form the rest of the group. Lightly loaded nodes thus make up between 10% and 50% of the group, and heavily loaded nodes between 0% and 40%. The load of a group is defined as the average load of all the nodes in the group. Let M be the number of nodes in a group.
3.7.1 Lower bound: Number of lightly loaded nodes: 0.5 x M; normally loaded nodes: 0.5 x M; heavily loaded nodes: 0. Load of lightly loaded nodes: 0.1; of normally loaded nodes: 0.5; of heavily loaded nodes: 1. Load of the group:
LG = (0.5M x 0.1 + 0.5M x 0.5 + 0 x M x 1) / M = (0.05M + 0.25M) / M = 0.3.
3.7.2 Upper bound: Number of lightly loaded nodes: 0.1 x M; normally loaded nodes: 0.5 x M; heavily loaded nodes: 0.4 x M. Load of lightly loaded nodes: 0.4; of normally loaded nodes: 0.9; of heavily loaded nodes: 1.
LG = (0.1M x 0.4 + 0.5M x 0.9 + 0.4M x 1) / M = (0.04M + 0.45M + 0.4M) / M = 0.89.
The average load of a group thus lies between these bounds, 0.3 <= LG <= 0.89, keeping the group normally loaded on average.
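Both bounds are just weighted averages over the three node classes; a quick numerical check (note that the upper-bound terms 0.04 + 0.45 + 0.4 sum to 0.89):

```python
def group_load(fractions, loads):
    """L_G: average node load = sum over classes of (fraction x per-node load)."""
    return sum(f * l for f, l in zip(fractions, loads))

# Lower bound: 50% light at load 0.1, 50% normal at 0.5, no heavy nodes.
low = group_load((0.5, 0.5, 0.0), (0.1, 0.5, 1.0))
# Upper bound: 10% light at 0.4, 50% normal at 0.9, 40% heavy at 1.
high = group_load((0.1, 0.5, 0.4), (0.4, 0.9, 1.0))
print(round(low, 2), round(high, 2))  # 0.3 0.89
```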
3.8 Node Joining/Failure
Normal nodes join the network at the lower level of the hierarchy by connecting to the nearest Super node. Since they are not exposed to the upper level of the hierarchy, the addition or failure of a Normal node does not affect the system. Super nodes are exposed to the upper level; when a Super node fails, one of its Normal nodes will, with high probability, replace it. When a group joins or fails, the finger tables of the other Super nodes need to be updated. Let
Rj: rate of nodes joining the network;
Rl: rate of nodes leaving the network;
Gj: rate of getting nodes from nearby groups;
Gl: rate of giving nodes to other groups;
Gmax: maximum group size;
G: number of groups.
The effective number of nodes joining the system per unit time is (Rj + Gj - Rl - Gl), so the effective number of nodes joining per group is (Rj + Gj - Rl - Gl) / G. A group is created only when the number of nodes in a group crosses Gmax, so the number of groups created by a group per unit time is (Rj + Gj - Rl - Gl) / (Gmax x G). Adding a group is equivalent to adding a node in Chord, which takes at most O(log^2 N) messages, where N is the number of nodes in the system [1]; here the number of nodes in the ring is the number of groups G, so adding a group takes O(log^2 G) messages. The number of messages routed because of new groups created by one group is therefore ((Rj + Gj - Rl - Gl) / (Gmax x G)) x O(log^2 G), and the total number of messages routed in the upper-level network is ((Rj + Gj - Rl - Gl) / Gmax) x O(log^2 G). Since group joining and group deletion depend on the parameters above, these events do not occur very often; hence the churn rate does not degrade system performance.
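The churn arithmetic above can be parameterized directly. This is our own sketch: the O(log^2 G) Chord-join cost is treated as exactly log2(G)^2 messages, and the rate values are invented for illustration.

```python
import math

def groups_created(rj, gj, rl, gl, gmax, g):
    """Groups created per unit time by one group: (Rj+Gj-Rl-Gl)/(Gmax*G)."""
    return (rj + gj - rl - gl) / (gmax * g)

def upper_level_messages(rj, gj, rl, gl, gmax, g):
    """Total messages in the upper level: G groups, each triggering
    groups_created joins per unit time, each join costing about
    log2(G)^2 messages."""
    per_join = math.log2(g) ** 2
    return g * groups_created(rj, gj, rl, gl, gmax, g) * per_join

# Assumed rates: 20 joins and 12 leaves per unit time, no inter-group
# node movement, Gmax = 8, G = 16 groups.
print(upper_level_messages(20, 0, 12, 0, 8, 16))  # 16.0
```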
3.9 Data Insertion
When a node shares data to the network, the Super node onto which this data maps must be identified. This is similar to searching for data in Chord; [1] proved that in a Chord network the number of steps to find data is O(log N). In the proposed network there are G groups in the upper-level hierarchy, hence the cost of data insertion is O(log G).

TABLE III
Analysis of the system

Function                    | Messages
Data insertion              | O(log G)
Query processing            | O(log G)
Node join/failure           | Constant
Group insertion             | O(log^2 G)
Intra-group load balancing  | Constant
Inter-group load balancing  | Avg: O(log G); worst case: O(G)
3.10 Free-Riding
Since Normal nodes join the network by connecting to the nearest Super node, at joining time the Super node imposes a minimum constraint on the amount of sharing space the node must provide to the network. Nodes that do not give sufficient sharing space are not connected to the network, and nodes that do not provide enough sharing cannot be used for handling the data space of the Super node. Thus free-riding is controlled at the lower level of the network.
4 Conclusion
In this paper we present efficient load balancing mechanisms that exploit locality in a 2-tier hierarchical overlay network. In the proposed architecture, nodes are classified as Super nodes and Normal nodes. The number of Super nodes is initially limited to log N, where N is the size of the data space. Super nodes balance load by moving it across Normal nodes that are physically near each other. Since the number of Super nodes is limited and the rate at which Super nodes join the network is very low, the architecture copes well with churn. Constraining the minimum amount of sharing by the Normal nodes regulates free-riding.
References
[1] Ion Stoica, R. Morris, D. Karger, M. Kaashoek, and H. Balakrishnan, Chord: A scalable peer-to-peer lookup service for internet applications, in Proceedings of ACM SIGCOMM, 2001.
[2] Sylvia Ratnasamy, P. Francis, M. Handley, R. Karp, and S. Shenker, A Scalable Content-Addressable Network, in Proceedings of ACM SIGCOMM, 2001.
[3] David Karger, Eric Lehman, Tom Leighton, Matthew Levine, Daniel Lewin, and Rina Panigrahy, Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web, in Proceedings of the Symposium on Theory of Computing (STOC '97), pp. 654-663, 1997.
[4] Antony Rowstron and P. Druschel, Pastry: Scalable, distributed object location and routing for large-scale peer-to-peer systems, in IFIP/ACM International Conference on Distributed Systems Platforms (Middleware), Heidelberg, Germany, pp. 329-350, November 2001.
[5] Ben Y. Zhao, L. Huang, J. Stribling, S. C. Rhea, A. D. Joseph, and J. D. Kubiatowicz, Tapestry: A Resilient Global-Scale Overlay for Service Deployment, IEEE Journal on Selected Areas in Communications, vol. 22, 2004.
[6] Stefan Saroiu, P. Krishna Gummadi, and Steven D. Gribble, A Measurement Study of Peer-to-Peer File Sharing Systems, in Proceedings of Multimedia Computing and Networking (MMCN), 2002.
[7] A. Rao, K. Lakshminarayanan, S. Surana, R. Karp, and I. Stoica, Load Balancing in Structured P2P Systems, in Proceedings of the Second International Workshop on P2P Systems (IPTPS), pp. 68-79, Feb. 2003.
[8] J. W. Byers, J. Considine, and M. Mitzenmacher, Simple Load Balancing for Distributed Hash Tables, in Proceedings of the Second International Workshop on P2P Systems (IPTPS), pp. 80-87, Feb. 2003.
[9] D. R. Karger and M. Ruhl, Simple Efficient Load Balancing Algorithms for Peer-to-Peer Systems, in Proceedings of the Third International Workshop on P2P Systems (IPTPS), Feb. 2004.
[10] S. Ratnasamy, M. Handley, R. M. Karp, and S. Shenker, Topologically-Aware Overlay Construction and Server Selection, in Proceedings of IEEE INFOCOM, vol. 3, pp. 1190-1199, June 2002.
[11] Z. Xu, C. Tang, and Z. Zhang, Building Topology-Aware Overlays Using Global Soft-State, in Proceedings of the 23rd International Conference on Distributed Computing Systems (ICDCS), pp. 500-508, May 2003.
[12] Z. Xu, M. Mahalingam, and M. Karlsson, Turning Heterogeneity into an Advantage in Overlay Routing, in Proceedings of IEEE INFOCOM, vol. 2, pp. 1499-1509, April 2003.
[13] Haiying Shen and Cheng-Zhong Xu, Locality Aware and Churn Resilient Load Balancing Algorithms in Structured Peer-to-Peer Networks, IEEE Transactions on Parallel and Distributed Systems, vol. 18, no. 6, June 2007.
[14] Zhenyu Li and Gaogang Xie, A Distributed Load Balancing Algorithm for Structured P2P Systems, in Proceedings of the 11th IEEE Symposium on Computers and Communications (ISCC '06).
Grid Performance Analysis in a Parallel Environment

Y. S. Atmaca (1), O. Kandara (1), and A. Tan (2)
(1) Computer Science Department, Southern University and A&M College, Baton Rouge, Louisiana, USA
(2) Center for Energy and Environmental Services, Southern University and A&M College, Baton Rouge, Louisiana, USA
Abstract - High performance computing is a relatively new, emerging technology that is gaining importance every day. Supercomputing, clustering, and grid computing are the currently existing technologies used in high performance computing. In this paper we first study the three types of emerging grid technologies, namely cluster grids, campus grids, and global grids; we use a cluster grid in our implementation. We used 11 nodes, one of which is dedicated to submitting jobs, while the remaining 10 serve as computational units. We then carried out a performance analysis based on the number of nodes in the grid and observed that peak performance is achieved with a certain number of nodes: more nodes do not always mean more performance. Our study shows that a grid for high performance computing can be created with the optimal number of nodes for jobs with the same or similar characteristics. This suggests building additional parallel grids from nodes whose addition to an existing grid would not be cost-efficient in performance.
Keywords: High performance computing, grid computing, grid performance analysis, cluster grid.

1 Introduction
One way of categorizing a computational problem in computer science is by its degree of "parallelism". If a problem can be split into many smaller sub-problems that can be worked on by different processors in parallel, computation can be sped up considerably by using many computers. Fine-grained calculations are better suited to big, monolithic supercomputers, or at least very tightly coupled clusters of computers, which have many identical processors with an extremely fast, reliable network between them, ensuring there are no bottlenecks in communications. This type of computing is often referred to as high-performance computing (HPC). In its current stage of evolution, most applications of the Grid fall into the HPC classification. This is because Grid computing arose out of the need for more cost-effective HPC solutions to critical problems in science and engineering. The initial adoption of the Grid by commercial enterprises has continued to focus on HPC because of the high return on investment and the competitive advantage realized by solving compute-intensive problems that were previously unsolvable in a reasonable period of time or at reasonable cost. A simplified basic architecture of a Grid is shown in Figure 1, with Grid middleware providing the location transparency that allows applications to run over a virtualized layer of networked resources. The key aspect of middleware is that it gives the Grid the semblance of a single computer system, providing coordination among all the computing resources that comprise the Grid. These functions usually include tools for resource discovery and monitoring, resource allocation and management, security, performance monitoring, and accounting.
Figure 1: Basic Architecture of a Grid

Key to the success of Grid computing is the development of the 'middleware', the software that organizes and integrates the disparate computational facilities belonging to a Grid. Its main role is to automate all the "machine to machine" negotiations required to interlace the computing and storage resources and the network into a single, seamless computational "fabric".
There is a considerable amount of debate as to whether a local computational cluster of computers should be classified as a Grid. There is no doubt that clusters are conceptually quite similar to Grids. Most importantly, both depend on middleware to provide the virtualization needed to make a multiplicity of networked computer systems appear to the user as a single system. The middleware for clusters and Grids therefore addresses the same basic issues, including message passing for parallel applications. As a result, the high-level architecture of a cluster is essentially the same as that of a Grid.
1.1 Types of Grids
Grid computing vendors have adopted various nomenclatures to explain and define the different types of grids. Some define grids based on the structure of the organization that is served by the grid, while others define it by the principal resources used in the grid. We classify grids into three groups according to their regional service capability. Cluster grids are the simplest: they are made up of a set of computer hosts that work together, and a cluster grid provides a single point of access to users in a single project or department. Campus grids enable multiple projects or departments within an organization to share computing resources; organizations can use campus grids to handle a variety of tasks, from cyclical business processes to rendering, data mining, and more. Global grids are collections of campus grids that cross organizational boundaries to create very large virtual systems, giving users access to compute power that far exceeds the resources available within their own organization.
In a cluster grid, a user's job is handled by only one of the systems within the cluster. However, the user's cluster grid might be part of a more complex campus grid, and the campus grid might in turn be part of a larger global grid. In such cases, the user's job can be handled by any member execution host located anywhere in the world [1].

2 Grid System Configuration Methodology
To implement a Grid Computing System, we followed three steps: 1) questions to ask before implementing the grid, as an elicitation process; 2) planning; 3) verification. We asked the following questions before implementing the grid as an elicitation process:
I) What kind of grid will be implemented: cluster, departmental, or global?
II) Do we have a room for the grid that meets our needs?
III) Which vendor's software will be chosen as the grid engine?
IV) How can we obtain the required software, and what is the budget availability?
V) What documentation should be used to implement the grid successfully?
VI) Is the grid software a proper choice for the grid that we are about to implement?
VII) Will the grid be composed of heterogeneous or homogeneous systems?
VIII) What are the requirements for networking among the computers?
IX) What kind of naming service do we need?
X) Are we doing load balancing or parallel processing?
XI) What kind of parallel environment is proper for the grid?
XII) Will the grid system be available from outside the network?
After finding proper answers to the questions in step 1, we continued with step 2. In planning, we went through these steps:
I) Deciding whether our Grid Engine environment will be a single cluster or a collection of sub-clusters called cells;
II) Selecting the machines that will be Grid Engine hosts, and determining what kind(s) of host each machine will be: master host, shadow master host, administration host, submit host, execution host, or a combination;
III) Making sure that all Grid Engine users have the same user ids on all submit and execution hosts;
IV) Deciding what the Grid Engine directory organization will be (for example, a complete tree on each workstation, cross-mounted directories, or a partial directory tree on some workstations) and where each Grid Engine root directory will be located;
V) Deciding on the site's queue structure;
VI) Deciding whether network services will be defined as an NIS file or locally to each workstation in /etc/services;
VII) Completing the installation plan.
The last step in implementing a Grid Computing System is verification. After finishing steps 1 and 2, we review the questions and the solutions we came up with to check that they are really feasible and logical to implement. This step makes sure everything is fine and we are ready to start; if we skip step 3, we might have to set up the whole system again and again. A grid system also needs several pieces of software to run, and they have to support each other.

3 Experimental Data of Grid Implementation
To perform the performance test, we implemented a cluster grid with 11 nodes. The grid system is composed of 1 submit host, 1 administration and master host, and 10 execution hosts. Networking among the grid system
is made by a 12-port AT&T hub with Category 5 cable connections. The Sun Solaris operating system is installed on each host. We then decided how the hosts would know each other and set up a Domain Name Server (DNS) on the master host. Every host has to be in the DNS list; otherwise the installation of the daemons will fail. Before installing the daemons, all systems must know each other: to ensure this, a file called hosts is created under /etc and the hosts are listed there. If you skip this step, the mount utility will not work and you will not be able to use the Network File System (NFS). NFS is set up in order to have a shared folder: it allows you to work locally with files stored on a remote computer's disk as if they were present on your own computer. The remote system acts as a file server, and the local system acts as the client that queries the file server. Another important step is defining SGE_ROOT, the root folder of the grid, which has to be defined on every host; if you try to install the daemons before defining SGE_ROOT, the installation will fail. In this implementation SGE_ROOT is /opt/grid5.3. The services for the grid also have to be defined under /etc/services before the daemons are installed. Figure 2 shows our Grid Network Architecture.
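The prerequisites listed above (resolvable host names and a defined SGE_ROOT directory) can be sanity-checked before running the daemon installers. This is our own helper sketch, not part of the Grid Engine tooling; the example host names are assumptions for a lab grid like ours.

```python
import os
import socket

def missing_prerequisites(hosts, sge_root="/opt/grid5.3"):
    """Return human-readable problems for the setup steps from the text:
    every host must resolve (via DNS or /etc/hosts) and the SGE_ROOT
    directory must exist on this machine."""
    problems = []
    for host in hosts:
        try:
            socket.gethostbyname(host)
        except OSError:
            problems.append(f"{host}: not resolvable; add it to DNS or /etc/hosts")
    root = os.environ.get("SGE_ROOT", sge_root)
    if not os.path.isdir(root):
        problems.append(f"SGE_ROOT {root}: directory missing; define it on every host")
    return problems

# Hypothetical host names for an 11-node lab grid.
for problem in missing_prerequisites(["master", "submit", "exec01"]):
    print(problem)
```

Running this on each host before installation avoids the repeated-setup failures described in the verification step.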
3.1
Host Daemons
Master Host Daemons: The daemon for the master host is sge_qmaster daemon which is responsible for center of the cluster’s management and scheduling activities, sge_qmaster maintains tables about hosts, queues, jobs, system load, and user permissions. It receives scheduling decisions from sge_schedd and requests actions from sge_execd on the appropriate execution hosts. Master computer has also sge_schedd daemon, as we mentioned above it makes the decisions fro which jobs are dispatched to which queues and then forwards these decisions to sge_qmaster, which initiates the required actions. Another daemon that a master runs is sge_commd, this daemon provides the communication between hosts, therefore it has to run on each hosts, not only on the master host.
Figure 2: Lab Grid Network

Submit Host Daemons: The only daemon that runs on submit hosts is sge_commd, which provides the TCP communication. We need to add this service to the services file over a well-known TCP port on the submit host; in this implementation TCP port 536 is used.

Execution Host Daemons: Execution hosts run sge_execd, which is responsible for the queues on its host and for the execution of jobs in those queues. Periodically, it forwards information such as job status or the load on its host to sge_qmaster. We installed the execution hosts after the master and submit hosts were ready to run. Before sge_execd is installed, we need to specify which services will be used for the grid: in the services file we defined sge_commd and sge_execd as grid services, and then installed the sge_execd daemon. If permissions are not set as needed, the grid system will not operate and cannot finish the submitted jobs; the shared folders should have write permission along with read permission.

Since the test objective is to compare the performance of an individual computer with the Sun grid system, a parallel environment has to be set up for the grid. The test will be carried out by sending a job to
the grid, which processes this job in a parallel environment, and to an individual system that processes the job by itself. As soon as a job is sent to the grid, the master computer checks its host tables, which show which computers have a low load, which host is running which job, and which hosts are most available to handle the job. According to this table information, the execution hosts take over the job, run it to completion, and create the output.

The first thing to do is to choose the environment we work with. There are two choices: PVM (Parallel Virtual Machine) and MPI (Message Passing Interface). PVM is a software package that permits a heterogeneous collection of UNIX and/or Windows computers hooked together by a network to be used as a single large parallel computer. Thus large computational problems can be solved more cost-effectively by using the aggregate power and memory of many computers. The software is very portable, and PVM enables users to exploit their existing computer hardware to solve much larger problems at minimal additional cost. Hundreds of sites around the world are using PVM to solve important scientific, industrial, and medical problems, in addition to PVM's use as an educational tool to teach parallel programming. With tens of thousands of users, PVM has become the de facto standard for distributed computing world-wide [2]. MPI, on the other hand, was designed for high performance on both massively parallel machines and workstation clusters. MPI is widely available, with both freely available and vendor-supplied implementations, and it is the only message passing library which can be considered a standard. It is supported on virtually all HPC platforms, and there is no need to modify your source code when you port your application to a different platform that supports (and is compliant with) the MPI standard [3].
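Once a parallel environment is configured, the comparison test described above could be driven with commands along these lines; the parallel environment name pvm, the slot counts and the job script name are assumptions, not details from the paper:

```shell
# Submit the same job script to the grid's PVM parallel environment
# with different slot counts, plus one single-machine baseline run
qsub -pe pvm 2 ./test_job.sh     # two execution hosts
qsub -pe pvm 9 ./test_job.sh     # nine execution hosts
qstat -f                         # watch which execution hosts take the job
```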
4 Discussion and Conclusion
A grid configuration methodology has been given as a step-by-step procedure. Before implementing a grid, system administrators should follow the given procedures and understand the core requirements in order to implement the grid successfully. Section 3 provides a detailed explanation of the procedures the system administrators are supposed to go through. If the grid is set up in an appropriate way, with the core requirements understood, there is no doubt about its robustness. As the goal of this paper, we then carry out a performance analysis on the grid.
Figure 3: Grid Performance Table

Figure 3 presents the grid performance analysis that we carried out using ten nodes. These computers have the same hardware characteristics and the same operating system, Sun Solaris 9. Looking at the results, we can easily notice that more nodes generally mean more performance, and from one node to two nodes the performance doubles.
Because the main logic of Grid computing is to use unused resources and to obtain high-performance computing from those resources, it is more appropriate and logical to use PVM in our grid. PVM is better here because applications will run over heterogeneous networks, and it has good interoperability between different hosts. PVM allows the development of fault-tolerant applications that can survive host or task failures. Because the PVM model is built around the virtual machine concept (not present in the MPI model), it provides a powerful set of dynamic resource management and process control functions [4].

Figure 4: Grid Performance Analysis
Figure 4 is the graphical presentation of the same analysis. The job duration is more than 12 minutes for a single node, and when we run two nodes on the same job, the duration decreases to 6 minutes. This is a great performance increase, and as we add further nodes to the same job, performance keeps increasing noticeably. When we add the tenth node, however, performance decreases compared with the ninth run: with nine nodes the job finishes in 2.10 minutes, whereas with ten nodes it takes 2.20 minutes. This shows that performance depends not only on the number of nodes but also on the job's characteristics. We can state that the peak-performance node count for this particular job is nine nodes. Therefore more nodes do not always mean more performance, and for any particular job this peak-performance node count can differ.
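The reported timings can be turned into the usual speedup and efficiency figures, which make the drop at ten nodes explicit. This sketch reads 2.10 and 2.20 as decimal minutes; if the paper means 2 min 10 s and 2 min 20 s, the trend is unchanged:

```shell
# Speedup S(n) = T(1)/T(n) and efficiency E(n) = S(n)/n, using the run
# times reported in the text for 1, 2, 9 and 10 nodes
for pair in "1 12.0" "2 6.0" "9 2.10" "10 2.20"; do
  set -- $pair
  awk -v n="$1" -v t="$2" \
    'BEGIN { s = 12.0 / t; printf "n=%2d T=%5.2f S=%4.2f E=%4.2f\n", n, t, s, s / n }'
done
```

With these numbers the efficiency falls from 1.00 at two nodes to roughly 0.55 at ten, which matches the observation that nine nodes is the peak for this particular job.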
5 References
[1] http://docs.sun.com/app/docs/doc/817-6117 as of January 09, 2008
[2] http://www.csm.ornl.gov/pvm/ as of January 15, 2008
[3] http://www.llnl.gov/computing/tutorials/mpi/#What as of February 12, 2008
[4] http://www.rrsg.uct.ac.za/mti/pvmvsmpi.pdf as of January 15, 2005
[5] V. Getov et al., Performance Analysis and Grid Computing, Kluwer Academic Publishers, 2004, ISBN-10: 1402076932.
[6] G. Liu et al., "The HPCVL Working Template: A Tool for High-Performance Programming", 19th International Symposium on High Performance Computing Systems and Applications (HPCS'05), pp. 110-116.
[7] H. Truong, T. Fahringer, "Online Performance Monitoring and Analysis of Grid Scientific Workflows", Advances in Grid Computing – EGC 2005: European Grid Conference, Amsterdam, pp. 1154-1164.
SESSION LATE PAPERS Chair(s) TBA
Implementation of Enhanced Reflective Meta Object to Address Issues Related to Contractual Resource Sharing within Ad-hoc Grid Environments

Syed Alam, Norlaily Yaacob and Anthony Godwin
Faculty of Engineering and Computing, Coventry University, Priory Street, Coventry CV1 5FB, UK
{aa117, n.yaacob, a.n.godwin}@coventry.ac.uk
Abstract ⎯ Grid technologies allow for policy-based resource sharing. The sharing of these resources can be subject to constraints such as a committed consumption limit or a timing basis. Ad-hoc grid environments, where non-dedicated resources are shared, introduce several interesting challenges in terms of scheduling, monitoring, auditing resource consumption, and the migration of tasks. This work proposes a possible framework for non-dedicated ad-hoc grid environments based on Computational Reflection's Meta object programming paradigm. The framework uses the Grid Meta Object with enhanced properties to aid in monitoring and migration related tasks.
In the above scenario a possible strategy may be to immediately abort the grid job, since the committed resource level has been reached. However, an immediate abortion will result in an incomplete grid job. A fair policy in such a scenario needs to treat the contract of resource sharing as being either hard or soft. We refer to hard resource sharing as a fixed resource commitment and consumption model where no further resource access is possible beyond the committed level. On the other hand, soft resource sharing refers to a less restrictive resource sharing variant allowing consumption beyond the committed level.
Index Terms- Grid Computing, Middleware, Meta Object, Monitoring, Migration
An employable grid computing model for both the hard and soft resource sharing requires monitoring and auditing of committed resources. Apart from monitoring, such models also need to provide support for job migration so that after consuming committed resources on the execution node the job may migrate to another node if possible.
1. INTRODUCTION

Sharing of resources within grid environments [1, 2] can be based on various computing factors such as CPU cycles, memory consumption and disk usage. With the recent adoption of grid technology in various non-dedicated computing environments, such as mobile computing devices, this resource sharing can also be based upon non-computing factors such as time. An example of this is an enterprise using its Local Area Network computers within a grid environment during non-business or some other dedicated hours. Sharing of this nature introduces several interesting challenges in terms of controlling and managing resource allocation and consumption. If a grid job requests further resource consumption and the node has reached the committed grid resource level, then a suitable strategy needs to be introduced.
Migration of a job to another node requires the identification of a suitable grid node and validation of the job's pre-requisites on the available node. These pre-requisites may be the availability of a specific runtime or a specific version of an operating system. The status of the job also needs to be archived, and the job needs to be started from the last point of migration. The rest of the paper discusses how computational reflection aids in addressing the contractual resource sharing issues within ad-hoc grid environments: Section 2 provides a discussion on ad-hoc resource sharing. Section 3 addresses the properties of the Grid Meta object and how they can be used in a larger framework. Section 4 comments on possible costs and overheads of the proposed framework, and finally the conclusions and future work are presented in Section 5.
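The pre-requisite validation mentioned above can be sketched as a simple match between what a job requires and what a candidate node offers. The class name, the string-map representation and the example keys (runtime, os) are our assumptions, not part of the paper's framework:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: before migration, check that the candidate node
// offers every capability the job requires, e.g. a specific runtime or
// operating system version.
class PrerequisiteChecker {
    // required: e.g. {"runtime" -> "jre-1.5", "os" -> "Solaris-9"}
    static boolean satisfies(Map<String, String> required,
                             Map<String, String> offered) {
        for (Map.Entry<String, String> req : required.entrySet()) {
            String have = offered.get(req.getKey());
            if (!req.getValue().equals(have)) {
                return false; // candidate node misses a pre-requisite
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, String> required = new HashMap<>();
        required.put("runtime", "jre-1.5");
        required.put("os", "Solaris-9");

        Map<String, String> node = new HashMap<>();
        node.put("runtime", "jre-1.5");
        node.put("os", "Solaris-9");

        System.out.println(satisfies(required, node)); // prints "true"
    }
}
```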
2. RESOURCE SHARING ISSUES IN AD-HOC GRID ENVIRONMENTS

A major issue within ad-hoc grid environments is that the resources are shared on a temporal basis. For example, a non-dedicated grid user may become a member of the grid community for a specific duration of time, willing to commit a limited quantity of resources. This contributes towards issues relating to the control and consumption of resource availability and monitoring during this ad-hoc grid membership. It also leads to overheads of monitoring and auditing these temporal nodes for their availability and the status of any jobs submitted to them. Due to the ad-hoc nature of the grid membership, a binding contract for resource sharing needs to be addressed in order to secure the availability of the temporal node for the duration of the contract. The grid infrastructure should cater for these contractual needs of the negotiating nodes. These contractual needs can be of a diverse nature; for example, a node may wish to commit a specific amount of disk space or a specific number of CPU cycles during specific hours. These resource sharing constraints need to be addressed by the grid framework.
3. GRID META OBJECT

Computational Reflection [3] and its implementation techniques [4] support the monitoring of objects by allowing a base object to be monitored by a connected Meta object. The Meta object maintains a causal connection with the base object and utilizes a data structure to store the monitoring information. Our implementation of the Grid Meta Object [13] supports the JAVA [6] platform with built-in monitoring support (see Figure 1).
Fig 1. Base and Meta Object with the Implementation Framework

The major components [7, 8 and 9] of the implementation framework are based on JAVA technology to support heterogeneous grid environments. Our work has proposed several enhancements to the traditional Meta object structure to favor grid environments. We have proposed the following enhanced set of properties:

• Ability to maintain base object status information: This is the same as the classic base and Meta object model, where the ability to maintain base object status information is acquired through a suitable structure and a causal connection with the base object.

• Ability to acquire node identification/environment information: An interesting property is the base object's access to its execution environment. This is implicitly known to the base object and is not modeled explicitly. For a grid Meta object, however, the Meta object must have some kind of property to identify its execution node by a unique identifier such as the node's IP. This will assist the migration of the Meta object by identifying its source and destination nodes.

• Ability to send and receive through a broadcast channel: The Meta object should be allowed to send and receive data through some kind of broadcast channel, so that exact replicas of the Meta object can be maintained on other grid nodes for specific purposes.

• Ability to maintain migration and special action flag methods: A Meta object should be allowed to store specific action flag methods and their associated actions, so that, for example, if an exception occurs in the base object or a specific function is called on the base object, the Meta object can perform some special kind of action, such as broadcasting a message to an associated remote Meta object or node, or starting to negotiate the migration process to another node.

• Ability to maintain a specific call and action structure: This is the same case as above, with the only difference that there may be a stack of actions and an alternate action defined.

• Ability to serialize: The Meta object should have the ability to be serialized on streams, so that it can have persistence and its life span is not restricted to the in-memory execution life cycle. Persisting the Meta object allows further analysis of Meta objects after the execution life cycle is finished, and broadcasting the Meta object through streams such as a disk or socket becomes a possibility.

3.1 ADVANTAGES OF GRID META OBJECT

The proposed Meta object properties offer the following advantages to the grid environment:
Fault Tolerance

The base and Meta object reflective model allows full fault tolerance. A Meta object constantly maintains the base object's state-related information and can be used as a check-point mechanism in the event of a base object crash. This model can be further extended beyond a physical machine if the Meta object is allowed to broadcast its status information to another machine.

Dynamic Adaptability

Reflective systems are considered self-aware and self-healing. Systems based on computational reflection can adapt to changes as they happen within the execution boundaries. An example of self-aware and self-healing systems can be reviewed in [5].
In this model a grid Meta object is allocated to a base object representing a grid job. The grid Meta object maintains the base job object's status-related information in such a manner that it can be utilized later. The grid Meta object can be serialized to a suitable object store should the job be aborted due to a resource constraint. The broadcast feature of the Meta object can assist in keeping a remote replica of the executing job's Meta object; both serialization and broadcast can utilize encryption to maintain security. The serialized Meta object can later be used to re-create the base object and set its status to a specific point of interest, thus allowing the job to continue from the last check point at a different node (see Figure 3). Depending on the nature of the submitted grid job and the capabilities of the Meta object bound to that job, several other possible usages of these Meta objects may exist as well.
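As an illustrative sketch (not the paper's implementation; all names here are our own), a minimal serializable Meta object that records its base job's call trace, knows its node, and can be round-tripped through a byte stream for check-pointing might look like:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of a serializable grid Meta object: it keeps the base
// job's status (here, a call trace), identifies its node, and can be
// serialized so the job can be re-created from the last check point.
class GridMetaObject implements Serializable {
    private static final long serialVersionUID = 1L;
    private final String nodeId;                       // e.g. the node's IP
    private final List<String> callTrace = new ArrayList<>();

    GridMetaObject(String nodeId) { this.nodeId = nodeId; }

    // The causal connection: the base object (or an interceptor around
    // it) reports every method call to its Meta object.
    void record(String methodCall) { callTrace.add(methodCall); }

    String nodeId() { return nodeId; }
    String lastCall() { return callTrace.get(callTrace.size() - 1); }

    // Persistence: serialize the Meta object to a byte stream (a disk
    // file or socket in the framework) ...
    byte[] toBytes() {
        try (ByteArrayOutputStream bos = new ByteArrayOutputStream();
             ObjectOutputStream out = new ObjectOutputStream(bos)) {
            out.writeObject(this);
            out.flush();
            return bos.toByteArray();
        } catch (IOException e) { throw new RuntimeException(e); }
    }

    // ... and re-create it later, possibly on a different node.
    static GridMetaObject fromBytes(byte[] bytes) {
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (GridMetaObject) in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        GridMetaObject m = new GridMetaObject("192.168.0.7");
        m.record("compute(step=1)");
        GridMetaObject restored = fromBytes(m.toBytes());
        // prints "192.168.0.7 compute(step=1)"
        System.out.println(restored.nodeId() + " " + restored.lastCall());
    }
}
```

In the event of a base-object crash, the last serialized snapshot can be deserialized on another node and the job resumed from that check point.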
Persistence of State-Specific Monitoring Data

Meta object serialization can allow archiving of state information for various analytical purposes. The serialization on streams can be used for broadcast and remote monitoring of Meta objects within the grid environment.

Backup Objects

Utilizing computational reflection, backup Meta objects can be created, bringing reliability and fault tolerance to classes of computer application where execution reliability is a must.

3.2 USAGE OF GRID META OBJECT

The implemented Meta object supports the Java platform, and with the incorporated properties it can be used within a larger distributed computing framework. The proposed usage model allows the Meta object to be used by grid job management components such as the scheduler and its sub-components such as the monitor (see Figure 2).
Fig. 2. Grid Meta Object utilization within a larger framework

Fig. 3. Serialized Grid Meta Object utilization within the grid environment

4. COST AND OVERHEADS

The Grid Meta object adds some overheads in the form of extra CPU and memory usage. These overheads are attributable to monitoring the base object by intercepting the method calls to it and maintaining this information within a suitable data structure. For certain types of grid jobs the Meta object can be configured to maintain only the last call rather than the complete call trace, which results in a smaller memory overhead; such usage is only suitable for applications which can continue from the last method call. We are currently investigating these overheads for different types of data- and computation-intensive grid applications.
5. CONCLUSION AND FUTURE WORK

This paper has presented a proposed model for efficient Grid Meta object usage as part of a larger framework to address the issues related to contractual resource sharing within ad-hoc Grid platforms. The work is a continuation of our progress as reported in [10, 11, 12 and 13]. It attempts to identify suitable Meta object properties to favor grid computing environments, and argues that Meta objects should be allowed their own specific properties beyond their conventional usage limitations. We are currently investigating the overheads and possibilities of Grid Meta object usage for various kinds of data- and computation-intensive grid applications.
REFERENCES

[1] I. Foster, C. Kesselman, and S. Tuecke, "The Anatomy of the Grid: Enabling Scalable Virtual Organizations", International Journal of High Performance Computing Applications, vol. 15, pp. 200-222, 2001.
[2] I. Foster and C. Kesselman, "Globus: A Metacomputing Infrastructure Toolkit", International Journal of Supercomputer Applications, vol. 11, no. 2, pp. 115-128, 1997.
[3] P. Maes, "Concepts and Experiments in Computational Reflection", OOPSLA Conference on Object-Oriented Programming: Systems and Applications, pp. 147-154, 1987.
[4] N. Yaacob, "Reflective Computation in Concurrent Object-based Languages: A Transformational Approach", PhD Thesis, University of Exeter, UK, 1999.
[5] G. S. Blair, G. Coulson, L. Blair, H. Duran-Limon, P. Grace, R. Moreira and N. Parlavantzas, "Reflection, Self-Awareness and Self-Healing", Proceedings of the Workshop on Self-Healing Systems '02, Charleston, SC, 2002.
[6] James Gosling, Bill Joy, Guy Steele and Gilad Bracha, "The Java Language Specification, Third Edition", Addison-Wesley, Reading, Massachusetts, 2004.
[7] S. Chiba, "Load-time Structural Reflection in Java", ECOOP 2000 – Object-Oriented Programming, LNCS 1850, Springer Verlag, pp. 313-336, 2000.
[8] A. W. Keen, T. Ge, Justin T. Marin, R. Olsson, "JR: Flexible Distributed Programming in Extended Java", ACM Transactions on Programming Languages and Systems, vol. 26, no. 3, pp. 578-608, 2004.
[9] www.hyperic.com/products/sigar.html
[10] Yaacob N., Godwin A.N. and Alam S., "Reflective Approach to Concurrent Resource Sharing for Grid Computing", Proceedings of the 2005 International Conference on Grid Computing and Applications, GCA '05, Las Vegas, USA, ISBN: 1-932415-57-2.
[11] Norlaily Yaacob, Anthony Godwin and Syed Alam, "Meta Level Architecture for Management of Resources in Grid Computing", International Conference on Computational Science and Engineering (ICCSE 2005), June 2005, Istanbul, pp. 299-304, ISBN: 975-561-266-1.
[12] N. Yaacob, A. Godwin and S. Alam, "Resource Monitoring in Grid Computing Environment Using Computational Reflection and Extended JAVA – JR", 2nd International Computer Engineering Conference: Engineering the Information Society, ICENCO'2006, Faculty of Engineering, Cairo University, Cairo, Egypt, December 26-28, 2006.
[13] Yaacob N., Godwin A.N. and Alam S., "Developing a Reflective Framework for Resource and Process Monitoring within Grid Environments", Proceedings of the 2007 International Conference on Grid Computing and Applications, GCA '07, Las Vegas, USA.
Enhanced File Sharing Service among Registered Nodes in Advanced Collaborating Environment for Efficient User Collaboration

Mohammad Rezwanul Huq, Prof. Dr. M. A. Mottalib, Md. Ali Ashik Khan, Md. Rezwanur Rahman, Tamkin Khan Avi
Department of Computer Science and Information Technology (CIT), Islamic University of Technology (IUT), Gazipur, Dhaka, Bangladesh.
[email protected],
[email protected],
[email protected],
[email protected],
[email protected], Abstract – Now-a-days, it is extremely necessary to provide file sharing and adaptation in a collaborative environment. Through file sharing and adaptation among users at different nodes in a collaborative environment, highest degree of collaboration can be achieved. Some authors have already proposed file sharing and adaptation framework. As per their proposed framework, users are allowed to share adapted files among them, invoking their file sharing and adaptation service built on the top of advanced collaborating environment. The goal of this adaptation approach is to provide the best possible adaptation scheme and convert the file according to this scheme keeping in mind the user’s preferences and device capabilities. In this paper, we propose some new features for file sharing and adaptation framework to have faster and more efficient collaboration among users in advanced collaborating environment. We are proposing a mechanism that enables the other slave ACE nodes along with the requesting node to share the adapted files, where the nodes should have device capabilities similar to the requesting slave ACE node as well as they should be registered to the master ACE node. This approach will radically reduce not only the chance of redundant requests from different slave ACE nodes for sharing the same file in adapted form but also the adaptation overhead for adapting and sharing the same file for different slave ACE nodes. The registered nodes would also have the privilege to upload files to the server via master node. We distinguish each file according to their hit ratio that has been known from historical data so that only the frequently accessed files can be shared automatically among all other authenticated slave ACE nodes. This approach leads towards better collaboration among the users of ACE. Keywords: File Sharing, Registered Node, Hit Ratio, User Collaboration
1 Introduction
The concept of an advanced collaborating environment is essential to provide interactive communication among a group of users. Traditional video conferencing has become obsolete nowadays due to advancements in the fields of networking and multimedia technology. The 3R factor, that is, Right People, Right Data and Right Time, is the major concern of ACE, in order to perform a task, solve a problem, or simply discuss something of common interest [1]. Figure 1 depicts the concept of the Advanced Collaborating Environment (ACE), where media, data and applications are shared among participants joining a collaboration session via multi-party networking [2]. Some early prototypes of ACE have mainly been applied to large-scale distributed meetings, seminars or lectures, collaborative work sessions, tutorials, training, etc. [3], [4]. The Advanced Collaborating Environment (ACE) has been realized on top of the Access Grid, which is a group-to-group collaboration environment with an ensemble of resources including multimedia, large-format displays, and interactive conferencing tools. It has very effectively envisioned the implementation of ACE in real-life scenarios. The terms venue server and venue come from the Access Grid multi-party collaboration system [4], [5]. In [6], a file sharing and adaptation framework has been proposed for ACE. Through this framework, users at slave ACE nodes can share adapted files through the master ACE node. The master ACE node has the capability to communicate directly with a venue through the venue client as well as the venue server [6]. A slave ACE node has less capability than a master ACE node in terms of device configuration, and it cannot communicate with the Venue and Venue Server directly [6]. Figure 2 shows the connectivity among master ACE
requests from other slave ACE nodes for sharing the same file, but also the adaptation overhead of adapting the same file at the master node. Moreover, to provide better and more intelligent collaboration among the users by advertising frequently accessed files through multicast messages to registered slave ACE nodes.
Figure 1: Advanced Collaborating Environment (ACE)

nodes, slave ACE nodes and venue server. For file adaptation, a hybrid approach has been mentioned for adapting files which considers the user's preferences as well as the user's device capabilities.

Figure 2: Master and Slave ACE nodes

In our work, we emphasize building some extended features on top of [6] so that the degree of user collaboration can be radically increased. Our extended features will allow any slave ACE node to share adapted files depending on its own device capabilities, where the request has originated from another slave ACE node. Moreover, we provide registered slave ACE nodes with the privilege to upload files to the venue server. To provide intelligent collaboration, only the frequently accessed files will be automatically shared among slave ACE nodes; to serve this purpose, we use the hit ratio of file accesses to distinguish the files into two categories: hot and cold. The rest of the paper is organized as follows. In Section 2, the problem statement is specified. Section 3 discusses related work, followed by the contribution of our work in Section 4. In Section 5, we describe our proposed mechanism, followed by the current implementation status and the different implementation issues to be addressed in Section 6. Finally, we draw conclusions and discuss some of our future work in Section 7.

2 Problem Statement

In this paper, we have tried to enhance the efficiency of the file sharing and adaptation framework described in [6] by introducing slave node registration to facilitate file uploading to the venue server, reducing the redundancy of file adaptation requests, and applying hit ratio analysis for better collaboration. These new features allow users at slave ACE nodes to share files in an efficient and faster way, which will definitely increase user collaboration. Thus our problem statement may be summarized as follows: to provide file uploading privileges as well as efficient and faster file sharing capabilities to the users at slave ACE nodes by reducing not only the chance of redundant requests but also the redundant execution of the adaptation process and the communication overhead. And finally, we devise a mechanism to facilitate the users with an intelligent and easy experience of collaboration based on the hit ratio of files.

3 Related Work

Much research has been initiated in the area of context-aware computing in the past few years. Many projects have been initiated for developing interactive collaboration; these projects enable users to collaborate with each other to share files and other media types. Gaia [7], [8], [9] is a distributed middleware infrastructure that manages seamless interaction and coordination among software entities and heterogeneous networked devices. A Gaia component is a software module that can be executed on any device within an Active Space. Gaia provides a number of services, including a context service, an event manager, a presence service, a repository and a context file system. On top of these basic services, Gaia's application framework provides mobility, adaptation and dynamic binding of components. Aura [10], [11] allows a user to migrate an application from one environment to another such that the execution of these tasks maximizes the use of available resources and minimizes user distraction. Two middleware building blocks of Aura are Coda and Odyssey. Coda is an experimental file system that offers seamless access to data [10] by relying heavily on caching. Odyssey includes application-aware adaptations that offer energy-awareness and bandwidth-awareness to extend battery life and improve multimedia data access on mobile devices. The work on user-centric content adaptation [12] proposed a decision engine that is user-centric with QoS awareness, which can automatically negotiate the appropriate adaptation decision to use in the synthesis of an optimal adapted version. The decision engine looks for the best trade-off among various parameters in order to reduce the loss of quality in various domains, and has been designed for content adaptation in mobile computing environments.

The work described in [6] proposed a file sharing and adaptation framework in which users are allowed to share adapted files among them, invoking a file sharing and adaptation service built on top of an advanced collaborating environment. The goal of this adaptation approach is to provide the best possible adaptation scheme and convert the file according to this scheme, keeping in mind the user's preferences and device capabilities. But the proposed framework did not allow slave ACE nodes to upload files. Moreover, only the requesting slave ACE node receives the adapted files, while other ACE nodes may initiate another request to share the same file, which incurs a lot of processing overhead at the master ACE node. We have tried to extend the service provided by [6] in our work. In later sections, we discuss our proposed mechanism to enhance the file sharing service proposed in [6] in detail.
Thus, the comparison shows that our work will definitely encompass a meaningful advancement over the aforementioned work in the issues of increasing slave node activities to an acceptable extent, reducing the redundant complexity of data adaptation requests, decreasing the communication overhead at the master ACE node with both venue Server and Slave nodes and enhancement of overall collaboration by considering the hit ratio.
4 Our Contribution
To the best of our knowledge, there is not much work on data adaptation and file sharing in advanced collaborating environments, though related fields have been explored, as described in the previous section. The file sharing and adaptation framework illustrated in [6] includes a file sharing service and a data adaptation service; the file sharing service is demonstrated through the realization of the data adaptation service. These two services are necessary to provide effective collaboration among users in an advanced collaborating environment, but some problems are associated with this approach. For example, in the existing framework the slave ACE nodes cannot upload any data or files to the venue server, which hinders maximum collaboration among ACE nodes. Another drawback is that, if there are multiple requests from different nodes with compatible device capabilities, the data adaptation technique must be repeated several times, which is pure overhead for the overall system. Furthermore, the existing framework has no automated features for enhancing user satisfaction. Our target is to provide extended features on top of the existing framework so that the highest degree of collaboration among users can be realized. In this paper, we identify these problems and provide effective solutions to address them. To enhance user collaboration, we propose node registration, which ensures a minimal privilege for slave ACE nodes to upload files to the venue server via the master node. We also propose a data analysis approach based on user preferences along with device capability records, which drastically reduces the chance of redundant file adaptation requests.
We believe that our effort will help overcome the deficiencies of this framework and break new ground for further advancement in this field of research.
5 Proposed Mechanism
In this section, we describe our proposed mechanism in detail. Before going into the details, we define some related terminology.
• User registration: In ACE, user registration is normally done by providing an e-mail address along with other necessary information.
• Node registration: A feature proposed in this paper, whereby the network admin of a venue server authenticates some of the slave ACE nodes as registered nodes. These nodes have the privilege of uploading files to the venue server.
• Requesting node: The slave ACE node that requests a file from the master ACE node.
• Hit ratio: The ratio between the number of times a file is requested and the total number of file requests. For any particular file f1, the hit ratio is Hf1 = (number of times f1 is requested) / (total number of file requests).
• File counter: A process that counts the number of times a file is requested.
• Threshold value: A dynamic value used in our framework to rank files according to their hit ratio.
• Hot listed files: Files for which the number of requests exceeds the threshold value.
• Cold listed files: Files for which the number of requests remains below the threshold value.
Figure 3 : Overall Block Diagram of Proposed File Sharing Service
• File cache: A temporary storage where files are kept for a limited period of time after successful adaptation.
Figure 3 depicts the overall block diagram of the proposed file sharing service. We explain the overall mechanism by decomposing it into several parts, which are described in the following subsections.
5.1 File Uploading Mechanism
First, the user connects from a slave ACE node to the master ACE node by providing an e-mail address and the master ACE node address. The system then checks whether the user is already registered. If not, the system checks whether the node fulfills the minimum status required for registration; if it is eligible, the network admin registers the node for uploading files, otherwise the node is not permitted to upload. If the node is already registered, it may request the master ACE node to upload one or more files to the venue server. Finally, the master ACE node takes the files from the slave node and uploads them to the server. Figure 4 depicts the file uploading mechanism of the system.
Figure 4 : Flow Chart of File Uploading Mechanism
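The flow in Figure 4 can be summarized in a short Python sketch. The function and its parameters (registered_nodes, min_status, and the register/upload callbacks) are hypothetical stand-ins for the actual ACE components:

```python
from types import SimpleNamespace

def handle_upload(node, files, registered_nodes, min_status, register, upload):
    """Sketch of the file uploading flow: register the node if eligible,
    then let the master ACE node push the files to the venue server."""
    if node.node_id not in registered_nodes:
        if node.status < min_status:
            return "upload denied"         # node not eligible for registration
        register(node)                     # network admin registers the node
        registered_nodes.add(node.node_id)
    for f in files:
        upload(f)                          # master node uploads to venue server
    return "uploaded %d file(s)" % len(files)

# Toy usage with stub callbacks
registered = set()
node = SimpleNamespace(node_id="slave-1", status=5)
result = handle_upload(node, ["talk.ppt"], registered, min_status=3,
                       register=lambda n: None, upload=lambda f: None)
print(result)   # uploaded 1 file(s)
```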
5.2 File Sharing Mechanism
First, the user connects from a slave ACE node to the master ACE node by providing an e-mail address, a user preference, and the master ACE node address. After connecting, the slave ACE node requests a file, and the master ACE node increases the hit counter for that particular file. After this, two checks are performed.
The system checks whether the hit counter exceeds the threshold value. If so, an advertisement for sharing this adapted file is multicast to all slave nodes with compatible device capabilities. The other check is whether an adapted version of the requested file with the same preference already exists in the cache. If it is available, the file is simply retrieved from the cache and sent; if not, the normal file adaptation approach is followed. Figure 5 depicts the file sharing mechanism of the system.
Figure 5 Flow Chart of File Sharing Mechanism
5.3 Cache Updating Mechanism
At the beginning of this process, the system checks whether the periodic time for updating the cache has elapsed. If so, the master ACE node analyzes the hit list data by comparing each file counter with the threshold value. If the file counter of a file exceeds the threshold value, that file is treated as a hot file and remains in the cache. If the file counter does not exceed the threshold, the file is treated as a cold item and is deleted from the cache. Figure 6 depicts the cache updating mechanism of the system.
Figure 6 Flow Chart of Cache Updating Mechanism
6 Current Implementation Status and Future Issues
The master file sharing service has been implemented as an AG shared application and was therefore written in Python. This module retrieves files from the data store at the venue server and sends each file for appropriate adaptation. Figure 7 shows the user interface for connecting to the master ACE node and choosing a specific file.
Figure 7: Connecting Master ACE Node and Choosing File
The slave file sharing service has been implemented as a stand-alone application, also in Python. It provides the user interface for entering the venue URL and selecting the desired file, and it shows (see Figure 8) the confirmation of successful reception of the adapted file to the users at slave ACE nodes.
Figure 8: Confirmation Window of Successful Reception
The data adaptation service is a stand-alone application. This module has a decision engine that provides the appropriate adaptation scheme for converting the original file. At the beginning, the user enters his/her e-mail address and preferred file type for sharing, then presses the appropriate button for connecting to the master ACE node (see Figure 7). After the connect button is pressed, the master file sharing service connects to the venue data store and retrieves files of the user-specified type. The user then selects one of the files. The specified file is downloaded through the master file sharing service, the decision engine selects the appropriate adaptation method, and the file is passed to the data adaptation service. The function DataAdapter() takes the file and converts it based on the decision provided by the decision engine. The adapted file is then sent to the slave ACE node by the master file sharing service and stored in the local storage of the slave ACE node. Figure 8 shows the confirmation screen indicating that the adapted file has been successfully received by the slave ACE node.
There are several future implementation issues, which we discuss in the remainder of this section. We intend to implement each of these features soon and incorporate them into our current prototype. The basic requirement of the file upload feature is node registration: the network admin registers an eligible slave ACE node to provide it with the capability of uploading files to the venue server. As depicted in Figure 9, the registered_node table has node_id as its primary key and device_id as a foreign key referring to the device_profile table; the device_MAC field of this table is the main factor used to distinguish each machine. For the enhanced file sharing feature, there are two main concerns: whether any adapted version of the file exists in the cache, and whether the device capability of the requesting node matches one of the compatible nodes listed in the table. We explain these concerns one by one.
The existence of an adapted version of a file in the cache depends mainly on the hit ratio based analysis. A dynamic threshold value is used to rank files according to the number of times each file has been requested: if the file counter reaches the threshold value, the file is labeled a hot file, otherwise it is labeled a cold file. The formula for calculating the hit ratio is devised as follows:
• Hit ratio, T = (file counter value / total number of requests arrived) × 100%
If T > 30% for a particular file, we term that file a hot file; hot files are not deleted from the file cache. Files whose T value is below this level are termed cold files. To update the file cache and obtain free space, we need to drain out some cold files: if T < 5% for a particular file, that cold file is deleted from the cache. Note that the deletion of files is done periodically, and this period is a variable measure; for our system we may consider its value equal to one week.
The second concern for the file sharing feature is that, for compatible device capabilities, we rely on the lowest possible capability. Whenever a particular adapted file is in the file cache, the system searches for the slave nodes that are compatible with the device capability used for adapting that file. A node with device capabilities D = (d1, d2, …, dn) is acceptable if and only if di ≥ vi for i = 1…n, where di is the device configuration in the i-th dimension and vi is the lowest possible device capability in the i-th dimension.
We will try to implement the aforementioned features as soon as possible, considering all implementation alternatives. Our proposed enhanced features can be useful for efficient file sharing. We will implement our modules in Python so that they can be easily plugged into ACE, which has been built on top of Access Grid.
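A minimal Python sketch of the policy above (hot if T > 30%, evict if T < 5%, and the component-wise compatibility test di ≥ vi). The helper names are ours, and the thresholds are the example values from the text:

```python
def hit_ratio_percent(file_counter, total_requests):
    # T = (file counter value / total number of requests) x 100%
    return 100.0 * file_counter / total_requests if total_requests else 0.0

def is_hot(file_counter, total_requests):
    # Files with T > 30% are hot and stay in the file cache
    return hit_ratio_percent(file_counter, total_requests) > 30.0

def update_cache(cache, counters, total_requests):
    """Periodic cleanup: files with T < 5% are cold and drained from the cache."""
    for name in list(cache):
        if hit_ratio_percent(counters.get(name, 0), total_requests) < 5.0:
            del cache[name]

def is_compatible(device, minimum):
    """Node D = (d1, ..., dn) is acceptable iff d_i >= v_i in every dimension."""
    return all(d >= v for d, v in zip(device, minimum))

cache = {"slides.pdf": b"...", "clip.mp4": b"..."}
counters = {"slides.pdf": 40, "clip.mp4": 2}
update_cache(cache, counters, total_requests=100)
print(sorted(cache))                           # ['slides.pdf']  (clip.mp4 had T = 2%)
print(is_hot(40, 100))                         # True
print(is_compatible((1024, 600), (800, 600)))  # True
```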
Figure 9: Schema Design for Backend Database
7 Conclusion and Future Work
In this paper, we have presented enhanced features for the file sharing and data adaptation framework in ACE. Our proposed features enable users at slave ACE nodes to share adapted files faster and more intelligently. Moreover, the file uploading mechanism allows users at slave ACE nodes to upload files to the venue server via the master ACE node. Together, these features realize the improved file sharing service. There is much interesting work to be done in the near future on efficient file sharing and adaptation. We plan to implement P2P file sharing among the slave ACE nodes. We may also explore issues such as automated device identification for user and node registration, as well as user requirement prediction, which will ensure a higher degree of user collaboration. We believe our proposal for an enhanced file sharing service, and the prototype implementation that realizes it, will be considered a leading work in this domain.
References
[1] B. Corri, S. Marsh and S. Noel, "Towards Quality of Experience in Advanced Collaborative Environments", in Proc. of the 3rd Annual Workshop on Advanced Collaborative Environments, 2003.
[2] Sangwoo Han, Namgon Kim and JongWon Kim, "Design of Smart Meeting Space based on AG Service Composition", AG Retreat 2007, Chicago, USA, May 2007.
[3] R. Stevens, M. E. Papka and T. Disz, "Prototyping the Workspaces of the Future", IEEE Internet Computing, pp. 51-58, 2003.
[4] L. Childers, T. Disz, R. Olson, M. E. Papka, R. Stevens and T. Udeshi, "Access Grid: Immersive Group-to-Group Collaborative Visualization", in Proc. of the Immersive Projection Technology Workshop, 2000.
[5] Access Grid, http://www.accessgrid.org/
[6] Mohammad Rezwanul Huq, Young-Koo Lee, Byeong-Soo Jeong and Sungyoung Lee, "Towards Building File Sharing and Adaptation Service for Advanced Collaborating Environment", in Proc. of the International Conference on Information Networking (ICOIN 2008), Busan, Korea, January 23-25, 2008.
[7] Anand Ranganathan and Roy H. Campbell, "A Middleware for Context-Aware Agents in Ubiquitous Computing Environments", in Proc. of the ACM/IFIP/USENIX International Middleware Conference, Brazil, June 2003.
[8] The Gaia Project, University of Illinois at Urbana-Champaign, http://choices.cs.uiuc.edu/gaia/, 2003.
[9] M. Roman, C. Hess, R. Cerqueira, A. Ranganathan, R. Campbell and K. Nahrstedt, "A Middleware Infrastructure for Active Spaces", IEEE Pervasive Computing, vol. 1, no. 4, 2002.
[10] M. Satyanarayanan, Project Aura, http://www2.cs.cmu.edu/~aura/, 2000.
[11] M. Satyanarayanan, "Mobile Information Access", IEEE Personal Communications, http://www2.cs.cmu.edu/~odyssey/docdir/ieeepcs95.pdf, Feb. 1996.
[12] W. Y. Lum and F. C. M. Lau, "User-Centric Content Negotiation for Effective Adaptation Service in Mobile Computing", IEEE Transactions on Software Engineering, vol. 29, no. 12, Dec. 2003.
Overbooking in Planning Based Scheduling Systems∗
Georg Birkenheuer #1, Matthias Hovestadt ∗2, Odej Kao ∗3, Kerstin Voss #4
# Paderborn Center for Parallel Computing, Universität Paderborn, Fürstenallee 11, 33102 Paderborn, Germany
1 [email protected], 4 [email protected]
∗ Technische Universität Berlin, Einsteinufer 17, 10587 Berlin, Germany
2 [email protected], 3 [email protected]
Abstract Nowadays cluster Grids encompass many cluster systems with possibly thousands of nodes and processors, offering compute power that was inconceivable only a few years ago. To attract commercial users to these environments, the resource management systems (RMS) have to be able to negotiate Service Level Agreements (SLAs), which define all service quality requirements of a job, e.g. deadlines for job completion. Planning-based scheduling seems well suited to guarantee the SLA adherence of these jobs, since it builds up a schedule for the entire future resource usage. However, it demands that the user give runtime estimates for his job. Since many users are not able to give exact runtime estimates, it is common practice to overestimate, thus reducing the number of jobs that the system is able to accept. In this paper we describe the potential of overbooking mechanisms for coping with this effect.
Keywords: Grid-Scheduling, Overbooking, Resource Management, SLA
1 Introduction
∗ The authors would like to thank the EU for partially supporting this work within the 6th Framework Programme under contract IST-031772, Advanced Risk Assessment and Management for Trustable Grids (AssessGrid).
Grid computing provides computing power for scientific and commercial users. Following the common evolution in computer technology, system and network performance have constantly increased. The latest step in this
process was the introduction of multiple cores per processor, making Grid nodes even more powerful. This evolutionary process particularly affects the scheduling components of the resource management systems used for managing cluster systems. On the one hand, the increasing number of nodes, processors, and cores results in an increased degree of freedom for the scheduler, since it has more options for placing jobs on the nodes (and cores) of the system. On the other hand, the requirements of users have also changed. Commercial users ask for contractually fixed service quality levels, e.g. the adherence of deadlines; hence, the scheduler has to respect additional constraints at scheduling time. Queuing is a technique used in many currently available resource management systems, e.g. PBS [1], LoadLeveler [2], Grid Engine [3], LSF [4], or Condor [5]. Since queuing-based RMS only plan for the present, it is hard for them to provide guarantees on future QoS aspects. Planning-based RMS, in contrast, make functionalities like advance reservations trivial to implement. If a new job enters the system, the scheduling component of the RMS tries to place the new job into the current system schedule, taking into account aspects like project-specific resource usage limits, priorities, or administrative reservations. In planning-based systems it is mandatory for the user to specify the runtime of his job; when negotiating Service Level Agreements, this capability is essential for the provider's decision making process. As an example, using fixed reservations, specific resources can be reserved in a fixed time interval. In addition to plain queuing, the Maui [6] scheduler also provides planning capabilities, and a few other RMS like OpenCCS [7] have been developed as planning-based systems from scratch.
However, (fixed) reservations in planning-based systems potentially result in a high level of fragmentation of the system schedule, preventing optimal resource utilization and workload. Moreover, users tend to overestimate the runtime of their jobs, since a planning-based RMS terminates jobs once their user-specified runtime has expired. This termination is mandatory if succeeding reservations are scheduled for execution on these resources. Overestimation of job runtime implies that the assigned resources become available earlier than expected by the scheduler, i.e. at time tr instead of tp. Currently, mechanisms like backfilling with new jobs or rescheduling (starting an already planned job earlier, if possible) are initiated to fill the gap between tr and the planned start time ts ≥ tp of the succeeding reservation. Due to conflicts with the earliest possible execution time, moving arbitrary succeeding jobs to an earlier starting time might not be possible. In particular, the probability of executing a job earlier might be low if users have strict time intervals for job execution, since a planning-based scheduler rejects job requests which cannot be planned according to the time and resource constraints in the schedule. An analysis of cluster logfiles revealed that users overestimated the runtime of their jobs by a factor of two to three [8]. For the provider this might result in poor workload and throughput, since computing power is wasted if backfilling or rescheduling cannot start other jobs earlier than initially planned. To prevent poor utilization and throughput, overbooking has proven its potential for increasing system utilization and the provider's profit in various fields of application. As a matter of fact, overbooking results in resource usage conflicts if the user-specified runtime turns out to be realistic, or if the scheduler works with an overestimation presumption that is too high.
To compensate for such situations, the suspension and later restart of jobs are important instruments of the RMS. To suspend a running job without losing already performed computation steps, the RMS makes a snapshot of the job, i.e. it stores the job's process environment (including memory, messages, registers, and program counter), so that the job can be migrated to another machine or restarted at a later point in time. In the EC-funded project HPC4U [9] the necessary mechanisms to generate checkpoints and migrate jobs have been integrated into the planning-based RMS OpenCCS. These fault-tolerance mechanisms are the basis for the profitable overbooking approach presented in this paper, since stopping a job does not imply losing computation steps already performed. Hence, gaps between advance reservations can be used by jobs which will finish before the next reservation has to start. In the next section we discuss related work, followed by our ideas for overbooking, which are described in Section 3. In Section 4 we conclude the paper with a summary of our ideas and plans for future work.
2 Related Work
The idea of overbooking resources is a standard approach in many fields of application such as flight, hospital, or hotel reservations. Overbooking beds, flights, etc. is a consequence of the fact that a specific percentage of reservations go unused, i.e. usually more people reserve hotel rooms [10] or buy flight tickets [11, 12] than actually appear to use their reservations. The examples of hotels and airlines illustrate the idea we follow in the provisioning of compute resources. Overbooking in the context of computing Grids differs slightly from those fields of application, since there the assumption is made that fewer customers utilize their reservations than booked. In Grid computing, all jobs that have been negotiated will be executed; however, in planning-based systems users significantly overestimate the job duration. Comparing the usage of a compute resource and a seat in an aircraft is not meaningful, since generally in computing Grids no fixed intervals for resource utilization exist, whereas a seat in an aircraft cannot be occupied after the aircraft has taken off. As a consequence, results and observations from overbooking in the classical fields of application cannot be reused in Grid scheduling. As a non-classical field of application, [13] presents an overbooking approach for web platforms; however, its challenges also differ from the Grid environment. In the Grid context, considering overbooking approaches is most sensible for planning-based scheduling, since in queuing-based systems even the runtime would have to be estimated, introducing an additional uncertainty. Other work concerning overbooking in Grid or HPC environments is rare. In the context of Grid or HPC scheduling, the benefits of using overbooking are pointed out, but no solutions are provided [14, 15]. Overbooking is also foreseen in a three-layered protocol for negotiation in Grids [16].
Here, the restriction is made that overbooking is only used for multiple reservations for workflow sub-jobs, which are made by the negotiation protocol for optimal workflow planning.
2.1 Planning Approaches
Some work has been done in the scope of planning algorithms. Before showing in Section 3 how overbooking can be integrated, we describe different approaches already developed.
2.1.1 Geometry-based approaches
Theoretical approaches for planning-based systems identify scheduling as a special case of bin packing: the width of a bin is defined as the number of nodes generally available, and its height equals the time the resource can be
used. As the total usage time for an arbitrary number of jobs does not end, the height of the bin is infinite; consequently it is not a bin but rather a strip. Jobs are considered as rectangles having a width equal to the number of required nodes and a height equal to the execution time estimated by the user. The rectangles have to be positioned in the strip in such a way that the distances between rectangles are minimal and the jobs do not overlap. Since strip packing is an NP-hard problem, several algorithms have been developed that work with heuristics and are applicable in practice. Reference [17] gives a good overview of strip packing algorithms. Strip packing algorithms come in two kinds, online and offline. An offline algorithm has a priori information about all jobs to be scheduled, whereas an online algorithm cannot know which jobs will arrive in the future. It is obvious that offline algorithms can achieve better utilization results, since all jobs are known and can be scheduled by comparison with each other. The approaches can be divided into several main areas: bottom-left algorithms, which try to put a new job as far toward the bottom of the strip and as far left as possible; level-oriented algorithms [18]; split algorithms [18]; shelf algorithms [19]; and hybrid algorithms, which are combinations of the above.
2.1.2 Planning for Clusters
In practice, most planning-based systems use first-come first-serve (FCFS) approaches. Grid scheduling has to use an online algorithm, and consequently planning optimal schedules is not possible. Each job is scheduled as soon as possible according to the current schedule (containing all jobs previously scheduled) as well as its resource and time constraints. The FCFS approach might lead to gaps that could be prevented if jobs had been scheduled in a different order. To increase system utilization, backfilling [20] has been introduced to avoid such problems. Conservative backfilling follows the objective of filling free gaps in the schedule produced by FCFS planning without delaying any previously planned job. Simulations show that the overall utilization of systems is increased using backfilling strategies. If delays of not-yet-started jobs are acceptable, the more aggressive EASY backfilling [21] can further improve system utilization. However, [22] shows that for systems with high load the EASY approach is not better than conservative backfilling. Furthermore, EASY backfilling has to be used with caution in systems guaranteeing QoS aspects, since delays of SLA-bound jobs might lead to SLA violations implying penalties. Concluding, much work has been done in the scope of planning-based scheduling. Good resource utilization in Grid systems can be achieved using backfilling. However, applying conservative backfilling does not result in a 100% workload, since only gaps can be assigned to jobs whose duration is less than the gap length. The more aggressive EASY backfilling strategy does not necessarily provide better utilization of the system and implies hazards for SLA provisioning. Combining conservative backfilling with overbooking should further increase system utilization and does not affect the planned schedule. Consequently, using these strategies in combination has no disadvantages for non-overbooked jobs and offers the possibility to schedule more jobs than with a simple FCFS approach.
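To make the conservative backfilling idea concrete, here is a toy single-node sketch in Python: a new job is placed in the earliest gap of the existing schedule that fits its estimated runtime, without moving any previously planned reservation. The real problem is two-dimensional (nodes × time); this one-node simplification is our own illustration, not the algorithm of [20]:

```python
def backfill(schedule, duration, horizon):
    """Place a job of the given duration in the earliest gap of a sorted list
    of (start, end) reservations on one node; return its start time or None."""
    t = 0
    for start, end in sorted(schedule):
        if start - t >= duration:   # gap before this reservation fits the job
            return t
        t = max(t, end)
    if horizon - t >= duration:     # room at the tail of the schedule
        return t
    return None                     # cannot be placed without delaying a job

plan = [(0, 5), (9, 14)]            # two existing reservations on the node
print(backfill(plan, 4, 24))        # 5  (fits in the 5..9 gap)
print(backfill(plan, 6, 24))        # 14 (gap too small, goes to the tail)
```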
3 Planning-Based Scheduling and Overbooking
This chapter explains the basic ideas to use overbooking in planning-based HPC systems. A user sends an SLA bound job request to the system. The SLA defines job type (batch or interactive), number r of resources required, estimated runtime d, as well as an execution window [tstart , tend ], i.e. earliest start-time and latest completion time. The planning-based system can determine before agreeing the SLA whether it is possible to execute the job according to time and resource constraints. Fixed Reservations Planning-based scheduling is especially beneficial if users are allowed to define time-slots (r, h) in the SLA for interactive sessions, i.e. reserving compute nodes and manually start (multiple) jobs during the valid reservation time. The difference from an interactive session and a usual advance reservation is that the reservation duration equals tend − tstart . For example r = 32 nodes should be reserved from h = [11.12.08 : 900 , 11.12.08 : 1400 ] for a duration of h = 5 hours. Such so called interactive or fixed reservations increase the difficulty of the planning mechanism as these are fixed rectangles in the plan and cannot be moved. This will have worse effects on the system utilization than planning only advance reservations less strict timed. Consequently, supporting fixed reservations step up the demand for additional approaches like overbooking to ensure a good system utilization. However, such fixed reservations appreciate the value of using Grid computing for end-users if these have either interactive applications or need to run simulations exactly on-time, like for example for presentations. For example, a resource management system (RMS) operates a cluster with 32 nodes and the typical jobs scheduled need 32 nodes and run 5 hours. During the day researchers make two fixed reservations from 9am to 2pm and from 2pm to 7pm. All other jobs are scheduled as batch jobs. 
In this scenario, during the night, i.e. in the 14 hours between the fixed reservations, only two 5-hour batch jobs can be scheduled, since only those can be completed in full. Consequently, the cluster would be idle for 4 hours. To achieve
Int'l Conf. Grid Computing and Applications | GCA'08 |
a better system utilization, either the users would have to shift the fixed reservations by one hour every day, which is not feasible because of working hours, or the scheduler can assume that the batch jobs actually finish after 4 hours and 30 minutes; this assumption makes it possible to overbook the resources and execute three batch jobs.
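The arithmetic of this scenario can be checked with a small sketch (the 14-hour gap and the runtimes are taken from the example above; the helper name is ours):

```python
def jobs_in_gap(gap_hours, planned_hours):
    """Number of back-to-back jobs that fit completely into a free gap
    when each job is planned for `planned_hours`."""
    return int(gap_hours // planned_hours)

# Night gap between the two fixed reservations: 7pm to 9am = 14 hours.
assert jobs_in_gap(14, 5.0) == 2   # plain planning: 2 jobs, 4 hours idle
assert jobs_in_gap(14, 4.5) == 3   # overbooked to 4.5 hours: 3 jobs fit
```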
3.1 Overbooking
Overbooking benefits from the fact that users overestimate their jobs' runtime, so that jobs finish before their planned completion time. Taking advantage of this observation increases the system utilization and thereby the provider's profit. This section shows the process of integrating overbooking into planning-based systems that use conservative backfilling as the basic scheduling strategy. First, aspects are highlighted which have to be considered when declaring jobs as usable for overbooking. Afterwards, the overbooking algorithm is described, followed by remarks on fault-tolerance mechanisms, which should prevent job losses in case the actual runtime was estimated wrongly. An example concludes this section.
3.1.1 Runtime Estimations for Overbooking
The prevention of job losses caused by overbooking is one important task of the scheduler. Furthermore, good predictions of the overestimated runtime form the key factor for profitable overbooking. At first glance, users overestimate the job duration on average by two to three times the actual runtime [8]. Unfortunately, job traces show that the distribution of the overestimation seems to be uniform [22] and, depending on the trace, 15% to nearly 30% of jobs are underestimated and have to be killed in planning-based systems after the planned completion time. Obviously, even more uncompleted jobs could be killed when using overbooking. For instance, using the average value of overestimation from the statistical measure (150% up to 500%) for overbooking would lead to conflicts, since half of the jobs would be killed. Instead of exploiting the overestimation to its full extent, it is more profitable to balance the risk of too high an overestimation against the opportunity to schedule an additional job. Hence, it is often beneficial not to subtract the full average overestimation from the estimated runtime when reclaiming time for overbooking. In many cases, using only 10% of the overestimation can be sufficient. Given a uniform distribution, this would force 10% of the overbooked jobs to be lost, but 90% would finish and increase the utilization. These jobs, executed in addition to those of the default strategy, increase the provider's profit. To make good predictions of the overestimated runtime, historical observations on the cluster and of the users
are necessary. A detailed analysis of the functional behavior of the runtime has to be performed, since an average or median value alone is not meaningful enough to reduce the provider's risk of causing job losses. If enough monitoring data is available, the following question arises: how can statistical information about actual job runtimes be used to effectively overbook machines? The answer is to analyze several different aspects. First of all, a user-oriented analysis has to be performed, since users often utilize computing Grids for tasks in their main business or working area, which results in submitting the same applications with similar input again and again [23, 24]. Consequently, analyzing estimated and actual runtimes should reveal whether and by how much runtimes are overestimated. If the results show that a user usually overestimates the runtime by a factor x, the scheduler can use x_o < x of the overestimated time as a time-frame for overbooking. If the statistical analysis shows that the user makes accurate estimations, the scheduler should not use her jobs for overbooking. If the user underestimates the runtime, the scheduler might even plan more time to avoid job kills at the end of the planned execution time. An application-oriented statistical analysis of monitoring data should also be performed in order to identify correlations between overestimations and a specific application. Studies show that automatically determined runtime estimations based on historical information (job traces) can be better than the user's estimation [25, 26, 27]. The condition for their applicability is that enough data is available. In addition to these separate foci, a third analysis should combine the user-oriented and application-oriented approaches in order to identify whether specific users over- or underestimate the runtime when using a specific application. This analysis should result in valuable predictions.
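The user-oriented part of this analysis can be sketched as follows. The function names and the 10% default fraction are illustrative assumptions (the fraction follows the figure discussed above):

```python
from statistics import median

def overestimation_factor(history):
    """Median ratio estimated/actual runtime for one (user, application)
    pair, computed from historical (estimated, actual) runtime pairs."""
    return median(est / act for est, act in history)

def overbooking_window(d, history, fraction=0.10):
    """Part of the estimated runtime d reclaimable for overbooking.
    Only a fraction of the observed overestimation is used, balancing
    the risk of job kills against the chance of an extra job."""
    x = overestimation_factor(history)
    if x <= 1.0:        # accurate or underestimating user: no overbooking
        return 0.0
    overestimated = d - d / x   # expected unused tail of the estimate
    return fraction * overestimated

# A user who historically estimates 3h for jobs that run 1.5h (x = 2):
history = [(180, 90), (200, 100), (150, 75)]
print(overbooking_window(300, history))  # 10% of 150 min -> 15.0
```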
3.1.2 Algorithmic Approach
This section provides the algorithmic definition of the scheduling strategy for conservative backfilling with overbooking. When a new job j with estimated duration d, number of nodes n, and execution window [t_start, t_end] arrives in the system, the following algorithm is used to insert the request into the schedule. The schedule contains anchor points t_s where resources become available and points t_endslot where such free slots end: 1. Select, if available, statistical information about the runtime of the application and the runtime estimations of the user. Compare them with the given runtime of the new job j. If • the estimated runtime d is significantly longer than the standard runtime of the application • or the user tends to overestimate the runtime of jobs
– then mark the application as promising for overbooking. Assuming a uniform distribution, the duration of the job d can be adjusted to d' = d / (1 + maxPoF), where maxPoF is the maximum acceptable probability of failure. The time interval o_j = d − d' can be used for overbooking. • else the job should not be used for overbooking: d' = d, o_j = 0.
2. Find a starting point t_s for job j and set t_s as the anchor point: • Scan the current schedule and find the first point t_s ≥ t_start where enough processors are available to run this job. • Starting from this point, check whether t_s + d ≤ t_end and, if this holds, continue scanning the schedule to ascertain that these processors remain available until the job's expected termination: t_s + d ≤ t_endslot. • If not, – check the validity of t_s + d' ≤ t_end and whether the processors remain available until the job's expected termination reduced by the time usable for overbooking: t_s + d' ≤ t_endslot. If successful, mark the job as overbooked and set the job duration d' = t_endslot − t_s. – If not, check whether there are direct predecessors in the plan which end at t_s and are usable for overbooking. Then reduce t_s by the time a = min_k {o_k} of those jobs k and try again: t_s − a + d' ≤ t_endslot. (In this case, other jobs are also overbooked; nevertheless their runtime is not reduced. If they do not finish earlier than expected, they can still finish, and the overbooked job will be started after their initially planned completion.) If successful, mark the job as overbooked and set the job duration d' = t_endslot − (t_s − a). • If overbooking was not possible, return and continue the scan to find the next possible anchor point. 3. Update the schedule to reflect the allocation of n processors by this job j with the duration d' for the reservation h = [t_s, min(t_endslot, t_s + d)], starting from its anchor point t_s, or earlier at t_s − a. 4. If the job's anchor is the current time, start it immediately.
The algorithm defines that a job k which was overbooked by a job j should be resumed until its completion or its planned completion time if it has not finished at time t_s − a. For SLA-bound jobs, one might question this if fulfilling the SLA of job j were more profitable than fulfilling that of job k. However, the reservation duration of job k is only reduced after curtailing the duration of job j. Hence, the provider has no guarantee that the SLA of job j would not be violated by stopping the execution of job k. Consequently, the scheduler should act conservatively, provide job k with the resources as agreed, and prevent an SLA violation of job k.
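A compact sketch of steps 2 and 3 follows, under simplifying assumptions: one free-slot record per anchor point, all times in minutes, node-count bookkeeping omitted; the names `Slot` and `place_job` are ours, not the paper's.

```python
from dataclasses import dataclass

@dataclass
class Slot:
    """A free slot with enough nodes for job j."""
    t_s: int            # anchor point: time the nodes become available
    t_endslot: int      # time the free slot ends
    pred_overbook: int  # a = min_k o_k over predecessors ending at t_s (0 if none)

def place_job(slots, d, t_start, t_end, o_j):
    """Find an anchor for job j with estimated duration d, execution
    window [t_start, t_end], and overbooking interval o_j = d - d'.
    Returns (anchor, planned_duration, overbooked) or None (reject)."""
    d_prime = d - o_j
    for s in slots:
        t_s = max(s.t_s, t_start)
        # regular conservative backfilling: the full duration fits
        if t_s + d <= t_end and t_s + d <= s.t_endslot:
            return (t_s, d, False)
        if o_j == 0:
            continue
        # overbook j itself: plan only the reduced duration d'
        if t_s + d_prime <= t_end and t_s + d_prime <= s.t_endslot:
            return (t_s, s.t_endslot - t_s, True)
        # additionally overbook predecessors: move the anchor earlier by a
        a = s.pred_overbook
        if a and t_s - a + d_prime <= s.t_endslot:
            return (t_s - a, s.t_endslot - (t_s - a), True)
    return None

# Numbers from the worked example in Section 3.1.4: free slot 3:00-7:00
# (minutes 180-420), predecessor overbookable by 34 min, d = 300, o_j = 34.
print(place_job([Slot(180, 420, 34)], 300, 180, 420, 34))
```

Note that, following the algorithm's rule d' = t_endslot − (t_s − a), the planned duration here becomes 274 minutes, slightly more generous than the 4:26 hours quoted in the worked example.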
3.1.3 Checkpointing and Migration
With overbooking, the likelihood of conflicts increases, and consequently preventing job losses becomes more important. Which contractor (end-user or provider) has to pay the penalty in case of an uncompleted job depends on the responsibilities. In conservative planning-based systems, the user is responsible for an underestimated job runtime. Hence, if the job is killed after resources were provided for the defined runtime, the provider does not have to care about saving the results. The provider is responsible if the requested resources have not been available for the requested time. Hence, if an SLA is violated because of overbooking, the provider has to pay the penalty fee. If the scheduler overbooked a schedule with a job which is planned for less than the user's estimated runtime, and the job has to be killed or displaced for another job, the provider is responsible, since resources were not available as agreed. The RMS can prevent such conflicts by using the fault-tolerance mechanisms checkpointing and migration [9]. If the execution time has been shortened by the RMS, a checkpoint of the job can be generated at the end of the reservation, i.e. a snapshot of its memory, messages, registers, and program counter. The checkpoint can be stored in a file system available in the network. This allows the uncompleted job to be restarted in the next free gap before the job's latest completion time. To be efficient, the free gap should be at least as long as the remaining estimated runtime. Note that filling gaps by partly executing jobs should not be a general strategy, since checkpointing and migration require resources and result in additional costs for the job execution. As a result, planning with checkpointing and migration allows pre-emptive scheduling of HPC systems.
3.1.4 Example
To exemplify the approach, a possible overbooking procedure is explained in the following. Assume a Grid cluster with X nodes, where each job is assigned all X nodes and the same two fixed reservations of 5 hours each are scheduled daily: [7:00 − 12:00, 12:00 − 17:00]. Assume further that the fixed and some usual advance reservations are already scheduled,
directly beneath each other. Thus, the resources are occupied for 20 hours of the schedule: [7:00 − 12:00, 12:00 − 17:00, 17:00 − 21:00, 21:00 − 3:00]. This schedule is the same every day in the week considered. Then another job j with an h = 5 hour reservation for X nodes should be inserted into the schedule within the next two days. However, the resources are only free for 4 hours, [3:00 − 7:00]! Consequently, the scheduler has to reject the job request in case it cannot overbook the schedule: 3:00 + 5 hours = 8:00 ≰ 7:00. We assume that the scheduler has statistics about the estimated and actual runtimes of applications and users, which indicate an overestimation of 40%. Assume the scheduler can take o = d − d' of the statistically overestimated runtime of a job for overbooking, and let the maximum PoF be maxPoF = 13%. For a five-hour job this is 34 minutes. (As d = 5 hours = 300 minutes and maxPoF = 0.13: d' = d / (1 + maxPoF) = 300 / 1.13 ≈ 265.5 minutes ≈ 266 minutes, so o = d − d' = 300 minutes − 266 minutes = 34 minutes.) If we overbook the advance reservation h by 34 minutes, the schedule is still not feasible (3:00 + 4:26 hours = 7:26 ≰ 7:00), since the gap is only 4 hours and j would be given a runtime of 4 hours and 26 minutes. If the predecessor is also overbooked by 34 minutes, each job is reduced by roughly half an hour and reservation h can be accepted: (3:00 − 0:34 hours) + 4:26 hours = 6:52 ≤ 7:00. Thus the job j with a user-estimated runtime d = 5 hours has a duration d' = 4:26 hours and an estimated earliest start time of t_s = 2:26 in the overbooked time-slot [3:00 − 7:00]. The complete schedule is [7:00 − 12:00, 12:00 − 17:00, 17:00 − 21:00, 21:00 − 3:00, 3:00 − 7:00]. Note that overbooking is possible only if neither j itself nor its predecessor (in case of its overbooking) is a fixed reservation. Hence, in our example, avoiding an idle time of 4 hours by overbooking is only possible if job j and the reservation before it, [21:00 − 3:00], are not fixed.
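The runtime adjustment used in this example can be reproduced directly (rounding d' up to whole minutes, as the example does):

```python
import math

def adjusted_duration(d, max_pof):
    """d' = d / (1 + maxPoF), rounded up to whole minutes; the interval
    o = d - d' is the time usable for overbooking."""
    d_prime = math.ceil(d / (1 + max_pof))
    return d_prime, d - d_prime

d_prime, o = adjusted_duration(300, 0.13)  # 5-hour job, maxPoF = 13%
assert (d_prime, o) == (266, 34)           # d' = 4:26 h, o = 34 min
```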
For all reservations which are not fixed, greater flexibility exists if the start times t_s of those jobs can be moved forward towards their earliest start times t_start whenever the preceding jobs end earlier than planned. In this case, if the execution times are dynamically shifted, any overbooked reservation in the schedule could be straightened out before execution. This approach has the big advantage that a reservation h overbooked by 1 hour can finish even if it uses its full estimated time, provided that the predecessor reservations in the schedule require in total 1 hour less time than estimated.
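A minimal sketch of this forward-shifting is given below; times are in minutes, and the tuple layout is an assumption of ours:

```python
def compress(reservations):
    """Pull every non-fixed reservation forward to the moment the
    resources become free, but never before its earliest start t_start
    and never later than planned.
    Each entry: (t_start, planned_start, actual_duration, fixed)."""
    t_free, starts = 0, []
    for t_start, planned, duration, fixed in reservations:
        start = planned if fixed else max(t_start, min(planned, t_free))
        starts.append(start)
        t_free = start + duration
    return starts

# A predecessor finishing after 240 min instead of the planned 300 lets
# the next (overbooked) reservation start 60 minutes early:
assert compress([(0, 0, 240, False), (0, 300, 266, False)]) == [0, 240]
```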
4 Conclusion and Future Work
The paper first motivated the need for Grid systems and, in common with the management of supercomputers, the advantages of planning-based scheduling for SLA provisioning and fixed reservations. However, advance reservations have the disadvantage of decreasing the utilization of
the computer system. Using overbooking can be a powerful instrument to recover system utilization and the provider's profit. The paper presented the concept of and an algorithm for overbooking. Since overbooking might lead to conflicts by providing resources for a shorter time than required, fault-tolerance mechanisms are crucial. Checkpointing and migration can be used to prevent job losses and SLA violations, and they support the applicability of overbooking in Grid systems. Future work focuses on the statistical analysis of runtimes and on implementing the overbooking algorithm in the backfilling scheduling algorithms. Finally, we will develop a simulation process for testing and evaluating how much overbooking increases the system utilization and the provider's profit.
References
[1] Cluster Resources, "TORQUE Resource Manager," 2008. [Online]. Available: http://www.clusterresources.com/pages/products/torque-resource-manager.php
[2] IBM, "LoadLeveler," 2008. [Online]. Available: http://www03.ibm.com/systems/clusters/software/loadleveler/index.html
[3] Sun, "Grid Engine," 2008. [Online]. Available: http://gridengine.sunsource.net/
[4] Platform, "LSF Load Sharing Facility," 2008. [Online]. Available: http://www.platform.com/Products/Platform.LSF.Family
[5] "Condor," 2008. [Online]. Available: http://www.cs.wisc.edu/condor/
[6] D. Jackson, Q. Snell, and M. Clement, "Core Algorithms of the Maui Scheduler," Job Scheduling Strategies for Parallel Processing: 7th International Workshop, JSSPP 2001, Cambridge, MA, USA, June 16, 2001: Revised Papers, 2001.
[7] "OpenCCS: Computing Center Software," 2008. [Online]. Available: https://www.openccs.eu/core/
[8] A. Streit, "Self-tuning job scheduling strategies for the resource management of HPC systems and computational grids," Ph.D. dissertation, Paderborn Center for Parallel Computing, 2003. [Online]. Available: http://wwwcs.unipaderborn.de/pc2/papers/files/422.pdf
[9] "HPC4U: Introducing quality of service for grids," http://www.hpc4u.org/, 2008.
[10] V. Liberman and U. Yechiali, "On the hotel overbooking problem: an inventory system with stochastic cancellations," Management Science, vol. 24, no. 11, pp. 1117–1126, 1978.
[11] J. Subramanian, S. Stidham Jr., and C. Lautenbacher, "Airline yield management with overbooking, cancellations, and no-shows," Transportation Science, vol. 33, no. 2, pp. 147–167, 1999.
[12] M. Rothstein, "OR and the airline overbooking problem," Operations Research, vol. 33, no. 2, pp. 237–248, 1985.
[13] B. Urgaonkar, P. Shenoy, and T. Roscoe, "Resource overbooking and application profiling in shared hosting platforms," ACM SIGOPS Operating Systems Review, vol. 36, no. SI, p. 239, 2002.
[14] M. Hovestadt, O. Kao, A. Keller, and A. Streit, "Scheduling in HPC resource management systems: Queuing vs. planning," Job Scheduling Strategies for Parallel Processing: 9th International Workshop, JSSPP 2003, Seattle, WA, USA, June 24, 2003: Revised Papers, 2003.
[15] A. Andrieux, D. Berry, J. Garibaldi, S. Jarvis, J. MacLaren, D. Ouelhadj, and D. Snelling, "Open Issues in Grid Scheduling," UK e-Science Report UKeS-2004-03, April 2004.
[16] M. Siddiqui, A. Villazón, and T. Fahringer, "Grid allocation and reservation: Grid capacity planning with negotiation-based advance reservation for optimized QoS," Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, 2006.
[17] N. Ntene, "An algorithmic approach to the 2D oriented strip packing problem," Ph.D. dissertation.
[18] E. Coffman Jr., M. Garey, D. Johnson, and R. Tarjan, "Performance bounds for level-oriented two-dimensional packing algorithms," SIAM Journal on Computing, vol. 9, p. 808, 1980.
[19] B. Baker and J. Schwarz, "Shelf algorithms for two-dimensional packing problems," SIAM Journal on Computing, vol. 12, p. 508, 1983.
[20] D. Feitelson and M. Jette, "Improved Utilization and Responsiveness with Gang Scheduling," Job Scheduling Strategies for Parallel Processing: IPPS'97 Workshop, Geneva, Switzerland, April 5, 1997: Proceedings, 1997.
[21] D. Lifka, "The ANL/IBM SP Scheduling System," Job Scheduling Strategies for Parallel Processing: IPPS'95 Workshop, Santa Barbara, CA, USA, April 25, 1995: Proceedings, 1995.
[22] A. Mu'alem and D. Feitelson, "Utilization, predictability, workloads, and user runtime estimates in scheduling the IBM SP2 with backfilling," IEEE Transactions on Parallel and Distributed Systems, vol. 12, no. 6, pp. 529–543, 2001.
[23] A. Downey and D. Feitelson, "The elusive goal of workload characterization," ACM SIGMETRICS Performance Evaluation Review, vol. 26, no. 4, pp. 14–29, 1999.
[24] D. Feitelson and B. Nitzberg, "Job Characteristics of a Production Parallel Scientific Workload on the NASA Ames iPSC/860," Job Scheduling Strategies for Parallel Processing: IPPS'95 Workshop, Santa Barbara, CA, USA, April 25, 1995: Proceedings, 1995.
[25] R. Gibbons, "A Historical Application Profiler for Use by Parallel Schedulers," Job Scheduling Strategies for Parallel Processing: IPPS'97 Workshop, Geneva, Switzerland, April 5, 1997: Proceedings, 1997.
[26] A. Downey, "Using Queue Time Predictions for Processor Allocation," Job Scheduling Strategies for Parallel Processing: IPPS'97 Workshop, Geneva, Switzerland, April 5, 1997: Proceedings, 1997.
[27] W. Smith, I. Foster, and V. Taylor, "Predicting Application Run Times Using Historical Information," Job Scheduling Strategies for Parallel Processing: IPPS/SPDP'98 Workshop, Orlando, Florida, USA, March 30, 1998: Proceedings, 1998.
Germany, Belgium, France, and Back Again: Job Migration using Globus
D. Battré1, M. Hovestadt1, O. Kao1, A. Keller2, K. Voss2
1 Technische Universität Berlin, Germany
2 Paderborn Center for Parallel Computing, University of Paderborn, Germany
Abstract The EC-funded project HPC4U developed a Grid fabric that provides not only SLA-awareness but also a software-only system for checkpointing sequential and MPI-parallel jobs. This allows job completion and SLA compliance even in case of resource outages. Checkpoints are generated in the background, transparently to the user. There is no need to modify the application in any way or to execute it in a special manner. Checkpoint data sets can be migrated to other cluster systems to resume the job execution. This paper focuses on job migration over the Grid, utilizing the WS-Agreement protocol for SLA negotiation and mechanisms provided by the Globus Toolkit. Keywords: Checkpointing, Globus, Job-Migration, Resource-Management, SLA-Negotiation, WS-Agreement
I. Introduction The developments of recent years in the scope of Grid technologies have formed the basis for an invisible Grid whose distributed services can be easily accessed and used. Using Grids to execute compute-intensive simulations and applications is very common in the scientific community nowadays. Funded by national and international organizations, large Grids like ChinaGrid, D-Grid, EGEE, NAREGI, TeraGrid, or the UK e-Science initiative have been established and are successfully used in the academic world. Additionally, companies like IBM or Microsoft have identified the commercial potential of providing Grid services, as reflected in their participation in and funding of Grid initiatives. (This work has been partially supported by the EC within the 6th Framework Programme under contract IST-031772 "Advanced Risk Assessment and Management for Trustable Grids" (AssessGrid) and IST-511531 "Highly Predictable Cluster for Internet-Grids" (HPC4U).)
However, while the technological basis for Grid utilization is mostly available, some obstacles still have to be solved in order to gain wide commercial uptake. One of the most challenging issues is to provide Grid services in a manner that fulfils the consumer's demands of reliability and performance. This implies that providers have to agree on Quality of Service (QoS) aspects which are important to the end-user. QoS in the scope of Grid services primarily addresses resource and time constraints, like the number and speed of compute nodes as well as the earliest execution time or latest completion time. To contractually determine such QoS requirements between service providers and consumers, Service Level Agreements (SLAs) [1] are commonly used. They allow the formulation of all expectations and obligations in the business relationship between the customer and the provider. The EC-funded project HPC4U (Highly Predictable Cluster for Internet Grids) [2] developed a Grid fabric that provides not only SLA-awareness but also a software-only system for checkpointing sequential and MPI-parallel jobs. This allows job completion and SLA compliance even in case of resource outages. Checkpoints are generated in the background, transparently to the user (i.e. the application does not need to be modified in any way), and are used to resume the execution of an interrupted application on healthy resources. Normally, the job is resumed on the same system, since spare resources have been provided to enhance the reliability of the system. However, in case there are not enough spare resources available to resume the execution, the HPC4U system is able to migrate the job to other clusters within the same administrative domain (intra-domain migration) or even over the Grid (inter-domain migration) by using WS-Agreement [3] and mechanisms provided by the Globus Toolkit [4]. This paper focuses on inter-domain migration. In the next section we highlight the architecture of the HPC4U software stack. Section III discusses some issues related to the negotiation protocol.
Section IV describes the phases of job migration over the Grid in more detail. An overview of related work and a short conclusion complete this paper.
Fig. 1. The HPC4U software stack
II. The Software Stack Architecture The software stack depicted in Figure 1 consists of a planning-based [5] Resource Management System (RMS) called OpenCCS [6], a Negotiation Manager (NegMgr), and the Globus Toolkit (GT). OpenCCS: The depicted modules (related to job migration) are the Planning Manager (PM), the Sub-System Controller (SSC), and dedicated subsystems for the checkpointing of processes (CP), network (NW), and storage (ST). The PM computes the schedule, whereas the SSC orchestrates the mentioned subsystems for periodically creating checkpoints. OpenCCS communicates with the NegMgr through a so-called ProtocolProxy (PP), which is responsible for translating between the SOAP- and Java-based NegMgr and the TCP-stream- and C-based OpenCCS. The Negotiation Manager: The NegMgr facilitates the negotiation of SLAs between an external entity and the RMS by implementing the WS-Agreement specification version 1.0. The NegMgr has been developed in close collaboration between the EC-funded projects HPC4U and AssessGrid [7]. The Negotiation Manager is implemented as a Globus Toolkit service because this allows it to use and support several mechanisms offered by the toolkit, such as authentication, authorization, GridFTP, monitoring services, etc. Another major reason for implementing the Negotiation Manager as a Globus Toolkit service was the potential impact of this move, as GT is among the de facto standard Grid middlewares used throughout the world. The consumer of the negotiation service can either be an end-user interface (e.g. a portal or command line tools) or even a broker service. The consumer accesses the NegMgr for creating or negotiating new SLAs and for checking the status of existing SLAs. WS-Notification is used to monitor changes in the status of an agreement, and the WS-ResourceFramework provides general read access to created agreements as well as write access to the database of agreement templates.
III. The Negotiation Protocol
The WS-Agreement protocol is limited in terms of negotiation, basically resembling a one-phase commit protocol. This is insufficient for scenarios with more than two parties, e.g. an end-user who is interested in the consumption of compute resources, a resource broker, and a resource provider. Three different usage scenarios are the main subjects: • Direct SLA negotiation between an end-user and a provider. • The end-user submits an SLA request to the broker, which then looks for suitable resources and forwards the request to suitable providers. The broker returns the SLA offers to the end-user, who is then free to select and commit to an SLA offer by interacting directly with the corresponding provider. • The broker acts as a virtual provider. The end-user agrees an SLA with the broker, which in turn agrees SLAs with all providers involved in executing the end-user's application. From these scenarios we can conclude that the current negotiation interface of WS-Agreement does not satisfy our needs. According to WS-Agreement, by the act of issuing an SLA request to a resource provider, a resource consumer is already committed to that request. The provider can only accept or reject the request. This has certain shortcomings. It is common real-world practice that a customer specifies a job and asks several companies for offers. The companies state their price for the job, and the customer can pick the cheapest offer. This is not possible in the current WS-Agreement specification, where a user who submits an SLA request is committed to that request. We drop that assumption: a user can submit a non-binding SLA request, and the provider is allowed to modify the request by answering with an SLA offer that has a price tag. The provider is bound to this offer, and the user can either commit to the offer or let it expire. Therefore, the interface and semantics of the WS-Agreement implementation were slightly modified.
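The modified exchange can be sketched as a small state machine; the class and state names below are ours, not WS-Agreement vocabulary, and pricing is reduced to a callback:

```python
from enum import Enum, auto

class State(Enum):
    QUOTED = auto()     # provider answered with a priced offer (binds provider)
    COMMITTED = auto()  # consumer committed: binding for both sides
    REJECTED = auto()

class Negotiation:
    """Non-binding createAgreement plus an explicit commit call."""
    def __init__(self):
        self.state, self.price = None, None

    def create_agreement(self, request, price_for):
        """The provider may answer the non-binding request with a priced
        offer, or reject it (price_for returns None)."""
        price = price_for(request)
        if price is None:
            self.state = State.REJECTED
        else:
            self.price, self.state = price, State.QUOTED
        return self.state

    def commit(self):
        if self.state is State.QUOTED:
            self.state = State.COMMITTED
        return self.state

# The customer asks several providers and commits only to the cheapest:
offers = {}
for site, tariff in {"A": 12.0, "B": 9.5, "C": None}.items():
    n = Negotiation()
    if n.create_agreement("job", lambda req, t=tariff: t) is State.QUOTED:
        offers[site] = n
best = min(offers, key=lambda s: offers[s].price)
assert best == "B" and offers[best].commit() is State.COMMITTED
```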
A createAgreement call is not binding for the issuer, and the WS-Agreement interface is extended by a commit method. As the WS-GRAM implementation
of Globus is designed for queuing-based schedulers, a new state machine was implemented that supports external events from the RMS (via the PP), GridFTP, and time events to trigger state transitions. For standardizing the execution requirements of a computational job, the Job Submission Description Language (JSDL) [8] has been introduced by the JSDL working group of the Global Grid Forum. By means of JSDL, all parameters for job submission can be specified, e.g. the name of the executable, required application parameters, or file transfers for stage-in and stage-out. Our current WS-Agreement implementation supports only a subset of JSDL (namely POSIX jobs). The Globus Resource Allocation Manager (GRAM) forms the glue between the RMS and the Globus Toolkit. It is responsible for parsing requests coming from upper Grid layers and transforming them into uniform service requests for the underlying RMS. Unfortunately, GRAM is not able to handle (advance) reservations; therefore we bypass the GRAM layer. If an SLA request has been found to be template compliant, it is passed to the resource management system, which decides whether it is able to provide the requested service level. The job is integrated into a tentative schedule which forms the decision base. In case the job fits into the schedule and no previously agreed SLAs need to be violated, an offer is generated. First, the SLA describes the time by which the user guarantees to provide the files for stage-in. At this time, the negotiation manager's state machine triggers the file staging via GridFTP. Next, the SLA specifies the earliest job start and finish time as well as the requested execution time. The RMS scheduler is free to allocate the resources for the requested time anywhere between the earliest start and the latest execution time. It is considered the user's task to estimate the duration of the stage-in phase and set the earliest job start accordingly.
In order to compensate for small underestimations, a synchronization point is introduced. In case the stage-in is not completed when the resources are allocated, the actual execution pauses until the stage-in is completed. Note that this idle time is part of the guaranteed CPU time and may prevent the successful execution of the program.
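The effect of the synchronization point can be sketched as follows (the function name and time units are our assumptions):

```python
def guaranteed_compute_time(alloc_start, alloc_length, stagein_done):
    """If stage-in finishes after the resources are allocated, the job
    waits, and the wait is charged against the guaranteed CPU time."""
    wait = max(0, stagein_done - alloc_start)
    return max(0, alloc_length - wait)

assert guaranteed_compute_time(100, 300, 90) == 300   # stage-in done in time
assert guaranteed_compute_time(100, 300, 160) == 240  # 60 units lost waiting
```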
IV. Grid Migration Phases Initializing the Migration Process: The first step of the migration process is triggered by the detection of a failure (e.g. a node crash) by the monitoring functionalities of OpenCCS. The PM is signaled about this outage and tries to create a new schedule that takes the changed system situation into account. If it is not possible to generate a schedule in which the SLAs of all jobs are respected, the PM initiates
the job migration. Packing the Migration Data Set: Before any migration process can be initiated, it has to be ensured that the migration data set is available. This is signaled to the SSC, which then generates such a data set. Besides the checkpoint itself, it includes information provided by the RMS, like the number of performed checkpoints or the layout of MPI processes on the nodes. Once the migration data set has been generated, the PP finally transfers all necessary data to the NegMgr for starting the query for potential migration targets. This also includes the path to a transfer directory where the SSC has saved the migration data and where the results should be written to. Finding the Migration Target: The NegMgr is now in charge of finding appropriate migration targets. For this, it uses, for instance, resource information catalogues in the Grid. This query is the static part of the retrieval process. In a second step, the NegMgr contacts all other NegMgr services of the returned Grid fabrics and initiates the negotiation process with them. This SLA negotiation is started with a getQuote message to the remote NegMgr, asking for an SLA offer. The remote system can either reject this message or answer with OK. Note that this OK is non-binding. The source NegMgr now has to choose among the offerings where the remote Negotiation Managers replied with an OK message. An important aspect of this ranking could be the price. At the end, the initiating NegMgr sends one of these remote NegMgrs a createAgreement message. If the remote site still agrees, it replies with OK, which makes the SLA binding for both the local and the remote site. Otherwise the remote site terminates the process (e.g. because the free resources have meanwhile been allocated by other jobs). Transferring the Migration Data Set: The transfer process is driven by the remote site.
This is possible since the SSC has generated the migration data set in a transfer directory accessible by the Globus Toolkit. Since the data transfer is secured by proxy credentials, the source site grants these rights only to the respective remote site, so that no other party may access the data. After the data set has been transferred to the remote site, both RMSs are notified by their associated NegMgr. This is triggered by a WS-Notification event. The source PM then changes the state of the migrated job to remote execution. Similarly, the remote NegMgr signals the remote PM that the migration data set is available, so that the job execution can be resumed.
Resuming the Job Execution: Once the migration data has been transferred and the PM decides to restart the job from the migrated checkpoint, the SSC is first signaled to re-establish the process environment, taking the situation of the storage working container into account. During the
runtime, the SSC creates checkpoints at regular intervals.
Transferring the Result Data Set: If everything goes right, the job finally finishes its computation and the job results can be transferred back to the source. The results are packed by the remote SSC component and saved to a transfer directory. The PP then signals its NegMgr that the result data set is available. The remote NegMgr initiates the data transfer to the location that was originally provided by the source site at negotiation time. Once the data transfer has completed successfully, both NegMgrs inform their PM components. On the remote site, all used resources are now released. On the source site, the PM initiates the stage-out of the result data to the user-specified target. This way, the user retrieves the result data without noticing any difference to a purely local computation.
Job Forwarding: In case the job also fails on the remote site, it may be migrated again to another site to resume the execution. The steps are the same as described above. However, transferring the result data back to the originating site is done along the whole migration chain. All intermediate sites then act as proxies, forwarding the incoming result data. This also simplifies the clean-up.
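The two-phase getQuote/createAgreement exchange in the "Finding the Migration Target" step can be sketched as follows. This is a minimal illustration of the protocol's control flow; the class and method names are illustrative stand-ins, not the actual NegMgr API.

```python
# Sketch of the two-phase SLA negotiation described above. Class and
# method names are illustrative stand-ins, not the actual NegMgr API.

class RemoteNegMgr:
    """Models a remote Negotiation Manager that can quote and agree."""

    def __init__(self, name, price, has_free_resources=True):
        self.name = name
        self.price = price
        self.has_free_resources = has_free_resources

    def get_quote(self):
        # Non-binding answer: an OK here may still be withdrawn later.
        return "OK" if self.has_free_resources else "REJECT"

    def create_agreement(self):
        # Binding step: succeeds only if resources are still free.
        return "OK" if self.has_free_resources else "REJECT"

def negotiate(remotes):
    """Return the name of the site the SLA became binding with, or None."""
    # Phase 1: collect non-binding offers via getQuote.
    offers = [r for r in remotes if r.get_quote() == "OK"]
    # Rank the offers, e.g. by price (cheapest first), as the text suggests.
    for candidate in sorted(offers, key=lambda r: r.price):
        # Phase 2: try to make the SLA binding via createAgreement.
        if candidate.create_agreement() == "OK":
            return candidate.name
    return None
```

Note that a candidate may still refuse at the createAgreement step, in which case the initiator simply falls through to the next-ranked offer.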
V. Related Work

The worldwide research in Grid computing has resulted in numerous different Grid packages. Besides many commodity Grid systems, general purpose toolkits exist such as Unicore [9] or the Globus Toolkit [4]. Although the Globus Toolkit represents the de-facto standard for Grid toolkits, all these systems have proprietary designs and interfaces. To ensure future interoperability of Grid systems as well as the opportunity to customize installations, the OGSA (Open Grid Services Architecture) working group within the Open Grid Forum (OGF) aims to develop the architecture for an open Grid infrastructure [10]. In [11], important requirements for the Next Generation Grid (NGG) were described. An architecture that supports the co-allocation of multiple resource types, such as processors and network bandwidth, was presented in [12]. The Grid community has identified the need for a standard for SLA description and negotiation. This led to the development of WS-Agreement/-Negotiation [3]. The Globus Architecture for Reservation and Allocation (GARA) provides "wrapper" functions to add advance reservation support to a local RMS that lacks this functionality. This is an important step towards an integrated QoS-aware resource management. However, the GARA component of Globus currently supports neither the definition of SLAs nor malleable reservations, nor does it provide resilience mechanisms to handle resource outages or failures.

VI. Conclusion

In this paper we described the mechanisms provided by the HPC4U software stack for migrating checkpoint data sets of sequential or MPI-parallel applications over the Grid. We use the Globus Toolkit for finding appropriate resources and for migrating the checkpoint and result data sets. We also developed an implementation of the WS-Agreement protocol to be able to negotiate with the local RMSs. Thanks to the transparent checkpointing capabilities, these mechanisms also apply to the execution of commercial applications, where no source code is available and recompiling or relinking is not possible. The user does not even have to modify the way of executing the job. We have shown the practicability of the presented mechanisms by migrating jobs from a site in Germany to Belgium and from there to a site in France. The results were automatically transferred back to Germany. The developed software can be found on the project pages.

References
[1] A. Sahai, S. Graupner, V. Machiraju, and A. van Moorsel, "Specifying and Monitoring Guarantees in Commercial Grids through SLA," Internet Systems and Storage Laboratory, HP Laboratories Palo Alto, Tech. Rep. HPL-2002-324, 2002.
[2] "Highly Predictable Cluster for Internet-Grids (HPC4U), EU-funded project IST-511531," http://www.hpc4u.eu.
[3] A. Andrieux, K. Czajkowski, A. Dan, K. Keahey, H. Ludwig, T. Nakata, J. Pruyne, J. Rofrano, S. Tuecke, and M. Xu, "Web Services Agreement Specification (WS-Agreement)," www.gridforum.org/Public Comment Docs/Documents/Oct-2005/WS-AgreementSpecificationDraft050920.pdf, 2004.
[4] "Globus Alliance: Globus Toolkit," http://www.globus.org.
[5] M. Hovestadt, O. Kao, A. Keller, and A. Streit, "Scheduling in HPC Resource Management Systems: Queuing vs. Planning," in Job Scheduling Strategies for Parallel Processing: 9th International Workshop, JSSPP, Seattle, WA, USA, 2003.
[6] "OpenCCS," http://www.openccs.eu.
[7] "Advanced Risk Assessment and Management for Trustable Grids (AssessGrid), EU-funded project IST-031772," http://www.assessgrid.eu.
[8] "Job Submission Description Language (JSDL)," www.gridforum.org/documents/GFD.56.pdf.
[9] "UNICORE Forum e.V.," http://www.unicore.org.
[10] GGF Open Grid Services Architecture Working Group (OGSA WG), "Open Grid Services Architecture: A Roadmap," http://www.ggf.org/ogsa-wg, 2003.
[11] K. Jeffery (ed.), "Next Generation Grids 2: Requirements and Options for European Grids Research 2005-2010 and Beyond," ftp://ftp.cordis.lu/pub/ist/docs/ngg2 eg final.pdf, 2004.
[12] I. Foster, C. Kesselman, C. Lee, B. Lindell, K. Nahrstedt, and A. Roy, "A Distributed Resource Management Architecture that Supports Advance Reservations and Co-Allocation," in 7th International Workshop on Quality of Service (IWQoS), London, UK, 1999.
Development of Grid Service Based Molecular Docking Application

HwaMin Lee1, DooSoon Park1, and HeonChang Yu2
1 Division of Computer Science & Engineering, Soonchunhyang University, Asan, Korea
2 Department of Computer Science Education, Korea University, Seoul, Korea
Abstract - Molecular docking is the process of reducing an unmanageable number of compounds to a limited number of compounds for the target of interest by means of computational simulation. It is a large-scale scientific application that requires large computing power and data storage capability. Previous applications or software for molecular docking were developed to run on a supercomputer, a workstation, or a cluster computer. However, molecular docking on a supercomputer is very expensive, and molecular docking on a workstation or a cluster computer requires a long execution time. Thus we propose a Grid service-based molecular docking application. We designed a resource broker and a data broker to support an efficient molecular docking service and developed various services for molecular docking. Our application can reduce the timeline and cost of drug or new material design.
Keywords: Molecular docking, grid computing, grid services.
1 Introduction
Drug discovery is an extended process that may involve heavy cost and a long timeline, from the first compound synthesis in the laboratory until the therapeutic agent, or drug, is brought to market [1, 2]. Molecular docking, as shown in Figure 1, searches the feasible binding geometries of a putative ligand with a target receptor whose 3D structure is known. Molecular modeling methodology combines computational chemistry and computer graphics, and molecular docking has emerged as a popular methodology for drug discovery [3]. Docking methods in virtual screening can dock a large number of small molecules into the binding site of a receptor, allowing for a rank ordering in terms of strength of interaction with a particular receptor [4]. Docking each molecule in the target chemical database is both a compute- and data-intensive task [3]. Tremendous efforts are underway to improve programs aimed at the automated process of docking, or positioning compounds in a binding site, and scoring, or rating the complementarity of small molecules. The challenge in applications of molecular docking is that they are very compute intensive and require a very fast computer to run.
Figure 1. Molecular docking

In the mid 1990s, Grid computing emerged as an important new field, distinguished from conventional distributed computing by its focus on large-scale resource sharing, innovative applications, and high-performance orientation [5, 6]. A Grid computing system [6] consists of large sets of diverse, geographically distributed resources that are grouped into virtual computers for executing specific applications. Today, Grid computing offers the strongest low-cost and high-throughput solutions [2] and is spotlighted as the key technology of the next generation Internet. Grid computing is used in fields as diverse as astronomy, biology, drug discovery, engineering, weather forecasting, and high-energy physics. Molecular docking is one of the large-scale scientific applications that require large computing power and data storage capability. Thus we developed a Grid service-based molecular docking application using Grid computing technology, which supports a large data-intensive operation. In this paper, we constructed a 3-dimensional chemical molecular database for molecular docking. We designed a resource broker and a data broker to support an efficient molecular docking service and proposed various services for molecular docking. We implemented the Grid service based molecular docking application with DOCK 5.0 and Globus Toolkit 3.2. Our application can reduce the timeline and cost of drug discovery or new material design. This paper is organized as follows: Section 2 presents related work, and Section 3 explains constructing the chemical molecular
database. In Section 4, we present the architecture of the Grid service based molecular docking system. Section 5 explains the services for molecular docking. Section 6 describes the implementation results. Finally, the paper concludes in Section 7.
2 Related Works
Previous applications or software for molecular docking, such as AutoDock, FlexX, DOCK, LigandFit, and Hex, were developed to run on a supercomputer, a workstation, or a cluster computer. However, molecular docking on a supercomputer is very expensive, and molecular docking on a workstation or a cluster computer requires a long execution time. Recently, several studies have addressed molecular modeling in Grid computing, such as the virtual laboratory [3], BioSimGrid [7], protein structure comparison [8], and APST [9].
However, [3], [8], and [9] did not support Web services, and [8] and [9] did not provide an integrated database service. Although [3] provided a basic database service, it did not provide integrated database management of heterogeneous and distributed databases/data in a Grid environment. Therefore we construct an integrated 3-dimensional chemical database and propose a Grid service based molecular docking system.
3 Constructing a 3D chemical database for molecular docking
In Grid computing, many applications use a large-scale database or data. In existing chemical databases, the order, kinds, and degrees of fields are heterogeneous. Because the kinds of chemical compounds become more various and the size of the database increases, data insertion, deletion, retrieval, and integration in chemical databases become more difficult. Thus we construct a database that integrates the existing various chemical databases using MySQL. Table 1 shows the Protomer table in our chemical database. Our chemical database contains 32,889 chemical molecules. Our Grid service based molecular docking system retrieves data fields for virtual screening, and the retrieved data automatically composes a mol2 file. Our database service also provides a Query Evaluation Service, which inquires information of data nodes for selecting the optimal data node.
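As a rough sketch of this retrieval step, the snippet below stores a molecule row and returns its stored Mol2 content, which could then be written out as the mol2 input file. sqlite3 stands in for the MySQL database actually used, and the schema and sample values are illustrative, loosely following Table 1.

```python
import sqlite3

# Illustrative retrieval step: fetch the stored Mol2 content of one
# protomer so it can be written out as the mol2 input file for DOCK.
# sqlite3 stands in for MySQL; the schema loosely follows Table 1.

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE Protomer (
    Prot_ID TEXT PRIMARY KEY, Mole_name TEXT,
    Molecular_weight REAL, Content TEXT)""")
conn.execute("INSERT INTO Protomer VALUES (?, ?, ?, ?)",
             ("P001", "aspirin", 180.16, "@<TRIPOS>MOLECULE\naspirin\n..."))

def fetch_mol2(prot_id):
    """Return the whole Mol2 content for one protomer, or None."""
    row = conn.execute(
        "SELECT Content FROM Protomer WHERE Prot_ID = ?", (prot_id,)
    ).fetchone()
    return row[0] if row else None
```

Since the Content field holds the whole Mol2 file, a single lookup by Prot_ID is enough to compose the docking input.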
The virtual laboratory [3] provided an environment which grid-enables the molecular docking process by composing it as a parameter sweep application using the Nimrod-G tools. The virtual laboratory proposed the Nimrod Parameter Modeling Tools for enabling DOCK as a parameter sweep application, the Nimrod-G Grid Resource Broker for scheduling DOCK jobs on the Grid, and Chemical Database (CDB) Management and Intelligent Access Tools. But the virtual laboratory did not support Web services, because it was implemented using Globus Toolkit 2.2, and did not provide integration functionality for heterogeneous CDBs or retrieval functionality in distributed CDBs.
The BioSimGrid [7] project provides a generic database for comparative analysis of simulations of biomolecules of biological and pharmaceutical interest. The project is building an open software framework system based on OGSA and OGSA-DAI. The system has a service-oriented computing model and consists of the GUI, services, Grid middleware, and databases/data.
A grid-aware approach to protein structure comparison [8] implemented software tools for the comparison of protein structures. They proposed comparison algorithms based on indexing techniques that store transformation-invariant properties of the 3D protein structures into tables. Because the method required large memory and intensive computing power, they used a Grid framework.
Parameter sweeps on the Grid with APST [9] designed and implemented APST (AppLeS Parameter Sweep Template). The APST project investigated adaptive scheduling of parameter sweep applications on the Grid, and evolved into a usable application execution environment. Parameter sweep applications are structured as sets of computational tasks that are mostly independent. The APST software consists of a scheduler, a data manager, a compute manager, and a metadata manager.

Table 1. Protomer table in our chemical database

  Attribute            Description
  Prot_ID
  Molecular_ID
  Type                 Subset type, e.g. Fragment-like, Drug-like
  Mole_name            The name of the molecule
  LogP                 Log of the octanol/water partition coefficient
  Apolar_desolvation   Apolar desolvation
  Polar_desolvation    Polar desolvation
  H_bond_donors        The number of H bond donors
  H_bond_acceptors     The number of H bond acceptors
  Charge               Total charge of the molecule
  Molecular_weight     Molecular weight with atomic weights taken from
  Rotable_bond         The number of rotable bonds
  Content              Whole contents of the Mol2 file

4 Architecture of Grid service based molecular docking system
The current methodology in Grid computing is service-oriented architecture. In this section, we explain the components of the Grid service based molecular docking system and the steps involved in molecular docking execution. Figure 2 shows the architecture of the Grid service based molecular docking system. The system consists of a broker, computation resources, and data resources, and is composed of multiple individual services located on different heterogeneous machines and administered by different organizations.
• Database information service
The database information service integrates and manages the information required for selecting and accessing data resources. When the data broker sends a query about data resources, it returns the query results to the data broker. The resource selection algorithms of the data broker can select suitable data resources using the replica catalogues and the database information service.
• Computation resources
A computation resource provides the Dock Evaluation Service Factory that performs docking with a receptor and a ligand.
• Data resources
Figure 2. Architecture of Grid service based molecular docking system
• Resource broker
The resource broker is responsible for scheduling docking jobs. For scheduling, it uses information about CPU utilization and available memory capacity and dispatches docking jobs to the selected computation nodes. It also monitors the execution state of jobs and gathers computation results. Our resource broker uses MDS (Metacomputing Directory Service) for resource selection. It provides the Dock Service Group Registry and the Dock Service Factory.
• Data broker
In Grid computing, data-intensive applications produce data sets of terabytes or petabytes, and these data sets are replicated within the Grid environment for reliability and performance. The data broker is responsible for selecting suitable CDB services and replicas for efficient docking execution. It uses the replica catalogue and the database information service, and it uses information about network bandwidth to select the optimal data resource which provides chemical data for the molecular docking execution. Our data broker provides the CDB Query Service Registry and the CDB Query Service Factory.
• Replica catalogue
The replica catalogue manages replica information for CDB resource discovery. It maintains the mapping between logical names and target addresses. A logical name is a unique identifier for replicated data contents, and a target address is the physical location of a replica.
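As a rough illustration of the two broker roles above, the sketch below selects a computation node by CPU utilization and free memory, and a data replica by network bandwidth. The scoring rules and field names are assumptions for illustration, not the paper's actual algorithms.

```python
# Illustrative selection policies for the two brokers described above.
# The scoring rules and field names are assumptions, not the paper's
# actual algorithms.

def select_compute_node(nodes, mem_needed_mb):
    """Resource broker: pick the least-loaded node with enough memory."""
    eligible = [n for n in nodes if n["free_mem_mb"] >= mem_needed_mb]
    if not eligible:
        return None
    # Lowest CPU utilization wins among the eligible nodes.
    return min(eligible, key=lambda n: n["cpu_util"])["name"]

def select_data_node(replicas):
    """Data broker: pick the replica behind the fastest network link."""
    return max(replicas, key=lambda r: r["bandwidth_mbps"])["host"]
```

In the real system, the node and replica attributes would come from MDS and the replica catalogue rather than from in-memory dictionaries.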
Data resources consist of various file systems, databases, XML databases, and hierarchical repository systems. A data resource provides the Query Evaluation Service Factory. In our Grid service based molecular docking system, services that want to use data resources can access heterogeneous data resources in a uniform way through the data broker. Dock Service can also access data resources using OGSA-DAI and OGSA-DQP.
5 Grid service specification for molecular docking
In this section, we explain the service specification for molecular docking. The service specification is orthogonal to the Grid authentication and authorization mechanisms.
• Dock Service Group Registry
The Dock Service Group Registry (DSGR) is a persistent service which registers the Dock Evaluation Service Factory (DESF) to execute docking on a computation resource. DSGR provides the DockServiceGroupRegistry PortType derived from the GridService PortType, NotificationSource PortType, ServiceGroup PortType, and ServiceGroupRegistration PortType in OGSI. DESF registers and deletes services in DSGR using the ServiceGroupRegistration PortType. DSGR creates a ServiceGroupEntry service and manages the duration time of registered services. Dock Service can query information about computation resources for resource scheduling using DSGR.
• Dock Service Factory
The Dock Service Factory (DSF) is a persistent service which provides the DockServiceFactory PortType derived from the GridService PortType of OGSI. The primary function of DSF is to create a Dock Service instance at the request of a client. Any client wishing to execute molecular docking first connects to DSF and creates an instance. A Dock Service
instance discovers suitable computation resources through DSGR and requests the Dock Evaluation Service.
• Dock Evaluation Service Factory
The Dock Evaluation Service Factory (DESF) executes the docking application with a given receptor and ligand. DESF creates a Dock Evaluation Service (DES) instance at the request of a Dock Service instance. The DES instance queries ligands through the CDB Query Service Factory (CDBQSF). DESF provides the DockEvaluationServiceFactory PortType derived from the GridService PortType of OGSI.
Figure 3 shows a screenshot of running the globus-start-container command. The list shown in Figure 3 is the list of Grid services that are started along with the container. The DockServiceFactory, DockEvalServiceFactory, DockServiceGroupRegistry, CDBQueryServiceGroupRegistry, CDBQueryServiceFactory, and QueryEvaluationServiceFactory that we defined and implemented are shown in the red box in Figure 3.
• CDB Service Group Registry
The CDB Service Group Registry (CDBSGR) is a persistent service which provides the CDBServiceGroupRegistry PortType derived from the GridService PortType, NotificationSource PortType, ServiceGroup PortType, and ServiceGroupRegistration PortType of OGSI. CDBSGR registers the Query Evaluation Service Factory (QESF), which provides inquiry functionality for molecules in the CDB. CDBSGR creates a ServiceGroupEntry service and manages a lifespan using the ServiceGroupEntry service. It also queries information about registered services using the GridService PortType and provides functionality that notifies about updated services using the NotificationSource PortType.
• CDB Query Service Factory
The CDB Query Service Factory (CDBQSF) is a persistent service which queries information about the ligand that a DES instance requests. When the CDB is available from more than one source, the data broker selects one of them using CDBQSF. CDBQSF provides the CDBQueryServiceFactory PortType derived from the GridService PortType of OGSI. CDBQSF creates a CDB Query Service (CDBQS) instance at the request of a DES instance. The CDBQS instance queries information about the ligand through QESF.
Figure 3. Screenshot of running the globus-start-container command

We implemented a Dock Client for ease of use. Figure 4 shows the interface of the Dock Client and the window for searching ligands. The interface of the Dock Client is divided into a toolbar, a process state display section, and a screened molecules section. The toolbar consists of Login, Search Receptor, Search Ligand, and Docking. If more than two molecules are found, the client can select some of them in the table.
• Query Evaluation Service Factory
The Query Evaluation Service Factory (QESF) is a persistent service which queries the CDB for the ligand that a CDBQS instance requests. QESF provides the QueryEvaluationServiceFactory PortType derived from the GridService PortType of OGSI and the GDS PortType of OGSA-DAI. QESF creates a Query Evaluation Service (QES) instance at the request of a CDBQS instance. The QES instance queries the database using OGSA-DAI.
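The six services above all follow the same OGSI factory/registry pattern: a registry holds persistent factories, and each factory creates transient service instances on request. A minimal sketch of that pattern follows; the class and dictionary-based instances are illustrative stand-ins, not the real PortType interfaces.

```python
# Minimal sketch of the OGSI factory/registry pattern shared by the
# six services above. Names are illustrative, not the real PortTypes.

class ServiceGroupRegistry:
    def __init__(self):
        self._factories = {}

    def register(self, name, factory):
        # A real registry would also manage the entry's lifetime.
        self._factories[name] = factory

    def lookup(self, name):
        return self._factories.get(name)

class DockServiceFactory:
    def create_instance(self, receptor, ligand):
        # A real factory would create a stateful Grid service instance.
        return {"receptor": receptor, "ligand": ligand, "state": "created"}

registry = ServiceGroupRegistry()
registry.register("DockServiceFactory", DockServiceFactory())
```

A client would first look up the factory in the registry and then ask it for an instance, mirroring the connect-then-create flow described for DSF.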
6 Implementation
We implemented the Grid service based molecular docking system using Globus Toolkit 3.2 and DOCK 5.0, developed by researchers at UCSF. We constructed the CDB using MySQL and defined the services with XML and WSDL.
Figure 4. Interface of the Dock Client and the window for searching ligands

Figure 5 shows a screenshot of docking executed with the selected receptor and ligand. The energy score resulting from the docking execution is shown in the red box in Figure 5. If the client clicks the energy score column, the energy scores are sorted in ascending order.
Figure 5. Screenshot of docking executed with the selected receptor and ligand

7 Conclusion

In drug discovery, molecular docking is an important technique because it reduces an unmanageable number of compounds to a limited number of compounds for the target of interest. Previous applications or software for molecular docking were developed to run on a supercomputer, a workstation, or a cluster computer. However, molecular docking on a supercomputer is very expensive, and molecular docking on a workstation or a cluster computer requires a long execution time. Thus we proposed a Grid service based molecular docking system using Grid computing technology. To develop the system, we constructed a 3-dimensional chemical molecular database and defined six services for molecular docking. We implemented the Grid service based molecular docking system with DOCK 5.0 and Globus Toolkit 3.2. Our system can reduce the timeline and cost of drug discovery or new material design and provides clients with ease of use. In the future, we plan to perform various experiments with terabyte- or petabyte-scale data sets to measure the efficiency of our molecular docking system.

References
[1] Elizabeth Lunney. "Computing in Drug Discovery: The Design Phase". IEEE Computing in Science and Engineering Magazine, 2001.
[2] Ahmar Abbas. Grid Computing: A Practical Guide to Technology and Applications. Charles River Media, 2003.
[3] Rajkumar Buyya, Kim Branson, Jon Giddy, David Abramson. "The Virtual Laboratory: a toolset to enable distributed molecular modeling for drug design on the World-Wide Grid". Concurrency and Computation: Practice and Experience, Vol. 15, 2003.
[4] Philip Bourne, Helge Weissig. Structural Bioinformatics. Wiley-Liss, 2003.
[5] I. Foster, C. Kesselman, S. Tuecke. "The Anatomy of the Grid: Enabling Scalable Virtual Organizations". International J. Supercomputer Applications, 15(3), 2001.
[6] Ian Foster, Carl Kesselman. The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann Publishers, 1998.
[7] Bing Wu, Kaihsu Tai, Stuart Murdock, Muan Hong Ng, Steve Johnston, Hans Fangohr, Paul Jefferys, Simon Cox, Jonathan Essex, Mark Sansom. "BioSimGrid: towards a worldwide repository for biomolecular simulations". Organic & Biomolecular Chemistry, Vol. 2, 2004.
[8] Carlo Ferrari, Concettina Guerra, G. Zanotti. "A Grid-aware approach to protein structure comparison". Journal of Parallel and Distributed Computing, 2003.
[9] H. Casanova, F. Berman. "Parameter Sweeps on the Grid with APST". Grid Computing: Making the Global Infrastructure a Reality, Wiley Publishers, 2003.
A Reliability Model in Wireless Sensor Powered Grid

Saeed Parsa, Azam Rahimi, Fereshteh-Azadi Parand
Faculty of Computer Engineering, Iran University of Science and Technology, Narmak, Tehran, Iran
Abstract - The widespread use of sensor networks, along with growing interest in grid computing and data grid infrastructures, has led to integrating them as a single system. To measure the reliability of such an integrated system, named Sensor Powered Grid (SPG), a reliability model is proposed. Considering a semi-centralized architecture comprising grid clients, a resource manager, computing resources and sensor networks, a time distribution model has been developed as a reliability measure. Using this distribution model and the universal moment generating function (UMGF), the performance of each element, such as sensors, resources and their communication paths with the resource management system (RMS), is formulated distinctly, resulting in the system's total performance.
Keywords: Grid Computing, Sensor Networks, Fault Tolerance, Reliability, UMGF
1 Introduction
With the advent of technology in micro-electro-mechanical systems (MEMS) and the creation of small sensor nodes which integrate several kinds of sensors, sensor networks are used to observe natural phenomena such as seismological events, weather conditions and traffic. These tiny sensor nodes can easily be deployed in a designated area to form a wireless network and perform a specific function. On the other hand, the availability of increasingly inexpensive embedded computing and wireless communication technologies, such as IEEE 802.11 and Bluetooth, is now making mobile computing more affordable. The increasing reliance on wireless networking for information exchange makes it critical to maintain reliable and fault-tolerant communication even in the event of a component failure or security breach. In such wireless networks, mobile application systems continuously join and leave the network and change location, with the resulting mobility impacting the degree of survivability, reliability and fault-tolerance of the communication. Grid computing, in turn, provides widespread dynamic, flexible and coordinated sharing of geographically distributed heterogeneous network resources among dynamic user groups. By spreading workload across a large number of computers, the grid computing user can take advantage of enormous computation, storage and bandwidth resources that would otherwise be prohibitively expensive to attain within traditional multiprocessor supercomputers.
The combination of sensor networks and grid computing enables the complementary strengths and characteristics of both to be realized on a single integrated platform. This integrated platform provides real-time data about the environment for computational resources, which enables the construction of real-time models and databases of the environment and physical processes as they unfold, from which high-value computation like decision making, analytics, data mining, optimization and prediction can be carried out. Given the unreliable nature of the sensor network and grid environment, investigating the reliability of the sensor powered grid is of great importance. Due to the low cost and the deployment of a large number of sensor nodes in an uncontrolled environment, it is not uncommon for the sensor nodes to become faulty and unreliable. On the other hand, using sensor networks as the eyes and ears of the grid necessitates the deployment of a more reliable and fault-tolerant resource management system; in other words, employing sensor networks in the grid means inserting more potentially faulty decisions into the grid environment. Thus fault-tolerance in the sensor powered grid is one of the most important factors in resource management. Although some fault-tolerant models have been proposed for grids and sensor networks separately, a comprehensive model for the integrated grid and sensor network has not yet been proposed. The importance of a complementary fault tolerance model becomes even clearer when the lack of a middle layer between the fault tolerance models in grids and sensor networks is observed.
2 Related Works
In providing fault tolerance, four phases can be identified: error detection, damage confinement, error recovery, and fault treatment and continued system service [1]. Some fault tolerance models are involved in error detection. Globus HBM [2] provides a generic failure detection service designed to be incorporated into distributed systems and grids. The application is notified of the failure and takes appropriate recovery action. MDS-2 [3] can in theory support task crash failure detection via its GRRP notification protocol and GRIS/GIIS framework. Legion [4] uses a pinging and timeout mechanism to detect task failures. Condor-G [5] adopts ad-hoc failure detection mechanisms, since the underlying grid protocols ignore fault tolerance issues. In the case of error recovery, most research is based on checkpointing and replication. Job replication is a common method that aims to provide fault tolerance in distributed environments by scheduling redundant copies of
each job, so as to increase the probability of having at least a simple job executed. In [6] a very interesting analysis of requirements for fault tolerance in the grid is presented, along with a failure detection service and a flexible failure handling framework. In this case most research has its main focus on the provision of a single fault tolerance mechanism targeting system-specific domains. In Globus [2] there is an application to take appropriate recovery action. Legion [3,4] provides mechanisms to support fault tolerance such as check-pointing. Remaining grid systems like Netsolve [7], Menlat [8] and Condor-G [9] have their own failure recovery mechanisms. They provide a single user-transparent failure recovery mechanism (e.g. retrying in Netsolve and in Condor-G, and replication in Menlat). In the field of sensor networks, there are many proposed fault tolerance models. This was first studied by Marzullo [10], who proposed a model that tolerates individual sensor failures. Prasad et al. [11] extended Marzullo's model by considerably reducing the output interval estimate. Another extension of Marzullo's approach is presented in [12]; the proposed solution relaxes the assumption on the number of faulty nodes and uses statistical theory to obtain a fault tolerant integrated estimate of the parameter being measured by the sensors. There are many other FT schemes for sensor networks that have been suggested. In [13] an algorithm is developed that guarantees reliable and fairly accurate output from a number of different types of sensors. A fault tolerance technique is proposed in [14] where a single type of resource backs up different types of resources. The work in [15] considers fault tolerance in sensor networks from the point of view of node placement.
The Geographic Hash Table (GHT) [16] for data dissemination uses data replication to tolerate faults in sensor networks. Saleh et al. [17] introduce the concept of, and present a schema for, energy-aware fault tolerance. Neogy [18] presents a fault tolerance model based on (i) a Triple Modular Redundancy (TMR) technique and (ii) a checkpoint-and-recovery technique, to provide a wireless TMR (WTMR) checkpointing technique. Integrating the grid with wireless sensor networks has made it necessary to propose a capable model for fault detection and recovery. The proposed fault tolerance models concentrate on grids and sensor networks separately, and no comprehensive model has been proposed to address fault tolerance in a wireless-sensor-integrated grid. In this paper we address this problem.
3 Arrangement Model
Some arrangement models have been proposed for integrating grid computing and sensor networks, but they are not comprehensive and need to be modified to avoid some drawbacks. Tham and Buyya [19] have considered two types of sensor-grid computing, centralized and decentralized. In the former, sensors and sensor networks connect and interface to the grid, and all computations take place in the grid; no data fusion is executed outside the grid. In the latter, sensor-grid computing is executed on a distributed architecture, in a
manner that involves processing and decision making within the sensor network and at other levels of the sensor-grid architecture. This method is designed for data grids, where decisions are made on raw data and, based on those decisions, actuators are activated. They consider the decentralized method the better choice. There are, however, some shortcomings in their model. Their decentralized model is more suitable for sensor-actuator systems, where decision making is simple and can be performed within the sensor network. In grid computing systems such as weather forecasting, by contrast, the collected sensor data is the initial feed for complex computations in the grid, which are initiated only after the sensor network has collected the raw data. It is therefore not possible for all computations to take place in the sensor network, and planning a comprehensive architecture that considers all required factors is important. We consider a semi-centralized model comprising grid clients, a resource manager, computing resources and sensor networks, named the sensor powered grid (SPG), in which some levels of information filtering and the delivery of reliable data are conducted in the sensor network, while the other, more complex computations take place in the grid. Fig. 1 illustrates the general architecture of the SPG.
[Figure: the RMS connected to computing resources and sensor networks]

Fig. 1. Utilized architecture for SPG
4 Fault Tolerance Model
To explain the fault tolerance model we propose a series of failure definitions which will be used in our fault management system. It is a failure if and only if one of the following conditions is satisfied:
1. The resource processing or sensor data collection stops due to a resource or sensor crash.
2. The availability of a resource or sensor does not meet the minimum level of Quality of Service (QoS).
A sensor powered grid shall support varying degrees of QoS for sensor data delivery. For example, certain sensor data might require low-latency, highly reliable delivery, while other data can tolerate certain degrees of network loss or delay. On the other hand, failures fall into three general types: process failures, resource or sensor failures, and network failures. Fig. 2 illustrates the general fault types. The reliability of the SPG is equal to the reliability of the computing resources plus the reliability of the data resources, where the data resources are sensor networks. The complementary FT model shall cover the connections between the grid and the sensor networks.
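The two failure conditions above can be captured in a small sketch (a hypothetical model written for illustration; the class and field names are assumptions, not code from the paper):

```java
/** Hypothetical sketch of the SPG failure predicate: a component (resource or
 *  sensor) has failed iff it has crashed (condition 1), or its delivered QoS
 *  is below the minimum required level (condition 2). */
class FailureCheck {

    static class Component {
        boolean crashed;      // condition 1: resource/sensor crash
        double deliveredQoS;  // e.g., delivery reliability in [0, 1]
        double minimumQoS;    // minimum acceptable QoS level

        Component(boolean crashed, double deliveredQoS, double minimumQoS) {
            this.crashed = crashed;
            this.deliveredQoS = deliveredQoS;
            this.minimumQoS = minimumQoS;
        }
    }

    /** A failure occurs iff condition 1 (crash) or condition 2 (QoS miss) holds. */
    static boolean isFailure(Component c) {
        return c.crashed || c.deliveredQoS < c.minimumQoS;
    }
}
```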
[Figure: classification tree of sensor powered grid failures. Process failures: process stop failure (1) and process QoS failure (2), conditioned on the resource itself not failing; resource or sensor failures: resource or sensor stop (3) and resource or sensor QoS failure (4); network failures: network disconnect failure (5) and network QoS failure (6), conditioned on the sensor itself not failing.]

Fault type examples in the complementary FT model: 1. Resource processor failure. 2. Lack of availability of the required resource accuracy. 3. Sensor network proxy stops. 4. Unavailability of the sensor network proxy. 5. Disconnection between the resource manager and the sensor network, or between the resource manager and a resource. 6. Slow communication between the resource manager and the sensor network.

Fig. 2. Classification of fault types in SPG

5 The Model

Notation:
- C: entire computational task complexity; c_j: computational complexity of subtask j
- D: entire data vector complexity; d_i: detection rate of data vector i
- T_kj: random time of processing of subtask j by resource k
- τ_li: random time of detection of data i by sensor l
- a_j: data quantity to be transmitted between the RMS and resource k
- a_i: data quantity to be transmitted between sensor l and the RMS
- δ_kj: communication time of resource k for subtask j
- δ_li: communication time of sensor l for data vector i
- processing speed of resource k; detection speed of sensor l
- communication channel speed for resource k; communication channel speed for sensor l
- probability that subtask j is correctly completed by resource k
- probability that data i is correctly detected by sensor l
- probability that data quantity a_j is transported from resource k without failure
- probability that data quantity a_i is transported from sensor l without failure
- reliability functions of time t
- n: number of detected data vectors

Our proposed reliability model consists of two sectors: the data sector and the computational sector. Thus the SPG reliability consists of:

SPG reliability = reliability of sensor network ∧ reliability of computational grid

According to our assumptions, the entire task is divided into m subtasks, which are executed on the resources [20], and n data vectors, which are provided by the sensor network, in such a way that:

C = Σ_{j=1}^{m} c_j ,    D = Σ_{i=1}^{n} d_i    (1)

where C is the entire computational task complexity, c_j is the complexity of each subtask j, D is the entire sensor detection complexity, and d_i is the detection rate for data vector i. The subtask processing time and the data vector detection time are as follows:

(2)

and T_kj = ∞ (respectively τ_li = ∞) if the resource or sensor fails. In the case of resources with a constant failure rate, the probability that resource k does not fail during the processing of subtask j is:

(3)

a_j is the amount of data that should be transmitted between the RMS and the resource k that is processing subtask j (input to the RMS from the resource). Therefore, the time of communication between the RMS and the resource k that processes subtask j can take the value:

(4)

For a constant failure rate, the probability that communication channel k does not fail while delivering the initial data and sending the results of subtask j can be obtained as:

(5)

These give the distribution of the random subtask processing time in the computational resources:

(6)

In the case of a sensor network with a variable failure rate, the probability that sensor l does not fail during the detection of data vector i is:

(7)

in which the exponent is the average number of failures. a_i is the amount of data that should be transmitted between the sensor l that collects data vector i and the RMS (input data from the sensor network to the RMS). The time of communication between the sensor network and the RMS is:

(8)

For a constant failure rate, the probability that communication channel l does not fail during the detection of data i can be obtained as:

(9)

These give the distribution of the random data detection time:

(10)

Since the subtask is completed only when its output reaches the RMS, the random completion time for the detected data vector i, subject to subtask j assigned to resource k, is equal to the sum of these times. It can easily be seen that the distribution of this time is:

(11)

We assume that each sensor detects a data vector i and sends it to the RMS, and that the RMS then divides it into m subtasks. Each subtask j is assigned to the resources comprising a given set. In this case, the random time of completion of subtask j is:

(12)

The entire task is completed when all of the subtasks (including the slowest one) are completed. Therefore, the random task execution time takes the form:

(13)

5.1 Reliability and Performance

In order to estimate both the service reliability and its performance, different measures can be used, depending on the application. The system reliability is defined as the probability that the correct output is produced in a time less than the critical time θ*. This index can be obtained as:

(14)

where Θ is the random task execution time and θ* is the critical time. The service performance (the number of executed tasks over a fixed time) is another point of interest. Here, service performance is defined as the probability that the service produces correct outputs, without respect to the total task execution time; this index can be referred to as R(∞). The conditional expected task execution time W (given that the system produces correct output) is considered to be a measure of its performance: it determines the expected service execution time given that the system does not fail. It can be obtained as:

(15)

The procedure used in this paper for the evaluation of the service time distribution is based on the universal moment generating function (UMGF) technique, also called the universal z-transform or simply the u-transform, a mathematical technique introduced in [21]. This method, convenient for numerical implementation, has proved to be very effective for high-dimension combinatorial problems. The UMGF extends the widely known ordinary moment generating function [22]. The UMGF of a discrete random variable y is defined as a polynomial:

u(z) = Σ_{j=1}^{J} P_j z^{y_j}    (16)

where the variable y has J possible values and P_j is the probability that y is equal to y_j. To obtain the u-transform representing the performance of a function φ(y_1, y_2) of two independent random variables, composition operators are introduced. These operators determine the u-transform for φ(y_1, y_2) using simple algebraic operations on the individual u-transforms of the variables. All of the composition operators take the form:

Ω_φ(u_1(z), u_2(z)) = Σ_i Σ_j P_{1i} P_{2j} z^{φ(y_{1i}, y_{2j})}    (17)

In the case of the grid system, the u-transform u_ij,{k}(Z) can define the performance of the total completion time for data vector i subjected to subtask j assigned to resource k. This u-transform takes the form:

(18)
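Several of the numbered equations in this section were lost in extraction. Under the star-topology service reliability model of Levitin and Dai [20], which this section follows, the missing time and reliability expressions would plausibly take the form below. The symbols x_k, v_l, b_k, b_l and λ_k are names assumed here for illustration, not necessarily the paper's own:

```latex
% Hypothetical reconstruction, following the star-topology model of [20].
% Assumed symbols: x_k = processing speed of resource k, v_l = detection speed
% of sensor l, b_k, b_l = channel speeds, \lambda_k = constant failure rate.
T_{kj} = \frac{c_j}{x_k}, \qquad
\tau_{li} = \frac{d_i}{v_l}
    \qquad \text{(processing and detection times)}
\\
p_{kj} = e^{-\lambda_k T_{kj}}
    \qquad \text{(constant-failure-rate survival of resource } k\text{)}
\\
\delta_{kj} = \frac{a_j}{b_k}, \qquad
\delta_{li} = \frac{a_i}{b_l}
    \qquad \text{(communication times)}
```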
6 An Analytical Example
Consider a sensor powered grid service that uses data vectors from two sensors and four computational resources. Assume that each data vector's computational process is divided into two subtasks by the RMS. The first subtask of each data vector is assigned to resources 1 and 2; the second
subtask is assigned to resources 3 and 4. The reliability of the sensors and resources, and the assigned subtask completion times for the available sensors and resources, are presented in Tables 1 and 2. Regarding the communication times in Tables 1 and 2, it is assumed that resources start the computation process as soon as the sensor networks start data detection. It is also assumed that the communication time of the sensor networks includes the start-up time before data transmittal. The estimated arrangement of the possible network is illustrated in Fig. 3.
[Figure: sensors S1 and S2 are connected to the RMS through paths P1 and P2; the RMS is connected to resources R1, R2, R3 and R4 through paths P3, P4, P5 and P6.]

Fig. 3. Schematic arrangement of the possible network. S: sensors, R: resources, P: paths

Table 1. Reliability of SPG resources and resources' communication paths, and subtask completion times T_kj + δ_kj, including computation and communication times (s)

Resource k | Subtask j | Reliability (resource, path) | Completion time (s)
1          | 1         | 0.6, 0.7                     | 7+1
2          | 1         | 0.5, 0.9                     | 8+2
3          | 2         | 0.4, 0.7                     | 9+1
4          | 2         | 0.7, 0.5                     | 10+3

Table 2. Reliability of SPG sensors and sensors' communication paths, and data detection completion times, including detection and communication times (s)

Sensor i | Reliability (sensor, path) | Detection time (s)
1        | 0.7, 0.3                   | 3+6
2        | 0.5, 0.6                   | 2+7

In order to determine the completion time distribution for both sensors and both subtasks, the u-transforms are defined:

u_1j,{k}(Z) = 0.7Z^3 + 0.3Z^∞ ,  u'_1j,{k}(Z) = 0.3Z^6 + 0.7Z^∞    (S1 and P1)
u_2j,{k}(Z) = 0.5Z^2 + 0.5Z^∞ ,  u'_2j,{k}(Z) = 0.6Z^7 + 0.4Z^∞    (S2 and P2)
u_i1,{1}(Z) = 0.6Z^7 + 0.4Z^∞ ,  u'_i1,{1}(Z) = 0.7Z^1 + 0.3Z^∞    (R1 and P3 in subtask 1)
u_i1,{2}(Z) = 0.5Z^8 + 0.5Z^∞ ,  u'_i1,{2}(Z) = 0.9Z^2 + 0.1Z^∞    (R2 and P4 in subtask 1)
u_i2,{3}(Z) = 0.4Z^9 + 0.6Z^∞ ,  u'_i2,{3}(Z) = 0.7Z^1 + 0.3Z^∞    (R3 and P5 in subtask 2)
u_i2,{4}(Z) = 0.7Z^10 + 0.3Z^∞ , u'_i2,{4}(Z) = 0.5Z^3 + 0.5Z^∞    (R4 and P6 in subtask 2)

The u-transforms representing the performance of the completion times θ_ij,{k} are then obtained as follows:

Ω1(u_1j,{k}, u'_1j,{k}) = 0.21Z^6 + 0.79Z^∞     (S1 & P1)
Ω2(u_2j,{k}, u'_2j,{k}) = 0.3Z^7 + 0.7Z^∞      (S2 & P2)
Ω3(u_i1,{1}, u'_i1,{1}) = 0.42Z^7 + 0.58Z^∞     (R1 & P3)
Ω4(u_i1,{2}, u'_i1,{2}) = 0.45Z^8 + 0.55Z^∞     (R2 & P4)
Ω5(u_i2,{3}, u'_i2,{3}) = 0.28Z^9 + 0.72Z^∞     (R3 & P5)
Ω6(u_i2,{4}, u'_i2,{4}) = 0.35Z^10 + 0.65Z^∞    (R4 & P6)

Then:

Γ1(Ω1, Ω2) = 0.21Z^6 + 0.237Z^7 + 0.553Z^∞     (for the data vector)
Γ2(Ω3, Ω4) = 0.42Z^7 + 0.261Z^8 + 0.319Z^∞     (for subtask 1)
Γ3(Ω5, Ω6) = 0.28Z^9 + 0.252Z^10 + 0.468Z^∞    (for subtask 2)

And finally:

Ω7(Γ1, Γ2) = 0.187Z^7 + 0.115Z^8 + 0.698Z^∞
Ω8(Ω7(Γ1, Γ2), Γ3) = 0.089Z^9 + 0.075Z^10 + 0.836Z^∞

This final u-transform represents the performance of Θ:

Pr(Θ = 9) = 0.089 ,  Pr(Θ = 10) = 0.075 ,  Pr(Θ = ∞) = 0.836

From the obtained performance we can calculate the service reliability as follows:

R(θ*) = 0.089 for 9 < θ* ≤ 10 ,  R(∞) = 0.164
W = (0.089 × 9 + 0.075 × 10) / 0.164 = 9.457
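The first composition steps of this example can be cross-checked with a short sketch. The operator semantics used here (series Ω: both parts must succeed and the time is the maximum of the two; parallel Γ: redundant alternatives, so the earliest success wins) are inferred from the numbers in the example; the class and method names are illustrative:

```java
import java.util.HashMap;
import java.util.Map;

/** Minimal u-transform (UMGF) sketch: a polynomial sum of p_i * Z^{t_i},
 *  stored as a map from completion time t_i to probability p_i. Failure is the
 *  term with t = +infinity. The series/parallel semantics are inferred from
 *  the numbers in the analytical example, not stated by the paper. */
class UTransform {

    static final double INF = Double.POSITIVE_INFINITY;

    /** Generic composition: combine every pair of terms with phi, multiplying probabilities. */
    static Map<Double, Double> compose(Map<Double, Double> u1, Map<Double, Double> u2,
                                       java.util.function.DoubleBinaryOperator phi) {
        Map<Double, Double> out = new HashMap<>();
        for (Map.Entry<Double, Double> e1 : u1.entrySet())
            for (Map.Entry<Double, Double> e2 : u2.entrySet())
                out.merge(phi.applyAsDouble(e1.getKey(), e2.getKey()),
                          e1.getValue() * e2.getValue(), Double::sum);
        return out;
    }

    /** Series (Omega): both parts must succeed; completion time is the larger one. */
    static Map<Double, Double> series(Map<Double, Double> u1, Map<Double, Double> u2) {
        return compose(u1, u2, Math::max);
    }

    /** Parallel (Gamma): redundant alternatives; the earliest success wins. */
    static Map<Double, Double> parallel(Map<Double, Double> u1, Map<Double, Double> u2) {
        return compose(u1, u2, Math::min);
    }

    /** Builds p*Z^t + (1-p)*Z^inf. */
    static Map<Double, Double> u(double p, double t) {
        Map<Double, Double> m = new HashMap<>();
        m.put(t, p);
        m.merge(INF, 1.0 - p, Double::sum);
        return m;
    }

    public static void main(String[] args) {
        // Sensor 1 with path P1: detection 0.7*Z^3, communication 0.3*Z^6.
        Map<Double, Double> omega1 = series(u(0.7, 3), u(0.3, 6)); // 0.21*Z^6 + 0.79*Z^inf
        Map<Double, Double> omega2 = series(u(0.5, 2), u(0.6, 7)); // 0.30*Z^7 + 0.70*Z^inf
        System.out.println(parallel(omega1, omega2)); // Gamma1: terms at t = 6, 7 and inf
    }
}
```

Running this reproduces Γ1 = 0.21Z^6 + 0.237Z^7 + 0.553Z^∞ from the example.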
7 Conclusions
The use of sensor networks in computational grids is widespread, and the requirement for online data in computational grids has made the integrated sensor-network-powered computational grid a necessity. Although some fault tolerance models have been developed for grids, little work has been done on FT models for sensor powered grids. In this paper, using the UMGF technique, a reliability model for the SPG has been developed. The other results are outlined below:
- A diagram showing the classified SPG failures has been presented. On the one hand, all failures are classified into two groups, crash-related and QoS-level-related failures; on the other hand, failures can be classified into three sectors: process, resource or sensor, and network failures.
- Using the distributions of the random processing, detection and communication times, a performance measure for different SPG arrangements has been developed.
- Using the UMGF technique, a model for measuring the total task execution time has been presented.
- In the presented model the total time is divided into processing time in resources, detection time in sensors, and communication times on the sensor-RMS and resource-RMS paths.
- The performance of sensors, resources and paths is measured separately.

8 References

[1] Jalote, P. "Fault Tolerance in Distributed Systems"; Prentice-Hall, Inc., 1994.
[2] Stelling, P., Foster, I., Kesselman, C., Lee, C., von Laszewski, G. "A Fault Detection Service for Wide Area Distributed Computations"; In: Proceedings of the Seventh IEEE Symposium on High Performance Distributed Computing, 268-278, 1998.
[3] Czajkowski, K., Fitzgerald, S., Foster, I., Kesselman, C. "Grid Information Services for Distributed Resource Sharing"; In: Proceedings of the Tenth IEEE Symposium on High Performance Distributed Computing, 2001.
[4] Grimshaw, A., Wulf, W., the Legion Team. "The Legion Vision of a Worldwide Virtual Computer"; Communications of the ACM, 1997.
[5] Frey, J., Tannenbaum, T., Foster, I., Livny, M., Tuecke, S. "Condor-G: A Computation Management Agent for Multi-Institutional Grids"; Cluster Computing, Vol. 5, No. 3, 2002.
[6] Hwang, S., Kesselman, C. "A flexible framework for fault tolerance in the grid"; J. Grid Comput. 1, 251-272, 2003.
[7] The Globus project, http://www-fp.globus.org/hbm
[8] Grimshaw, A.S., Ferrari, A., West, E.A. "Mentat". In: Wilson, G.V., Lu, P. (Eds.), "Parallel Programming Using C++"; 382-427 (Chapter 10), 1996.
[9] Gartner, F.C. "Fundamentals of fault-tolerant distributed computing in asynchronous environments"; ACM Comput. Surv. 31(1), 1999.
[10] Marzullo, K. "Tolerating Failures of Continuous-Valued Sensors"; ACM Transactions on Computer Systems, Vol. 8, 284-304, Nov 1990.
[11] Prasad, L., Iyengar, S.S., Kashyap, R.L., Madan, R.N. "Functional Characterization of Fault Tolerant Integration in Distributed Sensor Networks"; IEEE Transactions on Systems, Man and Cybernetics, Vol. 21, 1082-1087, Sep/Oct 1991.
[12] Liu, H.W., Mu, C.D. "Efficient Algorithm for Fault Tolerance in Multi-sensor Networks"; In: International Conference on Machine Learning and Cybernetics, Vol. 2, 1258-1262, August 2004.
[13] Jayasimha, D.N. "Fault tolerance in multi-sensor networks"; IEEE Transactions on Reliability, June 1996.
[14] Koushanfar, F., Potkonjak, M., Sangiovanni-Vincentelli, A. "Fault Tolerance in Wireless Ad-Hoc Sensor Networks"; In: IEEE Sensors, 1491-1496, June 2002.
[15] Ishizuka, M., Aida, M. "Performance Study of Node Placement in Sensor Networks"; In: Proceedings of the 24th International Conference on Distributed Computing Systems Workshops, 598-603, 2004.
[16] Ratnasamy, S., Karp, B., Shenker, S., Estrin, D., Govindan, R., Yin, L., Yu, F. "Data-Centric Storage in Sensor Nets with GHT, a Geographic Hash Table"; Mob. Netw. Appl., 8(4), 427-442, August 2003.
[17] Saleh, I., Agbaria, A., Eltoweissy, M. "In-Network Fault Tolerance in Networked Sensor Systems"; DIWANS'06, Sept 2006.
[18] Neogy, S. "WTMR - A New Fault Tolerance Technique for Wireless and Mobile Computing Systems"; Department of Computer Science & Engineering, Jadavpur University, India, 2007.
[19] Tham, C.K., Buyya, R. "SensorGrid: Integrating Sensor Networks and Grid Computing"; CSI Communications, 24, July 2005.
[20] Levitin, G., Dai, Y.S. "Service reliability and performance in grid system with star topology"; Reliability Engineering & System Safety, 2005.
[21] El-Neweihi, E., Proschan, F. "Degradable systems: A survey of multi-state system theory"; Commun. Statist. Theory Math., 13(4), 405-432, 1984.
[22] Ushakov, I.A. "Universal generating function"; Sov. J. Computing System Science, 24(5), 118-129, 1986.
G-BLAST: A Grid Service for BLAST Purushotham Bangalore, Enis Afgan Department of Computer and Information Sciences University of Alabama at Birmingham 1300 University Blvd., Birmingham AL 35294-1170 {puri, afgane}@cis.uab.edu Abstract – This paper describes the design and implementation of G-BLAST, a Grid Service for one of the most widely used bioinformatics applications, the Basic Local Alignment Search Tool (BLAST). G-BLAST uses the factory design pattern to provide application developers with a common interface for incorporating multiple implementations of BLAST. The process of application selection, resource selection, scheduling, and monitoring is completely hidden from the end-user by the web-based user interfaces, while the programmatic interfaces enable users to employ G-BLAST as part of a bioinformatics pipeline. G-BLAST uses an adaptive scheduler to select the best application and the best set of available resources to provide the shortest turnaround time when executed in a grid environment. G-BLAST has been successfully deployed on a campus and a regional grid, and several BLAST applications have been tested with different combinations of input parameters and computational resources. Experimental results illustrate the overall performance improvements obtained with G-BLAST. Keywords: grid, BLAST, scheduling, usability
1 Introduction
The Basic Local Alignment Search Tool (BLAST) is a sequence analysis tool that performs similarity searches between a short query sequence and a large database of infrequently changing information such as DNA and amino acid sequences [8, 9]. With the rapid development of sequencing technology for the large genomes of several species, the sequence databases have been growing at exponential rates [11]. Facing rapidly expanding target databases and growth in the length and number of queries used in a search, the BLAST programs take significant time to find a match. Parallel computing techniques have helped BLAST speed up searches by distributing search jobs over a cluster of computers. Several parallel BLAST search tools [13, 19] have been demonstrated to be effective in improving BLAST's performance. mpiBLAST [19] and TurboBLAST [13] use database segmentation to distribute a portion of the sequence database to each cluster node; thus, each cluster node only needs to search a query against its portion of the sequence database. Other researchers apply query segmentation to alleviate the burden of search jobs [16, 18]. In query segmentation, a subset of
queries, instead of the database, is distributed to each cluster node, which has access to the whole database. As far as the end-user of the BLAST application is concerned, the final outcome and turnaround time are of interest, and typical users do not really care which of the above techniques were used to generate the final results. The majority of parallel BLAST applications, however, cannot cross the boundary of a computer cluster, i.e., communication among parallel instances of the BLAST algorithms is limited to computing nodes with homogeneous system architectures and operating systems. This limitation heavily encumbers the development of cooperative BLAST applications across heterogeneous computing environments, particularly now that many universities and research institutes have started to build grids to take advantage of the various computational resources distributed across their organizations. The emerging Grid computing technology [12] based on the Service Oriented Architecture (SOA) [20] and Web Services provides an ideal development platform for taking advantage of distributed computational resources. Grid computing not only presents the maximum available data/computing resources to a BLAST search, but also shows its power on some critical issues such as security, load balancing, and fault tolerance. Grid services [20] provide several unique features such as stateful services, notification, and uniform authentication and authorization across different administrative domains. The focus of this paper is to develop a grid service for BLAST that exploits these unique features of grid services and provides ubiquitous access to distributed computational resources, while hiding the various details of application and resource selection, job scheduling, execution, and monitoring. One of the goals of this work is to provide a web-based interface through which users can submit queries, monitor their job status, and access results.
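Query segmentation, as described above, simply partitions the query set across nodes that each hold the full database. A minimal sketch (illustrative only, not code from mpiBLAST, TurboBLAST, or G-BLAST):

```java
import java.util.ArrayList;
import java.util.List;

/** Minimal query-segmentation sketch: split a list of query sequences into
 *  per-node batches; each node then searches its batch against the whole
 *  database. The round-robin policy is an assumption for illustration. */
class QuerySegmentation {

    /** Round-robin assignment of queries to n nodes. */
    static List<List<String>> segment(List<String> queries, int nodes) {
        List<List<String>> batches = new ArrayList<>();
        for (int i = 0; i < nodes; i++) batches.add(new ArrayList<>());
        for (int i = 0; i < queries.size(); i++)
            batches.get(i % nodes).add(queries.get(i));
        return batches;
    }
}
```

With three queries and two nodes, node 0 receives the first and third query, and node 1 receives the second.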
The portal can then dispatch the queries to all the available computing resources according to a well-planned scheduling scheme that takes into account the heterogeneity of the resources and the performance characteristics of BLAST on these resources, based on the query length, the number of queries, and the database used. An additional goal of this effort is to provide application developers with a common interface so that different implementations of BLAST can be easily
incorporated within the G-BLAST service. A scheduler can then select the best application (multithreaded, query split, or database split) for the available resources and dispatch the job to the appropriate resource(s). There is no need for the end-user to be concerned about which version of the BLAST application was used or which computational resource(s) were used to execute the application. The rest of this paper is organized as follows. Section 2 presents the overall architecture of G-BLAST and describes its various components. The experimental setup and deployment details used to test G-BLAST are provided in Section 3. Other related work is described in Section 4, and a summary and conclusions are provided in Section 5.
2 G-BLAST Architecture
The overall architecture of G-BLAST is illustrated in Figure 1. G-BLAST has the following four key components: (a) G-BLAST Core Service: provides a uniform interface through which a specific version of BLAST can be instantiated. This enables application developers to extend the core interface and incorporate newer versions of BLAST applications. (b) User Interfaces: provide web and programmatic interfaces for file transfer, job submission, job monitoring and notification. These interfaces support user interactions without exposing any of the details about the grid environment and the application selection process. (c) Scheduler: selects the best available resource and application for a user request using a two-level adaptive scheduling scheme. (d) BLAST Grid Services: individual grid services for each of the BLAST variations, deployed on each of the computational resources.

[Figure: users interact through a client program or web interface; the invoker queries the scheduler, which consults the AIS (application information) and GIS (resource information), and dispatches jobs to the per-resource BLAST grid services, with results and notifications returned to the user.]

Fig 1. Overall architecture of G-BLAST
2.1 G-BLAST Core Service
A BLAST Grid Service with a uniform Grid service interface is deployed on each of the computing resources. It sits between the Invoker and each implementation of the BLAST programs. No matter what kind of BLAST programs are deployed on a resource, the BLAST Grid service covers the differences and provides the fundamental features. To help developers integrate individual BLAST instances into the G-BLAST framework, the BLAST Grid service defines the following methods for each instance:
1. UploadFile: upload query sequences to a compute node.
2. DownloadFile: download query results from the compute node.
3. RunBlast: invoke the corresponding BLAST programs on the compute node(s).
4. GetStatus: return the current status of the job (i.e., pending, running, done).
5. NotifyUser: notify the user once the job is complete and the results are available.
With G-BLAST, developers can easily add new BLAST services (corresponding to the BLAST programs and the computing resources supporting them) without modifying any G-BLAST core source code. In addition, developers can add new BLAST services on the fly, without interrupting any of the other G-BLAST services. To accommodate such functionality, G-BLAST employs the creational design pattern "factory method" [22] to enable the invoker to call newly built BLAST services without changing its source code. To integrate their BLAST programs into this framework, developers create and deploy Grid services on each of the computing resources in the Grid. As shown in Figure 2, Invoker and BLASTService are two abstract classes representing, respectively, the invoker in the G-BLAST service core and the BLAST services on the computing resources. When a new BLAST service (e.g., mpiBLAST) is added to the system, the relevant invoker (mpiInvoker) for that service must be integrated as a subclass of the class Invoker.
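The five methods above might be captured in an interface along the following lines. This is a sketch: the paper does not give the actual G-BLAST signatures, so the parameter and return types here are assumptions, and the stub exists only to illustrate the contract:

```java
/** Sketch of a uniform BLAST grid-service interface exposing the five methods
 *  listed in the text. Signatures are illustrative assumptions. */
interface BLASTService {
    void uploadFile(String localPath, String remotePath);   // query sequences to a compute node
    void downloadFile(String remotePath, String localPath); // query results from the compute node
    String runBlast(String queryFile, String database);     // invoke BLAST, return a job id
    String getStatus(String jobId);                         // "pending", "running", or "done"
    void notifyUser(String jobId, String userAddress);      // completion notification
}

/** Trivial in-memory stub used only to illustrate the contract. */
class StubBLASTService implements BLASTService {
    private final java.util.Map<String, String> status = new java.util.HashMap<>();
    private int next = 0;
    public void uploadFile(String local, String remote) { /* no-op in the stub */ }
    public void downloadFile(String remote, String local) { /* no-op in the stub */ }
    public String runBlast(String queryFile, String database) {
        String id = "job-" + (next++);
        status.put(id, "pending");   // a real service would hand off to a scheduler
        return id;
    }
    public String getStatus(String jobId) { return status.getOrDefault(jobId, "unknown"); }
    public void notifyUser(String jobId, String addr) { /* would send a notification */ }
}
```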
When the invoker wants to call the new BLAST service, it first creates an instance of mpiInvoker, and then lets the new invoker generate an instance of mpiBLAST by calling the member function CreateService(). Thus, the invoker does not need to hard-code the instantiation of each type of BLAST service.

[Figure: abstract classes Invoker (CreateService(), SendQuery(), ...) and BLASTService (UploadFile(), DownloadFile(), ...), with concrete subclasses mpiInvoker and mpiBLAST; mpiInvoker.CreateService() returns new mpiBLAST.]

Fig 2. Factory method for BLAST service

The rest of this section describes each of the key components of G-BLAST in detail.
This design pattern encapsulates the knowledge of which BLAST services to create and delegates the responsibility of choosing the appropriate BLAST service(s) to the scheduler (described in Section 2.3). The Invoker may invoke more than one BLAST service, based on the availability of resources, to satisfy user requirements.
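The factory-method arrangement of Fig. 2 can be sketched as follows. The class names follow the figure; the method bodies are illustrative assumptions (lowercase method names are used for Java idiom, mapping to the figure's CreateService() and SendQuery()):

```java
/** Factory-method sketch following Fig. 2: a concrete invoker (mpiInvoker)
 *  knows how to create its matching service (mpiBLAST), so the G-BLAST core
 *  never hard-codes which BLAST variant it instantiates. */
abstract class BLASTService {
    abstract String name();
}

abstract class Invoker {
    /** Factory method (CreateService() in Fig. 2): subclasses decide which service to build. */
    abstract BLASTService createService();

    /** Core logic works against the abstract type; no hard-coded subtype here. */
    String sendQuery(String query) {
        BLASTService svc = createService();
        return query + " -> " + svc.name();
    }
}

class mpiBLAST extends BLASTService {
    String name() { return "mpiBLAST"; }
}

class mpiInvoker extends Invoker {
    BLASTService createService() { return new mpiBLAST(); }  // "return new mpiBLAST"
}
```

Adding, say, a TurboBLAST variant would then mean adding one service subclass and one invoker subclass, with no change to the core.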
2.2 User Interfaces
The G-BLAST framework provides unified, integrated interfaces for users to invoke BLAST services over a heterogeneous, distributed Grid computing environment. The interfaces summarize the general functionalities provided by each individual BLAST service and hide the implementation details from the end users. Two user interfaces are currently implemented to satisfy different users' requirements. For users who want to submit queries as part of a workflow, a programmable interface is furnished through a Grid Service. The service data and notification mechanisms supported by Grid Services are integrated into the BLAST Grid service to provide stateful services with better job monitoring and notification. For users who want to submit each query with individual parameter settings and who are familiar with a traditional BLAST interface, such as NCBI BLAST [26], a web interface is implemented for job submission, monitoring, and file management. G-BLAST exploits the notification mechanism [20] provided by grid services in two respects: notification of changes by the BLAST services to the scheduler, and notification of job completion to the end users. Both instances strictly follow the notification protocol. In notification of service changes, the BLAST services are the notification source and the scheduler is the notification subscriber. Whenever the BLAST service on a computing node changes, the service itself automatically notifies the scheduler with up-to-date information. This mechanism keeps the scheduler updated with the most recent status of each BLAST service, and therefore helps the scheduler make informed decisions when selecting computing resources. Notification of job completion has a similar implementation, except that the notification sink is the registered client program.
To facilitate users’ using G-BLAST, a programming template is also provided to guide users’ code their own client program for G-BLAST service invocation. Figure 3 demonstrates the major part of a client program that invokes G-BLAST service by creating Grid service handler, uploading query sequence(s) to the back-end server, submitting a query job, checking the job status, and finally retrieving the query results. In addition to providing a programmatic interface for the end user, the framework also provides a web workspace that supports the needs of a general, non-technical Grid user who prefers graphical user interface to writing code. The most common needs of a general user are file management, job submission, and job monitoring. File management is supported through a web file browser allowing users to upload new query files or download search result files. It is a
simplified version of an FTP client, developed in PHP. The job submission module is made as simple as possible to use: after naming the job for easy reference, the user only provides or selects a search query file and chooses the database to search against. Application selection, resource selection, file mapping, and data transfer are handled automatically by the system. Finally, the job monitoring module presents the user with the list of his or her jobs. It includes a date range, allowing the user to view not only the currently running jobs but the completed jobs as well. When viewing the jobs, the user is given the name of the job, its current status (running, done, pending, or failed) and the job execution start/end time. Upon clicking on the job name, the user can view more detailed information about the query file, the database used, and the start and end times. The user is also given the option to resubmit a job with the same set of parameters or after changing some of the parameters.

// Get command-line argument as Grid Service Handler
URL GSH = new java.net.URL(args[0]);
// Get a reference to the Grid Service instance
GBLASTServiceGridLocator gblastServiceLocator = new GBLASTServiceGridLocator();
GBLASTPortType gblast = gblastServiceLocator.getGBLASTServicePort(GSH);
...
// Query sequence uploading
gblast.FileTransfer(inputFile, src, remote);
...
// Submit query as a job
gblast.BLASTRequest(blastRequest);
jobid = gblast.JobSubmit();
...
// Check query (job) status
gblast.JobStatus(jobid);
...
// Retrieve back the query result
gblast.ResultRetrive(jobid);

Fig 3. Client program to invoke G-BLAST service
2.3 Two-level Adaptive Scheduler
Due to the heterogeneity of available resources in the Grid, system usability as well as application performance can be drastically affected without an efficient scheduler service. Rather than developing a general-purpose metascheduler that tries to schedule every application equally by using the same set of deciding factors, we have created a two-level application-specific scheduler that uses application- and resource-specific information to provide a high-level service for the end user (whether in terms of turnaround time, better resource utilization, or usability of the system). The scheduler collects application-specific information in the Application Information Services (AIS) [3], initially from the developer through the Application Specification Language (ASL) [3] and later from application runs. ASL assists
developers in describing the application requirements, and the AIS acts as a repository of application descriptors that can be used by a resource broker to select the appropriate application. ASL is much like RSL but from the application point of view. It is a language that provides a way for the application developer to specify the requirements imposed by the application. It specifies deployment parameters that have to be fulfilled at runtime, such as required libraries, a specific operating system, the minimum/maximum number of processors required to run, specific input file(s) required, a specific input file format, the minimum/maximum amount of memory and disk space, the type of interconnection network, and so on. Unlike RSL, where only the end user specifies the requirements for their job, ASL allows the application developer or owner to specify requirements for allowing the application to be run (e.g., licensing, subscription fee), thus creating a contract between the user and the developer. The scheduler uses this information in each subsequent decision when selecting the best available resource (say, the resource resulting in the shortest turnaround time, or the cheapest resource, or the most reliable resource). Once the user provides the necessary job information (as described by the JDF), the scheduler obtains a snapshot of the available resources in the Grid and, based on the information obtained from AIS, automatically performs a matching between the JDF and ASL to determine which of the available algorithms and resources will result in the desired performance. For more details on the inner workings of the scheduler please refer to [5, 6].
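As a rough illustration of this matching step, the sketch below filters a resource snapshot against an ASL-style descriptor. All field names (requiredOS, minProcessors, minMemoryMB, and so on) are hypothetical stand-ins for the actual ASL/JDF schemas, and the subsequent ranking by turnaround time, cost, or reliability would follow this filtering step.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the JDF/ASL matching step; the field names
// below are illustrative, not the actual ASL schema.
public class AslMatcher {

    static class AslDescriptor {   // what the developer publishes in AIS
        String requiredOS;
        int minProcessors;
        int minMemoryMB;
    }

    static class Resource {        // one entry in the Grid resource snapshot
        String name, os;
        int processors, memoryMB;
    }

    // Return the resources on which this application may run at all;
    // ranking the survivors happens in a later scheduling step.
    static List<Resource> eligible(AslDescriptor asl, List<Resource> pool) {
        List<Resource> out = new ArrayList<>();
        for (Resource r : pool) {
            if (r.os.equals(asl.requiredOS)
                    && r.processors >= asl.minProcessors
                    && r.memoryMB >= asl.minMemoryMB) {
                out.add(r);
            }
        }
        return out;
    }
}
```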
3 Deployment and Results
3.1 UABGrid
UABGrid is a campus grid that includes computational resources belonging to several administrative and academic units at the University of Alabama at Birmingham (UAB). The campus-wide security infrastructure required to access the various resources on UABGrid is provided by Weblogin [28] using the campus-wide Enterprise Identity Management System [27]. There is a diverse pool of available machines in UABGrid, ranging from mini clusters based on Intel Pentium IV processors and an Intel Xeon based Condor pool to several clusters made up of 64-bit AMD Opteron and Intel Xeon CPUs. Each of the participating departments has complete autonomy over local resource administration, resulting in a true grid setup. Access to individual resources was traditionally made through SSH or command-line GRAM tools, but more recently we have added a general-purpose resource broker [7, 23] and a portal that facilitates resource selection based on the user's job requirements. Since we are focusing our work on BLAST, a common feature of all the resources in UABGrid is that they have BLAST and/or mpiBLAST installed and available for use. The sequence databases on local resources are updated daily by a cron job and formatted appropriately to speed up user query searches.
3.2 Experimental Setup and Results
The scheduler has to select not only the best resource among a set of available resources but also the best version of the BLAST application to deliver the shortest turnaround time for any given user request. In order to develop the knowledge base required to test the capabilities of the scheduler in delivering these goals, we have executed these applications on diverse computer architectures that are representative of the actual resources on UABGrid [4]. On each of these computer architectures, three different versions of the BLAST algorithm (multithreaded, query split, and database split) are executed. Each version of the BLAST application is in turn tested with three protein databases of varying sizes and three different query file sizes (varying number of queries and query lengths). In this section we provide some of the key performance results to illustrate the impact of these various parameters on the overall performance of BLAST queries, and describe how the adaptive scheduler uses these performance results to decide on the appropriate BLAST application and computational resources. Three protein databases available from NCBI's website (ftp://ftp.ncbi.nlm.nih.gov/blast/db/) are used as part of these experiments. The smallest database selected was yeast.nt, a 13 MB database representing protein translations from the yeast genome. As the medium-size database, we selected the non-redundant protein database (nr). It is an 821 MB database with entries from GenPept, Swissprot, PIR, PDF, PDB, and RefSeq [2]. Finally, the largest database selected was the 5 GB est database, or Expressed Sequence Tags. This is a division of GenBank, which contains sequence data and other information on "single-pass" cDNA sequences from a number of organisms [1].
These databases represent a wide range of possible sizes and have been selected to reflect the databases most commonly used by scientists, in order to provide a solid base for the scheduling policies developed as part of this application. 10,000 protein queries were run against the three NCBI protein databases. Query input files are grouped into three categories based on the number of queries: small, medium, and large. A small number of queries is anything less than 100 queries, medium is between 100 and 1,000, while large is anything over 1,000 queries. For any given computer architecture and BLAST version, the performance depends on the following input parameters: individual query lengths, the total number of queries, and the size of the database against which the search is performed. Results indicate that the execution time increases linearly as the length of individual queries increases. Figure 4 provides BLAST execution time for queries of varying lengths using the nr database on three different architectures. Similar experiments with other databases indicate that the execution time increases correspondingly as the database size increases. These experiments also highlight the importance of CPU clock frequency on the overall execution time of a BLAST query as the query length increases. The performance of the query splitting and database splitting approaches was also compared on different architectures
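The grouping rule above is simple enough to state directly in code; this small sketch only restates the thresholds given in the text (fewer than 100 queries is small, 100 to 1,000 is medium, more than 1,000 is large):

```java
// Input-file grouping by number of queries, using the thresholds
// stated in the text (100 and 1,000 queries).
public class QueryGrouping {
    static String group(int numQueries) {
        if (numQueries < 100) return "small";
        if (numQueries <= 1000) return "medium";
        return "large";
    }
}
```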
with different query files and databases. Figure 5 provides the comparison between the query split and database split approaches for the nr database using a 10,000-query input file. From the diagram, we can observe that BLAST maintains a nearly linear speedup up to 16 CPUs regardless of the algorithm used. Overall performance results indicate that the query-splitting algorithm outperforms the database-splitting algorithm by almost a factor of two. Using different databases shows similar results as long as the size of the database is less than the amount of main memory available. Once the database no longer fits in main memory, however, the database-splitting algorithm outperforms the query-splitting algorithm due to the reduced I/O that results from keeping only a portion of the original database in memory.

Fig 4. BLAST application performance as a function of query length (execution time vs. query length for 1,000 queries against the nr database, on Xeon EM64T 3.2 GHz, Xeon 32-bit 2.66 GHz, and Opteron 1.6 GHz, each with a linear fit).
Testing the validity of resource selection involved submitting a number of identical jobs while varying resource availability. We varied the number of available CPUs as well as the availability of resources of different architectures and capabilities. Table 1 shows the performance of multithreaded BLAST on different architectures for 10 queries of varying lengths with the nr database. These results indicate that CPUs with higher clock frequencies and larger caches, along with more memory, outperform their slower counterparts. In addition, hyper-threading seems to offer significant performance improvements on the Intel Xeon EM64T architecture.

Fig 5. Direct comparison of execution time of the query splitting and database splitting versions of BLAST with a varying number of processors (2, 4, 8, 10, and 16 CPUs; 10,000-query input file).
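The results above suggest a simple selection rule the scheduler can apply: prefer query splitting while the database fits in main memory, and switch to database splitting once it does not. The sketch below is a hedged reading of that rule only, not the scheduler's actual implementation, which also weighs resource availability and application history [5, 6]:

```java
// Hedged sketch of the algorithm-selection rule the measurements suggest:
// query splitting wins while the database fits in main memory; beyond
// that, database splitting reduces I/O and wins instead.
public class BlastVersionChooser {
    static String choose(long databaseBytes, long mainMemoryBytes) {
        return databaseBytes < mainMemoryBytes ? "query-split"
                                               : "database-split";
    }
}
```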
Table 1. Performance comparison on different processor types against the nr database (1.1 GB) and an input file with 10 queries. Run times are in seconds for 1, 2, and 3 threads.

Processor Type                                                          |    1 |    2 |    3
Intel Xeon (2.66 GHz, 32 bit, 512 KB L2, 2 GB RAM) – Dual processors    |  508 |  265 |  266
Intel Xeon (3.2 GHz, 64 bit, 2 MB L2, 4 GB RAM) – Dual processors       |  426 |  231 |  180
AMD Opteron (1.6 GHz, 64 bit, 1 MB L2, 2 GB RAM) – Dual processors      |  471 |  243 |  242
Macintosh G5 (2.5 GHz, 64 bit, 512 KB L2, 2 GB RAM) – Dual processors   |  382 |  198 |  ---
Sun Sparc E450 (400 MHz, 64 bit, 4 MB L2, 4 GB RAM) – Quad processors   | 2318 | 1183 |  590
Sun Sparc V880 (750 MHz, 64 bit, 8 MB L2, 8 GB RAM) – Quad processors   | 1211 |  615 |  318
The use of grid technologies and the described scheduler has enabled G-BLAST to move beyond executing on any single resource and to execute users' jobs on multiple resources simultaneously, thus realizing a shorter job turnaround time. The aim of the scheduler is to minimize a job's overall turnaround time. This is achieved through the selection of resources to execute the job from the pool of available resources and the selection of which algorithm to employ on each resource. These two main directives are further complicated by the need to minimize load imbalance across the selected resources. Details on the scheduler implementation can be found in [6], while the results in this paper focus on showing the 'value added' for a user employing G-BLAST. G-BLAST enables execution of a job across multiple resources simultaneously, and because this capability is typically not available elsewhere, it is hard to provide a direct comparison of the obtained results. As such, the provided results focus on the internal functionality of the scheduler, while overall job runtimes of G-BLAST jobs and standard BLAST jobs can be derived from Figure 4. Figure 6 shows the execution of a G-BLAST job across multiple resources using 100 queries and the nr database. Following a job submission, G-BLAST selects resources for execution, and the input data is automatically distributed and submitted to those resources. The figure shows execution of the same job using 16, 8, 4, and 2 processors among the selected machines. As can be seen, the load imbalance across resources is minimized, but not eliminated. This is generally due to inconsistencies in the performance of individual resources that were not predicted by the scheduler, as well as contention between any one fragment and other jobs concurrently submitted to the given resource. All of the results presented above indicate the different intricacies that a typical end-user has to handle while executing BLAST in a grid environment.
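The text does not spell out exactly how the input data is divided among the selected resources, so the following sketch is only one plausible reading: split the query file into fragments whose sizes are proportional to an assumed per-resource speed factor, so that faster resources receive more queries and load imbalance is reduced. The FragmentPlanner class and its speed weights are illustrative inventions, not part of G-BLAST:

```java
// Illustrative sketch of dividing a query file into per-resource
// fragments. Proportional weighting by a per-resource speed factor is an
// assumption; the paper only states that the scheduler tries to minimize
// load imbalance across the selected resources.
public class FragmentPlanner {

    // speeds[i] is the relative throughput of resource i;
    // returns the number of queries assigned to each resource.
    static int[] plan(int totalQueries, double[] speeds) {
        double sum = 0;
        for (double s : speeds) sum += s;

        int[] counts = new int[speeds.length];
        int assigned = 0;
        for (int i = 0; i < speeds.length; i++) {
            counts[i] = (int) Math.floor(totalQueries * speeds[i] / sum);
            assigned += counts[i];
        }

        // hand any leftover queries to the fastest resource
        int fastest = 0;
        for (int i = 1; i < speeds.length; i++) {
            if (speeds[i] > speeds[fastest]) fastest = i;
        }
        counts[fastest] += totalQueries - assigned;
        return counts;
    }
}
```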
The scheduler encapsulates all these details and makes it easier for the end-user to take advantage of a grid environment. By analyzing these experiments, we were able to confirm that the choices the scheduler was making during algorithm selection were indeed accurate. Under the constraints of resource availability, one can infer from the above figures the overall time saved by the user when performing searches
using G-BLAST. We observed that with an average resource availability of 8 CPUs on UABGrid, the maximum time saved by a user was around 75% when compared to executing the same job on a scientist's local, single-processor workstation.
Fig. 6. Individual fragments across resources using 16, 8, 4, and 2 CPUs.

4 Related Work
4.1 BLAST on Grid
Several Grid-based BLAST systems have been proposed to provide flexible BLAST systems that can harness distributed computational resources. GridBLAST [24] is a set of Perl scripts that distribute work over computing nodes on a grid using a simple client/server model. Grid-BLAST [32] employs a Grid Portal User Interface to collect query requests and dispatch those requests to a set of NCSA clusters. Each cluster in the system is added and tuned to accept jobs in an ad hoc way. The major disadvantage of a non-service-based system is that computing resources cannot be integrated into the system automatically, and human intervention is required to adapt a new version of BLAST and new computational resources. The GT3-based BLAST system [10], by contrast, is based on the web services programming model. A meta-scheduler is also used to farm out query requests onto remote clusters. Nevertheless, job submission is still performed through traditional batch submission tools and does not exploit the benefits of SOA and Grid Services.

4.2 Scheduling
Due to the heterogeneity of resources as well as the different application choices in the Grid, resource selection is a hard task to perform correctly. Unlike local schedulers [29, 33], which have much of the necessary information readily available to them, grid meta-schedulers are dependent on the underlying infrastructure. General meta-schedulers such as Nimrod-G [15], AppLeS [21], the Resource Broker from CrossGrid [17], Condor [25], and MARS [14] help the general user by alleviating some of the intricacies of resource selection, automating resource selection across the Grid through application and resource parameter pooling. Due to the aforementioned heterogeneity of applications and resources, a general meta-scheduler simply does not have enough information and support from the middleware to perform an optimal resource selection. To accommodate this need, application-specific meta-schedulers based on application runtime characteristics are used in the G-BLAST framework to adaptively schedule applications on the grid. Runtime information is also used by other software packages (ATLAS [31] and STAPL [30]) to determine the best application-specific parameters on a given architecture.

5 Summary and Conclusions
The overall architecture of G-BLAST, a Grid Service for the Basic Local Alignment Search Tool (BLAST), was presented in this paper. G-BLAST not only enables the execution of the BLAST application in a grid environment but also abstracts the various details of selecting a specific application and computational resource, providing simple interfaces for the end-user. Using the factory design pattern, multiple implementations of BLAST were incorporated into G-BLAST without requiring any change to the Core Interface. The two-level adaptive scheduler and the user interfaces used by G-BLAST enable application selection, resource selection, scheduling, and monitoring without requiring extensive user intervention. G-BLAST was successfully deployed on UABGrid, and different BLAST applications were tested for various combinations of input parameters and computational resources. The performance results obtained by executing various BLAST applications (multithreaded, query split, database split) on different architectures with different databases and query lengths illustrate the role of the adaptive scheduler in improving the overall performance of BLAST applications in a Grid environment. In this paper, we have used BLAST as an example of local alignment search; we plan to extend this architecture to other bioinformatics applications.
REFERENCES [1] (2000, July 11). "Expressed Sequence Tags database," Retrieved June 6, 2005, from http://www.ncbi.nlm.nih.gov/dbEST/ [2] (2004, December 22). "GenBank Overview," Retrieved 4/21, 2005, from http://www.ncbi.nlm.nih.gov/Genbank/ [3] Afgan, E. and P. Bangalore, "Application Specification Language (ASL) – A Language for Describing Applications in Grid Computing," In the Proceedings of The 4th International Conference on Grid Services Engineering and Management - GSEM 2007, Leipzig, Germany, 2007, pp. 24-38. [4] Afgan, E. and P. Bangalore, "Performance Characterization of BLAST for the Grid," In the Proceedings of IEEE 7th International Symposium on Bioinformatics & Bioengineering (IEEE BIBE 2007), Boston, MA, 2007, pp. 1394-1398.
[5] Afgan, E., P. V. Bangalore, and S. V. Peechakara, "Effective Utilization of the Grid with the Grid Application Deployment Environment (GADE)," University of Alabama at Birmingham, Birmingham, AL, UABCIS-TR-2005-0601-1, June 2005. [6] Afgan, E., V. Velusamy, and P. Bangalore, "Grid Resource Broker with Application Profiling and Benchmarking," In the Proceedings of European Grid Conference 2005 (EGC '05), Amsterdam, The Netherlands, 2005, pp. 691-701. [7] Afgan, E., V. Velusamy, and P. V. Bangalore, "Grid Resource Broker using Application Benchmarking," In the Proceedings of European Grid Conference, Amsterdam, The Netherlands, 2005, pp. 10. [8] Altschul, S. F., W. Gish, W. Miller, E. W. Myers, and D. J. Lipman, "Basic local alignment search tool," J Mol Biol, vol. 215, pp. 403-410, 1990. [9] Altschul, S. F., T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman, "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs," Nucleic Acids Res., vol. 25, pp. 3389-3402, 1997. [10] Bayer, M., A. Campbell, and D. Virdee, "A GT3 based BLAST grid service for biomedical research," In the Proceedings of the UK e-Science All Hands Meeting, Nottingham, UK, 2004. [11] Bergeron, B., "Bioinformatics Computing," 1st ed. Upper Saddle River, New Jersey: Pearson Education, 2002. [12] Berman, F., A. Hey, and G. Fox, "Grid Computing: Making The Global Infrastructure a Reality." New York: John Wiley & Sons, 2003, pp. 1080. [13] Bjornson, R. D., A. H. Sherman, S. B. Weston, N. Willard, and J. Wing, "TurboBLAST: A Parallel Implementation of BLAST Built on the TurboHub," In the Proceedings of the International Parallel and Distributed Processing Symposium: IPDPS 2002 Workshops, Ft. Lauderdale, FL, 2002. [14] Bose, A., B. Wickman, and C. Wood, "MARS: A Metascheduler for Distributed Resources in Campus Grids," In the Proceedings of the Fifth IEEE/ACM International Workshop on Grid Computing, Pittsburgh, PA, 2004, pp. 10. [15] Buyya, R., D.
Abramson, and J. Giddy, "Nimrod-G: An Architecture for a Resource Management and Scheduling System in a Global Computational Grid," In the Proceedings of the 4th International Conference and Exhibition on High Performance Computing in Asia-Pacific Region (HPC ASIA 2000), Beijing, China, 2000. [16] Camp, N., H. Cofer, and R. Gomperts, "High-Throughput BLAST," SGI, September 1998. [17] CrossGrid. (2004). "CrossGrid Production resource broker," Retrieved 4/15, 2004, from http://www.lip.pt/computing/projects/crossgrid/crossgridservices/resource-broker.htm [18] Czajkowski, K., S. Fitzgerald, I. Foster, and C. Kesselman, "Grid Information Services for Distributed Resource Sharing," In the Proceedings of the 10th IEEE Symposium on High Performance Distributed Computing, 2001.
[19] Darling, A. E., L. Carey, and W.-c. Feng, "The Design, Implementation, and Evaluation of mpiBLAST," In the Proceedings of the ClusterWorld Conference & Expo in conjunction with the 4th International Conference on Linux Clusters: The HPC Revolution 2003, San Jose, CA, 2003. [20] Foster, I., C. Kesselman, J. Nick, and S. Tuecke, "The Physiology of the Grid: An Open Grid Services Architecture for Distributed Systems Integration," Global Grid Forum, June 22, 2002. [21] Berman, F., R. Wolski, S. Figueira, J. Schopf, and G. Shao, "Application-Level Scheduling on Distributed Heterogeneous Networks," In the Proceedings of Supercomputing '96, Pittsburgh, PA, 1996, pp. 28. [22] Gamma, E., R. Helm, R. Johnson, and J. Vlissides, "Design Patterns," 1st ed: Addison-Wesley Professional, 1995. [23] Huedo, E., R. S. Montero, and I. M. Llorente, "A Framework for Adaptive Execution on Grids," Journal of Software - Practice and Experience, vol. 34, pp. 631-651, 2004. [24] Krishnan, A., "GridBLAST: A High Throughput Task Farming GRID Application for BLAST," In the Proceedings of BII, Singapore, 2002. [25] Litzkow, M., M. Livny, and M. Mutka, "Condor - A Hunter of Idle Workstations," In the Proceedings of the 8th International Conference on Distributed Computing Systems, June 1988, pp. 104-111. [26] NCBI. (2004, November 15). "NCBI BLAST," Retrieved 4/21, 2005, from http://www.ncbi.nlm.nih.gov/BLAST/ [27] Puljala, R., R. Sadasivam, J.-P. Robinson, and J. Gemmill, "Middleware: Single Sign On Authentication and Authorization for Groups," In the Proceedings of the ACM Southeastern Conference, Savannah, GA, 2003. [28] Robinson, J.-P., J. Gemmill, P. Joshi, P. Bangalore, Y. Chen, S. Peechakara, S. Zhou, and P.
Achutharao, "Web-Enabled Grid Authentication in a Non-Kerberos Environment," In the Proceedings of Grid 2005 - 6th IEEE/ACM International Workshop on Grid Computing, Seattle, WA, 2005. [29] Veridian Systems, "OpenPBS v2.3: The Portable Batch System Software," 2004. [30] Thomas, N., G. Tanase, O. Tkachyshyn, J. Perdue, N. M. Amato, and L. Rauchwerger, "A Framework for Adaptive Algorithm Selection in STAPL," In the Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), Chicago, IL, 2005. [31] Whaley, R. C., A. Petitet, and J. Dongarra, "Automated empirical optimizations of software and the ATLAS project," Parallel Computing, vol. 27, pp. 3-35, 2001. [32] Yong, L., "Grid-BLAST: Building A Cyberinfrastructure for Large-scale Comparative Genomics Research," In the Proceedings of the 2003 Virtual Conference on Genomics and Bioinformatics, 2003. [33] Zhou, S., "LSF: Load Sharing in Large-scale Heterogeneous Distributed Systems," In the Proceedings of the Workshop on Cluster Computing, 1992.