Modelization and Performance Evaluation of the Diet Middleware

Eddy Caron, Benjamin Depardon and Frédéric Desprez
University of Lyon. LIP Laboratory. UMR CNRS - ENS Lyon - INRIA - UCBL 5668, France
Email: {Eddy.Caron,Benjamin.Depardon,Frederic.Desprez}@ens-lyon.fr
1 Introduction
Using distributed resources to solve large problems ranging from numerical simulations to life science is nowadays a common practice [1,2]. Several approaches exist for porting these applications to a distributed environment; examples include classic message-passing, batch processing, web portals, and GridRPC systems [3]. In this last approach, clients submit computation requests to a meta-scheduler (also called an agent) that is in charge of finding suitable servers for executing the requests within the distributed resources. Scheduling is applied to balance the work among the servers. A list of available servers is sent back to the client, which is then able to send the data and the request to one of the suggested servers to solve its problem. There exist several grid middleware systems [4] that tackle the problem of finding services available on distributed resources, choosing a suitable server, executing the requests, and managing the data. Several environments, called Network Enabled Servers (NES) environments, have been proposed. Most of them share a common characteristic: they are built from broadly three main components: clients, which are applications that use the NES infrastructure; agents, which are in charge of handling the clients' requests (scheduling them) and of finding suitable servers; and finally computational servers, which provide the computational power to solve the requests. Some middleware rely only on a basic hierarchy of elements, a star graph, such as Ninf-G [5], NetSolve [6] and GridSolve [7]. Others, in order to divide the load at the agent level, can have a more complex hierarchy shape: WebCom-G [8] and Diet [9]. In this latter case, a problem arises: what is the best shape for the hierarchy?
Modeling the behavior of middleware, and more specifically its needs in terms of computations and communications at the agent and server levels, can be of great help when deploying the middleware on a computing platform. The literature does not provide many papers on the modeling and evaluation of distributed middleware. In [10], Tanaka et al. present a performance evaluation of Ninf-G; however, no theoretical model is given. In [11], P.K. Chouhan presents a model for hierarchical middleware, and algorithms to deploy a hierarchy of schedulers on cluster and grid environments. She also compares the model with the Diet middleware. However, a severe limitation of these latter works is that only one kind of service can be deployed in the hierarchy. Such a constraint is of course not desirable, as nowadays many applications rely on workflows of services. Hence the need to extend the previous models and algorithms to cope with hierarchies supporting several services.
In this paper, we will focus on one particular hierarchical NES: Diet¹ (Distributed Interactive Engineering Toolbox). The Diet component architecture is structured hierarchically as a tree to obtain improved scalability. Diet comprises several components. Clients use the Diet infrastructure to solve problems using a remote procedure call (RPC) approach. SeDs, or server daemons, act as service providers, exporting functionalities via a standardized computational service interface. Finally, agents facilitate the service location and invocation interactions of clients and SeDs. Collectively, a hierarchy of agents provides higher-level services such as scheduling and data management. These services are made scalable by distributing them across a hierarchy of agents composed of a single Master Agent (MA) (the root of the hierarchy) and potentially several Local Agents (LA) (internal nodes).
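To make this structure concrete, here is a minimal Python sketch of the tree just described: an MA at the root, LAs as internal nodes, SeDs as leaves, with requests forwarded down and replies aggregated up. The class and attribute names are ours, chosen for illustration; they do not reflect the actual Diet API.

    # Hypothetical illustration of the Diet hierarchy, not the real API.
    class SeD:
        """Server daemon: estimates the cost of serving a request."""
        def __init__(self, name, load=0.0):
            self.name, self.load = name, load

        def handle(self, request):
            # A real SeD returns a performance estimate; we fake one here.
            return (self.load, self.name)

    class Agent:
        """Master Agent or Local Agent: forwards requests, aggregates replies."""
        def __init__(self, name, children):
            self.name, self.children = name, children

        def handle(self, request):
            replies = [child.handle(request) for child in self.children]
            return min(replies)  # keep the reply with the best (lowest) estimate

    # A toy hierarchy: one MA, two LAs, four SeDs.
    ma = Agent("MA", [
        Agent("LA1", [SeD("sed1", 0.8), SeD("sed2", 0.3)]),
        Agent("LA2", [SeD("sed3", 0.5), SeD("sed4", 0.9)]),
    ])
    print(ma.handle("dgemm 100"))  # -> (0.3, 'sed2'): best server for the client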
We first briefly present in Section 2 our model for hierarchical middleware, and a description of our algorithm to automatically build a hierarchy. Then, experimental results are presented in Section 3, before concluding.
¹ http://graal.ens-lyon.fr/DIET
2 Model for hierarchical middleware
We do not detail the whole model and the algorithm in this paper; instead, we concentrate on the experimental results presented in the next section, and we strongly encourage the reader to refer to our research report [12]. In this section, we give an overview of our model.
2.1 Resource architecture
In this paper we will focus on the simple case of deploying the middleware on a fully homogeneous, fully connected platform G = (V, E, w, B): V is the set of nodes, E is the set of edges, all nodes have the same processing power w in Mflops/s, and all links have the same bandwidth B in Mb/s. We do not take into account contention in the network.
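As an illustration only, this platform model translates directly into a small data structure; the field names follow the notation above, and the numeric values are arbitrary examples, not measurements.

    # Hypothetical transcription of the platform model G = (V, E, w, B).
    from dataclasses import dataclass

    @dataclass
    class Platform:
        nodes: list[str]   # V: the set of nodes
        w: float           # processing power of every node, in Mflops/s
        B: float           # bandwidth of every link, in Mb/s
        # E is implicit: the platform is fully connected and network
        # contention is ignored, so edges need not be stored.

    platform = Platform(nodes=[f"node{i}" for i in range(50)], w=4800.0, B=1000.0)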
2.2 Server and agent model
We aim at deploying a hierarchy of agents and servers that is able to support several services. As we have multiple services in the hierarchy, our goal cannot be to maximize the global throughput of completed requests (number of completed requests per time unit) regardless of the kind of service. Instead, our goal is to obtain for each service a throughput such that all services receive almost the same ratio of obtained throughput to requested throughput (a ratio we of course try to maximize), while having as few agents in the hierarchy as possible, so as not to use more resources than necessary. Our model is divided into two parts: the server model and the agent model. Requests are sent from the root of the hierarchy down to the servers, where each server evaluates the time it would take to execute the service. Replies are then sent up the tree; each agent aggregates the replies and selects the best server. In the end, the client receives the best server to execute the request, and can then directly contact this server to send the relevant data and execute the request. Our model takes into account the communication costs within the hierarchy (request forwarding and replies) as well as the computation costs at the agent level (time to process a request, and time to aggregate the replies) and at the server level (time to evaluate a request, and time to effectively execute a request).
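Although we do not reproduce the full model here, the objective can be summarized as a max-min program; the notation below is ours, condensed from [12]. Writing S for the set of services, rho_s for the throughput obtained for service s and rho*_s for its requested throughput:

    \max \; \min_{s \in \mathcal{S}} \; \frac{\rho_s}{\rho_s^{*}}
    \quad \text{subject to the computation } (w) \text{ and communication } (B)
    \text{ capacity constraints at every agent and server,}

while, among the hierarchies achieving this ratio, preferring those with the fewest agents.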
2.3 Building a Diet hierarchy

Based on the server and agent models, we designed an algorithm to automatically build a Diet hierarchy.
The algorithm is based on a bottom-up approach. We first distribute the available nodes between the different services. From this repartition we can compute the throughput the servers can offer. Then, with the remaining nodes, we try to build a hierarchy of agents able to support this load. The construction of the hierarchy is done level by level: we first try to add enough agents to support the server level, i.e., allocate children to the agents (as the load of an agent is linear in its number of children), and verify that the computations and communications do not exceed the platform capacities w and B (resource constraints). If a single agent is enough to support all the servers, then our hierarchy is built. Otherwise, we need to add at least one new level of agents on top of the previous one. We repeat this scheme as long as we have enough nodes to build new levels, and as long as we do not have a single agent at the top of the hierarchy. If no suitable hierarchy can be found, we reduce the number of nodes allocated to the servers, which incidentally reduces the services' throughput, and repeat the same scheme, i.e., we try to build a hierarchy of agents with the remaining nodes. Note that the load of an agent depends on the number of children of each service type and on the service throughputs. The agent level construction (i.e., node allocation and children repartition) is done using a Mixed Integer Linear Program (MILP). We do not detail the algorithm in this paper; instead we encourage the reader to refer to our research report [12].
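The following Python sketch illustrates the overall bottom-up loop under the homogeneous-platform assumption. It is not the actual algorithm: the MILP that allocates nodes and children at each level is replaced by a simple per-agent children cap (max_children), which stands in for the w and B resource constraints.

    # Simplified sketch of the bottom-up construction, not the MILP of [12].
    def build_hierarchy(n_nodes, n_servers, max_children):
        """Return the number of agents per level, bottom-up, or None on failure.

        n_servers nodes are given to the services; the rest must host agents.
        """
        remaining = n_nodes - n_servers
        level_width = n_servers            # children the next level must support
        levels = []
        while remaining > 0:
            agents = -(-level_width // max_children)   # ceiling division
            if agents > remaining:
                return None                # not enough nodes for this level
            levels.append(agents)
            remaining -= agents
            if agents == 1:                # a single agent on top: done
                return levels
            level_width = agents           # this level must itself be supported
        return None

    # Outer loop: reduce the servers' share until a suitable tree fits.
    for n_servers in range(45, 0, -1):
        levels = build_hierarchy(50, n_servers, max_children=10)
        if levels is not None:
            print(n_servers, "servers; agents per level (bottom-up):", levels)
            break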
3 Experimental Results
In this section, we present comparisons between the theoretical predictions and real experiments with the Diet middleware. We also show that our bottom-up approach based on an MILP to build the hierarchy gives really good results compared to classical deployments. We deployed two kinds of services: a matrix multiplication using dgemm [13] (for various matrix sizes), and the computation of Fibonacci numbers using a naive algorithm. We will denote by dgemm x the call to a dgemm service on matrices of size x × x, and by Fibonacci x the call to a Fibonacci service to compute the Fibonacci number for n = x.

3.1 Theoretical model / experimental results comparison
In order to validate our model, we conducted experiments with Diet on the French experimental testbed Grid'5000 [14]. Using benchmarks made on the Sagittaire cluster and our bottom-up algorithm, we generated hierarchies for various combinations of dgemm and Fibonacci services, for platforms with a number of nodes ranging from 3 to 50. Even though the algorithm is based on an MILP, it took only a few seconds to generate the 35 hierarchies for our experiments. Our goal here is to stress Diet, so we use relatively small services. We compared the measured and predicted throughput for various service combinations: dgemm 100 and Fibonacci 30 (medium size services), dgemm 10 and Fibonacci 20 (small size services), dgemm 10 and Fibonacci 40 (small size dgemm, large size Fibonacci), dgemm 500 and Fibonacci 20 (large size dgemm, small size Fibonacci), and dgemm 500 and Fibonacci 40 (large size dgemm, large size Fibonacci). We tried to obtain the same throughput for each service: our target throughput is 10000 (which is higher than what we can expect on the platform).

The experimental protocol is the following: we first deploy the hierarchy with GoDiet [15], then we run clients for both services and wait 10 seconds so that the load stabilizes. Finally, we measure the throughput during 60 seconds before killing all clients and the Diet hierarchy. Each client runs on a separate node and sends one request at a time: once a request is completed, a new one is sent; this is repeated indefinitely.
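For illustration, this measurement protocol can be sketched as follows; solve() is a placeholder for a real Diet request, and the thread count is arbitrary (in the real experiments each client runs on its own node).

    # Hypothetical sketch of the closed-loop clients and the 10 s / 60 s protocol.
    import threading, time

    WARMUP, WINDOW = 10.0, 60.0            # durations taken from the protocol above
    count, counting, stop = 0, False, False
    lock = threading.Lock()

    def solve():
        time.sleep(0.01)                   # placeholder for a real Diet call

    def client():
        global count
        while not stop:
            solve()                        # one request at a time, indefinitely
            if counting:
                with lock:
                    count += 1

    threads = [threading.Thread(target=client) for _ in range(8)]
    for t in threads: t.start()
    time.sleep(WARMUP)                     # let the load stabilize
    counting = True
    time.sleep(WINDOW)                     # measurement window
    counting = False
    stop = True
    for t in threads: t.join()
    print("throughput:", count / WINDOW, "requests/s")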
We used a 79-node cluster: the Sagittaire cluster from the Lyon site. Each node has an AMD Opteron 250 CPU at 2.4 GHz, with 2 GB of memory. All these nodes are connected to a Gigabit Ethernet network supported by an Extreme Networks BlackDiamond 8810. The operating system deployed with kadeploy3 on the Sagittaire nodes was a Debian 2.6.24.3 x86_64. Finally, the chosen CORBA implementation was omniORB 4.1.2², as Diet relies on CORBA for communications; and we used gcc 4.3.2. For the dgemm service, we relied on the ATLAS library [16].
Table 1 presents the relative error between the theoretical and experimental throughput; a dash in the table means that the algorithm returned the same hierarchy as with fewer nodes, and hence the results are the same. As can be seen, the experimental results closely follow what the model predicted. Higher errors can be observed for really small services (dgemm 10 and Fibonacci 20). This is due to the fact that really small services are harder to benchmark: the benchmarked time for a server to predict the execution time, and the time to execute a service, are higher than what they should be.
3.2 Relevancy of the servers repartition
We also compared the throughputs obtained by our algorithm with the ones obtained by a balanced star graph (i.e., a star graph where all services receive the same number of servers). A balanced star graph is the naive approach that is generally used when the same throughput is requested for all services, which is what we aimed at in our experiments. Figures 1a to 1e present the comparison between the throughputs obtained with both methods. As can be seen, our algorithm gives better results: the throughput is better in all but one experiment, and no more resources than necessary are used (in Figure 1a, no more than 20 nodes are required to obtain the best throughput, and in Figure 1b only 3 nodes).
The only experiment where the balanced star graph produced a higher throughput is for dgemm 500 in the dgemm 500, Fibonacci 40 experiment (see Figure 1e). However, this is at the expense of the second service, which has a lower throughput. The dgemm throughput loss observable in Figure 1d is due to the load at the agent level (this phenomenon is predicted by our model). Using our algorithm, we are able to increase the throughput by up to 700% (see Figure 1b for instance).
² http://omniorb.sourceforge.net/
                                       Number of nodes
Experiment                     3       5       10      20      30      40      50
dgemm 100, Fibonacci 30
  dgemm                      1.87%   1.58%   4.17%   4.39%    -       -       -
  Fibonacci                  1.76%   1.26%   0.26%   2.28%    -       -       -
dgemm 10, Fibonacci 20
  dgemm                      27.4%    -       -       -       -       -       -
  Fibonacci                  27.3%    -       -       -       -       -       -
dgemm 10, Fibonacci 40
  dgemm                      27.2%   29.1%   28.1%   28.7%   28.2%   22.9%   23.9%
  Fibonacci                  14.6%   10.4%    9.7%    8.0%    6.7%    1.4%    2.3%
dgemm 500, Fibonacci 20
  dgemm                       1.1%    0.7%    3.0%    3.0%    4.7%    9.5%   19.1%
  Fibonacci                  31.8%   32.6%   26.4%   25.8%   20.1%    5.8%   10.4%
dgemm 500, Fibonacci 40
  dgemm                       2.2%    7.9%    3.9%    2.8%    0.0%    0.5%    0.3%
  Fibonacci                  10.5%   10.5%    8.2%    8.7%    5.7%    4.6%    0.3%

Table 1: Relative error |theoretical - experimental| / theoretical.
3.3 Relevancy of creating new agent levels
In order to validate the relevancy of our algorithm in creating hierarchies, we compared the throughput obtained with our hierarchies to the one obtained with a star graph having exactly the same repartition of servers as computed by our algorithm. We thus aim at validating the fact that our algorithm creates several levels of agents whenever required. In the previous experiments, this happens only for the following deployments: dgemm 100, Fibonacci 30 on 20 nodes (1 MA and 3 LA), and dgemm 500, Fibonacci 20 on 30, 40 and 50 nodes (1 MA and 2 LA). We present the gains and losses obtained by the star graph deployments compared to the results obtained with the hierarchies our algorithm computed:
- dgemm 100, Fibonacci 30 on 20 nodes: we obtained a throughput of 911.95 requests per second for dgemm and 779.07 requests per second for Fibonacci, compared to respectively 1624.4 and 1699.0 with our algorithm. Hence a loss of 43.8% and 54.1% respectively.
- dgemm 500, Fibonacci 20 on 30 nodes: we obtained a throughput of 211.65 requests per second for dgemm and 3490.58 requests per second for Fibonacci, compared to respectively 224.0 and 4845.85 with our algorithm. Hence a loss of 5.5% and 28% respectively.
- dgemm 500, Fibonacci 20 on 40 nodes: we obtained a throughput of 238.1 requests per second for dgemm and 2476.9 requests per second for Fibonacci, compared to respectively 281.35 and 2391.6 with our algorithm. Hence a loss of 15.4% for dgemm, and a gain of 3.6% for Fibonacci.
- dgemm 500, Fibonacci 20 on 50 nodes: we obtained a throughput of 184.68 requests per second for dgemm and 2494.0 requests per second for Fibonacci, compared to respectively 311.7 and 2973.3 with our algorithm. Hence a loss of 40.7% and 16.1% respectively.

We can see from the above results that our algorithm creates new levels of agents whenever required: without the new agent levels the obtained throughput can be much lower. This is due to the fact that the MA becomes overloaded, and thus does not have enough time to schedule all requests.
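The loss and gain figures above are relative differences with respect to the throughput obtained with our algorithm; for instance, for the Fibonacci service on 20 nodes:

    \frac{1699.0 - 779.07}{1699.0} \approx 0.541 = 54.1\%.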
4 Conclusion and future work
We presented in this paper a model for hierarchical middleware based on a tree structure, and briefly described our algorithm for automatically finding a suitable hierarchy. Our objective, when deploying several services in the hierarchy, is to provide the best ratio of provided to required throughput.
[Figure 1 (five panels) omitted: (a) dgemm 100, Fibonacci 30; (b) dgemm 10, Fibonacci 20; (c) dgemm 10, Fibonacci 40; (d) dgemm 500, Fibonacci 20; (e) dgemm 500, Fibonacci 40. Each panel plots the dgemm and Fibonacci throughputs (requests/s) against the number of nodes, for our algorithm and for the balanced star graph.]
Fig. 1: Comparison between our algorithm and a balanced star graph.

We conducted several experiments on the Grid'5000 platform with the Diet middleware. The results show that the experimental results closely follow what the model predicted, and that our bottom-up algorithm provides excellent performance. The experiments showed that our algorithm adds new levels of agents whenever required, and that it clearly outperforms the classical approach of deploying the middleware as a balanced star graph.

Several points now need to be taken into account. So far we have only studied the case of a fully homogeneous platform. Even though our model and algorithm can easily be extended to computation-heterogeneous / communication-homogeneous platforms, the case of a fully heterogeneous platform is much more complicated. The model also needs to be extended with client/server communications; but for this we need to know where the clients are, whereas in this paper we considered that we knew nothing about the clients. This assumption does not seem to impact our model. Finally, as the MA can become a bottleneck, and as Diet can have several hierarchies interconnected at the MA level, we could also study this latter case.

5 Acknowledgment
Experiments presented in this paper were carried out using the Grid'5000 experimental testbed, being developed under the INRIA ALADDIN development action with support from CNRS, RENATER and several Universities as well as other funding bodies (see https://www.grid5000.fr).
References

1. Berman, F., Fox, G., Hey, A.J.G.: Grid Computing: Making the Global Infrastructure a Reality. John Wiley & Sons, Inc., New York, NY, USA (2003)
2. Foster, I., Kesselman, C.: The Grid 2: Blueprint for a New Computing Infrastructure. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA (2003)
3. Seymour, K., Lee, C., Desprez, F., Nakada, H., Tanaka, Y.: The end-user and middleware APIs for GridRPC. In: Workshop on Grid Application Programming Interfaces, in conjunction with GGF12, Brussels, Belgium (2004)
4. Caniou, Y., Caron, E., Desprez, F., Nakada, H., Seymour, K., Tanaka, Y.: High performance GridRPC middleware. In Gravvanis, G., Morrison, J., Arabnia, H., Power, D., eds.: Grid Technology and Applications: Recent Developments. Nova Science Publishers (2009)
5. Tanaka, Y., Nakada, H., Sekiguchi, S., Suzumura, T., Matsuoka, S.: Ninf-G: A reference implementation of RPC-based programming middleware for grid computing. Journal of Grid Computing 1(1) (2003) 41–51
6. Casanova, H., Dongarra, J.: NetSolve: a network server for solving computational science problems. In: Supercomputing '96: Proceedings of the 1996 ACM/IEEE Conference on Supercomputing, Washington, DC, USA, IEEE Computer Society (1996)
7. YarKhan, A., Dongarra, J., Seymour, K.: GridSolve: The evolution of a network enabled solver. Grid-Based Problem Solving Environments (2007) 215–224
8. Morrison, J.P., Clayton, B., Power, D.A., Patil, A.: WebCom-G: grid enabled metacomputing. Neural, Parallel & Scientific Computations 12(3) (2004) 419–438
9. Caron, E., Desprez, F.: DIET: A scalable toolbox to build network enabled servers on the grid. International Journal of High Performance Computing Applications 20(3) (2006) 335–352
10. Tanaka, Y., Takemiya, H., Nakada, H., Sekiguchi, S.: Design, implementation and performance evaluation of GridRPC programming middleware for a large-scale computational grid. In: GRID '04: Proceedings of the 5th IEEE/ACM International Workshop on Grid Computing, Washington, DC, USA, IEEE Computer Society (2004) 298–305
11. Chouhan, P.K.: Automatic Deployment for Application Service Provider Environments. PhD thesis, École Normale Supérieure de Lyon (2006)
12. Caron, E., Depardon, B., Desprez, F.: Modelization for the Deployment of a Hierarchical Middleware on a Homogeneous Platform. Technical report, Institut National de Recherche en Informatique et en Automatique (INRIA) (2010)
13. Dongarra, J., et al.: Basic linear algebra subprograms technical forum standard. International Journal of High Performance Applications and Supercomputing 16(1) (2002) 1–111
14. Bolze, R., Cappello, F., Caron, E., Daydé, M., Desprez, F., Jeannot, E., Jégou, Y., Lanteri, S., Leduc, J., Melab, N., Mornet, G., Namyst, R., Primet, P., Quetier, B., Richard, O., Talbi, E.G., Irena, T.: Grid'5000: a large scale and highly reconfigurable experimental grid testbed. International Journal of High Performance Computing Applications 20(4) (November 2006) 481–494
15. Caron, E., Chouhan, P.K., Dail, H.: GoDiet: A deployment tool for distributed middleware on Grid'5000. In: EXPGRID Workshop: Experimental Grid Testbeds for the Assessment of Large-Scale Distributed Applications and Tools, in conjunction with HPDC-15, Paris, France (June 2006) 1–8
16. Whaley, R.C., Petitet, A., Dongarra, J.J.: Automated empirical optimizations of software and the ATLAS project. Parallel Computing 27(1–2) (2001) 3–35