ARC INRIA PROPOSAL
Handling Uncertainties in Large-Scale Distributed Systems (ALEAE)

Anne Benoit GRAAL IPT

Dick Epema TU Delft

Bruno Gaujal MESCAL IPT

Emmanuel Jeannot ALGORILLE IPT

November 18, 2008

Abstract

The goal of ALEAE is to provide models and algorithmic solutions in the field of resource management that cope with uncertainties in large-scale distributed systems. This work will be based on the Grid Workloads Archive designed at TU Delft, Netherlands. Moreover, we will experiment with our solutions to validate the proposed models and evaluate the algorithms, using simulators or large-scale environments such as Grid'5000, in order to improve both models and algorithms.

Keywords: scheduling, uncertainties, large-scale systems, probability, multi-criteria.

1 Introduction

1.1 Motivation

A grid is a set of distributed and heterogeneous resources. In recent years, a lot of work has been done to efficiently manage and use such resources. However, most of the models, algorithms, protocols and programs designed and developed in this context do not take into account all the characteristics of grid systems. Indeed, in large-scale systems, resources are dynamic and shared by many users, and the application behavior is not necessarily known in advance. Therefore, such systems are full of uncertainties that appear at several levels (hardware, software and users).

At the infrastructure level, the hardware that composes a grid can fail (due to component breakdowns), be volatile (due to resources that dynamically join or leave the system) or suffer performance degradation (due to shared usage). Failures and volatility are quantitative uncertainties (the number of available resources changes with time), while performance degradation is a qualitative uncertainty (the efficiency changes with time).

At the application level, work in scheduling and resource management often assumes that the durations of the composing parts (often called tasks) of the application are known. How these tasks are organized and what their resource usage is (disk, network, CPU) is also often assumed to be known. Another common hypothesis is that the application is fully reliable. However, such assumptions are not always true and therefore have to be relaxed.

Lastly, users introduce a lot of uncertainties in distributed environments. We can distinguish two levels of such uncertainties. First, their usage of the system is often unpredictable (users can submit jobs or requests at random times). Second, some users may behave maliciously: they can try to attack the system (denial-of-service (DoS) attacks) or perturb its correct functioning (by sending wrong answers in the case of volunteer computing).

We think it is important to conduct research focused on efficiently dealing with unpredictable or unexpected behaviors. Indeed, as resource management algorithms already cope with heterogeneity, distribution and scale, they must also cope with uncertainty.

There are several ways to handle uncertainties. A first set of methods is called proactive methods. With these methods, one takes preventive actions either to limit the chance that an error or a failure happens (by carefully allocating resources) or to reduce the consequences of errors or failures (by duplicating tasks, for instance). The second set of methods is called reactive methods. In this case, the goal is to take actions when an unexpected event happens, in order to limit or correct its consequences. Good examples of reactive methods are checkpoint-restart strategies (an application that failed is restarted from a previous checkpoint) and migration (tasks are moved to safer resources when errors start to happen). Furthermore, it is also possible to mix proactive and reactive methods, for instance by providing a static scheduling solution and dynamically adapting it. Lastly, it is important to understand that each approach has advantages and drawbacks. Proactive methods are often resource-costly but can provide some guarantees in terms of correct execution. Reactive methods have a management overhead that can greatly hinder performance, but they handle almost every case.

When designing a resource management algorithm that copes with uncertainties, it is necessary to clarify the objectives targeted by this algorithm.
Indeed, different kinds of uncertainties lead to different desired behaviors. For instance, in the case of hardware or software failures, the goal is to improve reliability and fault tolerance; in the case of hardware performance degradation or software unpredictability, robustness1 is a major issue; in the case of Byzantine behavior, the main criterion is correctness. Moreover, classical metrics used in scheduling and resource management, such as makespan, load balance, response time, etc., are still valid and, most of the time, conflict with the new metrics induced by uncertainties. This means that, when designing resource management algorithms, a multi-criteria approach is necessary (for instance, considering both makespan and reliability).

1 A schedule is said to be robust if it is able to absorb some degree of uncertainty in task durations while maintaining a stable solution.

1.2 Goal of this Action

In the literature, there already exists some work that deals with uncertainty. However, this is a new subject and a lot of energy still needs to be put into it. There are many open issues that the community should tackle:

Modeling uncertainty. Before being able to provide efficient uncertainty-aware algorithms, it is important to provide a good model of such uncertainties. To do so, we first need to trace the usage behavior, then to analyze these traces, and finally to model the behavior. Another part of the modeling work concerns metric definition. Mapping a goal (robustness, reliability, etc.) into a metric is not always a trivial task. Indeed, these intuitive notions can be measured in different ways. It is therefore important to study these metrics in order to determine their correlation and choose the most suitable one (a small correlation study is sketched at the end of this section).

Algorithm design. Once the above steps have been fulfilled, it is possible to design algorithmic solutions that handle uncertainties. Such algorithms can be mono-criterion or multi-criteria; they can use static, dynamic or mixed methods; one can try to give theoretical worst-case performance bounds or experimental average behavior, etc. Among the problems we want to address, the "proactive vs. reactive" one will be specifically tackled. As stated above, both methods have advantages and drawbacks. Deciding which one is best suited certainly depends on the problem. Determining the set of problems that best matches each approach is a very important challenge. Moreover, establishing when a mixed approach is possible and profitable is also a key problem.

Evaluating solutions. Testing and experimenting with solutions is of key importance in this area. Indeed, this helps to compare different solutions, validate the models, benchmark implementations, etc.

Large-scale distributed systems are subject to a lot of uncertainties. Such uncertainties come from different levels (hardware, software, users). While there exists some preliminary work on this topic in the literature, many resource management and scheduling issues are still open and need to be addressed. We think that we have a mature understanding of these issues and that the proposed consortium of this Action has the skills and is ready to work on those challenges. Lastly, we expect that, if the research carried out within the ALEAE Action leads to novel and interesting results, an ANR project or a European project within FP7 will be proposed after the funding period.
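To make the metric-correlation question concrete, here is a minimal sketch (the two candidate robustness metrics, the lognormal duration model and all numeric values are illustrative assumptions, not choices made by the Action). It samples the makespan of a few hypothetical schedules and checks how strongly two candidate metrics, the makespan standard deviation and the probability of exceeding a deadline, agree in how they rank the schedules.

```python
# Sketch: rank correlation between two candidate robustness metrics over a
# set of hypothetical schedules. Distributions and values are illustrative.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(1)

# Each "schedule" is summarized by the (mean, sigma) of its lognormal
# makespan distribution; these pairs are made up for the example.
schedules = [(2.0, 0.2), (2.0, 0.5), (2.1, 0.3), (1.9, 0.6), (2.2, 0.1)]
deadline = 10.0

std_metric, risk_metric = [], []
for mu, sigma in schedules:
    makespans = rng.lognormal(mean=mu, sigma=sigma, size=50_000)
    std_metric.append(makespans.std())                 # candidate metric 1
    risk_metric.append((makespans > deadline).mean())  # candidate metric 2

rho, _ = spearmanr(std_metric, risk_metric)
print(f"Spearman rank correlation between the two metrics: {rho:.2f}")
```

A correlation close to 1 would suggest the two metrics are interchangeable for ranking schedules; a weaker correlation would indicate they capture different aspects of robustness and must be studied separately.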

[Figure 1: Action Work Plan. Diagram of a loop: the Grid Workloads Archive feeds a modeling phase that produces models; a design phase produces algorithms; experiments test the algorithms and validate the models, feeding back into modeling.]

2 Work Plan

2.1 General overview

Our work plan is based on the Grid Workloads Archive, a database provided by the TU Delft partner that stores usage traces of large-scale distributed infrastructures. Based on this archive, we want to provide models of the behavior of these infrastructures and of their uncertainties. The models we want to derive will be statistical and probabilistic. Another part of the modeling phase will consist in defining how to measure the goals we want to attain. Based on the modeling phase, we will design algorithms that handle uncertainties. We want to tackle three kinds of uncertainties: those arising from resource availability, those arising from application behavior, and those arising from user behavior. Once a solution is designed, we will evaluate its performance with regard to the defined metrics and objectives. Moreover, by performing real-scale experiments (real applications on real platforms), we will also be able to validate the models. This will help to understand and identify the validity domain of the proposed models. We will therefore loop over our work plan by further improving the accuracy of our models, then designing better algorithms and experimenting with them once again, as shown in Figure 1.

2.2 The Grid Workloads Archive

The Grid Workloads Archive (GWA) [16] is an effort to collect grid workload traces and to make them available to the community. The GWA stores job-level data in a standard grid workload format, which has been purposely designed to allow extensions for higher-level information, such as bags-of-tasks and workflows, jobs that use co-allocated resources, and jobs that use advance reservation. We have also developed a comprehensive set of tools for collecting, processing, and using grid workloads. So far, we have collected for the GWA traces from nine well-known grid environments, with a total content of more than 2,000 users submitting more than 7 million jobs over a period of over 13 operational years, and with working environments spanning over 130 sites comprising 10,000 resources. However, the GWA needs to be extended to become useful to the broader community of resource management in large-scale distributed computing systems. Towards this end, we identify two topics that we will address within the context of this proposal:

Implement on-line trace manipulation tools. Currently, it is not possible to add traces to the GWA without the help of the GWA administrator. In particular, this limits the ability to add monthly updates to the existing traces. Similarly, it is not possible at the moment to select, from the data available online, only the parts that are representative for a specific experiment, such as the busiest day (a possible selection procedure is sketched below).

Design a generic grid workload model. This will be further developed in the next section.
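As an illustration of the kind of on-line selection mentioned above, the following minimal sketch extracts the jobs submitted on the busiest day of a trace. It assumes a CSV export of a trace with a UNIX-timestamp submission column; the column and file names are hypothetical and do not reflect the official GWA format.

```python
# Sketch: selecting the "busiest day" from a job trace, assuming a CSV export
# with a job submission timestamp column. Names are illustrative, not the
# official GWA format.
import pandas as pd

def busiest_day(trace_csv: str, submit_col: str = "SubmitTime") -> pd.DataFrame:
    """Return the subset of jobs submitted on the day with the most submissions."""
    jobs = pd.read_csv(trace_csv)
    # Interpret the submission time as a UNIX timestamp (an assumption).
    submitted = pd.to_datetime(jobs[submit_col], unit="s")
    jobs = jobs.assign(_day=submitted.dt.date)
    day = jobs["_day"].value_counts().idxmax()   # day with the most submitted jobs
    return jobs[jobs["_day"] == day].drop(columns="_day")

if __name__ == "__main__":
    sample = busiest_day("gwa_trace.csv")        # hypothetical file name
    print(len(sample), "jobs on the busiest day")
```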

2.3 Modeling the platforms, their usage and the objective function

Most distributed systems deployed nowadays are characterized by a high volatility of their entities (participants can join and leave at will), a potential instability of the large-scale networks on which concurrent applications are running, and an increasing probability of failure. Therefore, as the size of the system increases, it becomes necessary for it to adapt, as automatically as possible, to the changes of its components. This requires a self-organization capacity of the system with respect to the arrival and departure of participants, data, or resources. As a consequence, it becomes crucial to understand the behavior of large-scale systems, so as to efficiently exploit these infrastructures. In particular, it is essential to design dedicated algorithms and infrastructures handling a large number of users and/or a large amount of data.

For large parallel systems, the non-determinism of parallel composition, the unpredictability of execution times and the influence of the outside world are usually expressed in the form of multidimensional stochastic processes that are continuous in time and have a discrete state space. The state space is often infinite or very large, and several specific techniques have been developed to deal with what is often termed the "curse of dimensionality." The ultimate goal of this part is to provide a model of the environment and applicable ways to measure the performance of algorithmic solutions. Concerning the GWA, for instance, previous work has focused on analyzing [17, 18] and modeling [19] aspects of the GWA data. One research focus for this project is to extend this previous work into a generic grid workload model that focuses on the variability of the arrival process and of the resource demand.

Understanding qualitative and quantitative properties of distributed systems and parallel applications based on measurements and statistical data is therefore a major issue. We will deal with these questions using several complementary tracks:

Robustness, insensitivity and comparisons. Inferring a distribution from a finite sample is very difficult and very hazardous. Actually, in most cases, one chooses a distribution beforehand and only infers its parameters. This may not be very satisfactory from the system designer's point of view. Another difficulty with the stochastic approach is the fact that, when the distributions of the input processes of a task-resource system are complex, it is often very difficult, or even impossible, to compute the distribution of its output (or even to compute its expectation). However, these difficulties are often counterbalanced by the great analytical power of probability theory. Indeed, random processes can also be seen as a simplification of complex deterministic processes. For example, the release dates of many tasks can often be simplified into a single Poisson process, since this is the limit of the superposition of a large number of independent arbitrary processes. In fact, limit theorems are of paramount importance in probability theory. The law of large numbers and the central limit theorem may be the most well-known tools used to easily analyze very complex dynamic systems. Another useful property which helps to overcome the technical difficulties of distribution inference is insensitivity: in some cases, the average behavior of a task-resource system only depends on the expectation (or the first two moments) of the input data and not on the whole distribution. This makes distribution inference less critical. Finally, stochastic structural properties can also be used to derive qualitative properties, based on the structure of the system, that are valid for all distributions. Task-resource system analysis is very sensitive to the initial assumptions. An easy polynomial problem may become NP-hard in the strong sense when a small assumption is changed. The complexity of classical questions, such as computing the makespan of the system, jumps from one complexity class to the next, sometimes in a counter-intuitive way. However, when one focuses on bounds and qualitative properties (such as monotonicity with respect to the input parameters), then the classification of task-resource systems becomes much more stable and there exist several efficient techniques that can be used to tackle them, as detailed in [15].

Structural and qualitative analysis. In scheduling theory, the most common paradigms are of the type "as early as possible" or "best fit." However, in cyclic systems or in systems with infinite horizons, the best schedule can be of the type "as regular as possible" or "as balanced as possible" when the system has some kind of qualitative property such as discrete convexity of the objective function with respect to the input parameters of the system (called multimodularity) [1]. For example, deterministic and stochastic polling [13] and routing systems [14] with N queues have such a multimodular property. In such cases, general theorems assert that regular sequences are optimal scheduling policies when no state information is available to the scheduler [2].

Metric evaluation. In our context, measuring the quality of a solution (duration, robustness, reliability) often requires evaluating the distribution of the output process. However, in the general case, due to dependencies between the random variables given as input, such an evaluation is #P-complete. We need to determine subclasses where the complexity is lower, and we will try to find suitable approximation methods to evaluate such metrics that provide a good trade-off between accuracy and speed (a Monte Carlo sketch is given below).
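Since exact evaluation of the output distribution is #P-complete in general, a natural approximation is Monte Carlo sampling. The sketch below estimates the makespan distribution of a tiny task graph with random task durations and reports a deadline-violation probability; accuracy improves with the sample count at the cost of speed. The three-task DAG, the lognormal duration model and the deadline are illustrative assumptions, not part of the proposal.

```python
# Sketch: Monte Carlo estimation of a makespan distribution for a tiny DAG
# with random task durations. DAG, distributions and deadline are assumptions.
import numpy as np

rng = np.random.default_rng(42)

def sample_makespan(n_samples: int = 100_000) -> np.ndarray:
    # Three tasks: A -> C and B -> C (C starts after both A and B finish).
    a = rng.lognormal(mean=1.0, sigma=0.5, size=n_samples)  # duration of A
    b = rng.lognormal(mean=1.2, sigma=0.3, size=n_samples)  # duration of B
    c = rng.lognormal(mean=0.8, sigma=0.6, size=n_samples)  # duration of C
    return np.maximum(a, b) + c                              # makespan of the DAG

makespans = sample_makespan()
deadline = 8.0
print(f"mean makespan       : {makespans.mean():.2f}")
print(f"95th percentile     : {np.percentile(makespans, 95):.2f}")
print(f"P(makespan > {deadline}) : {(makespans > deadline).mean():.3f}")
```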

2.4 Designing algorithms that handle uncertainties

2.4.1 Proactive methods

As stated previously, large-scale distributed platforms are subject to load variations and potential failures. We will investigate the design of algorithms that can cope with resource changes and/or breakdowns. This is a proactive approach, as explained in Section 1.

A first tool is replication. Consider a simple application consisting of a large computational workload whose constituents are divisible, in the sense that each chunk of work can be partitioned into arbitrary granularity [7]. Assume that we have access to p identical computers to help us compute the workload via work-sharing, and that these computers are subject to unrecoverable interruptions that cause us to lose all work currently in progress on the interrupted computer. We wish to cope with such interruptions, whether they arise from hardware failures or from a loaned/rented computer being reclaimed by its owner, as during an episode of cycle-stealing [8]. Here the goal is to maximize the expected amount of work that gets computed by the p computers, no matter which, or how many, computers get interrupted.

The algorithmic challenges can be described in terms of dilemmas. Sending each remote computer a small amount of work minimizes vulnerability to interruption-induced losses, but it maximizes the impact of per-work overhead and minimizes the opportunities for "parallelism" within the assemblage of remote computers. Replicating work lessens our vulnerability to interruption-induced losses, but it minimizes the expected productivity advantage from having access to remote computers. Because communication to remote computers is likely to be costly in time and overhead, we limit such communication by orchestrating work replications in an a priori, static manner, rather than dynamically in response to observed interruptions. While we thereby duplicate work unnecessarily when there are few interruptions among the remote computers, we also prevent the server from becoming a communication bottleneck. Preliminary work shows that good trade-offs are very difficult to achieve [4], even with fully identical resources (same-speed processors that are subject to the same simple linear failure probability distribution). One objective of this Action is to extend the approach to heterogeneous resources, which may obey different speed and risk characteristics.

Moving from embarrassingly parallel workloads to more realistic (but also more complex) applications will induce further algorithmic challenges. In this proposal we target workflow applications [3] that continuously operate on a stream of data sets. Each workflow is partitioned into tasks that are linked by simple constraints, such as a linear chain of precedence. In steady state, data are pumped from one task to its successor. The problem is to map tasks onto resources so that communications and computations are organized in a way that optimizes (usually conflicting) optimization criteria. Even with a single workflow, it is already a difficult problem to decide which (and how many) resources to use for mapping each task when optimizing performance-related objectives (throughput, response time, energy consumption) simultaneously with reliability and robustness objectives. Note that the problem becomes even more complex when several workflows are executed simultaneously and compete for resources. All workflows will then compete for CPU and network resources, and, for instance, replicating for speed and/or for reliability cannot be achieved concurrently for all applications.
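To make the chunk-size dilemma concrete, the following sketch computes the expected amount of work retrieved from a single remote computer as a function of the chunk size, showing that neither very small nor very large chunks are best. The linear interruption risk, the per-chunk overhead value and the unit-length workload are illustrative assumptions in the spirit of the model above; this is not the exact model analyzed in [4].

```python
# Sketch: expected retrieved work from one remote computer under a linear
# interruption risk, as a function of the chunk size. Illustrative model only.
import numpy as np

def expected_work(chunk: float, overhead: float = 0.02) -> float:
    """Workload and time horizon normalized to 1; the computer is interrupted
    at a time uniform on [0, 1] (linear risk). Each chunk costs chunk + overhead
    time units, and its work counts only if it completes before the interruption."""
    expected, sent, t = 0.0, 0.0, 0.0
    while sent + chunk <= 1.0 + 1e-12:      # do not dispatch more than the workload
        t += chunk + overhead               # completion time of this chunk
        if t >= 1.0:                        # cannot finish within the horizon
            break
        expected += chunk * (1.0 - t)       # P(interruption later than t) = 1 - t
        sent += chunk
    return expected

if __name__ == "__main__":
    chunks = np.linspace(0.01, 0.5, 50)
    values = [expected_work(c) for c in chunks]
    best = chunks[int(np.argmax(values))]
    print(f"best chunk size ~ {best:.2f}, expected work ~ {max(values):.3f}")
```

Very small chunks waste most of the horizon on overhead, while very large chunks risk losing a lot of work to an interruption; the scan exhibits an intermediate optimum.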

2.4.2 Reactive methods

Concerning the reactive approach, we will consider brokering with feedback. Scheduling and brokering are fundamental when one seeks performance in a distributed system. The common approach is to employ static scheduling heuristics that approximate an optimal schedule with respect to an objective function such as the makespan. Such an approach is no longer relevant on large-scale distributed platforms (grid computing platforms, enterprise networks, peer-to-peer systems). Statistical prediction for grid brokering is seldom used in modern grids (such as EGEE). In the past, we have successfully used stochastic models (which proved to be very robust to parameter changes) and law estimations to construct a broker providing up to a 20% throughput increase compared with standard protocols over a large range of loads [6, 5]. We believe that dynamic brokering for batch allocation in grids based on multi-dimensional index tables can be used in practice for computational grids, with or without knowing the job sizes. Furthermore, fast algorithms can be used (off-line or even on-line) to compute the index tables. Index routing policies prove to be very efficient and very robust with respect to parameter changes. They also allow one to assess the value of information by comparing the performance of indexes when the sizes of the jobs are known and when they are not. In this Action, we will improve the proposed method when the system sends feedback to the resource broker about its state (resource availability, software failures, hardware breakdowns, etc.).
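As a simplified illustration of index-based brokering with feedback, the toy broker below lets each cluster report its state and routes every incoming job to the cluster with the smallest index. The index used here (estimated completion time from reported backlog and speed) and all class and parameter names are assumptions for the sketch; this is not the index policy of [6, 5].

```python
# Sketch: a toy index-based broker with feedback. Each cluster reports its
# backlog (pending work) and speed; the broker routes a job to the cluster
# whose index (estimated completion time) is smallest. Illustrative only.
from dataclasses import dataclass

@dataclass
class Cluster:
    name: str
    speed: float           # work units processed per second
    backlog: float = 0.0   # pending work units (updated via feedback)
    available: bool = True

    def index(self, job_size: float) -> float:
        """Estimated completion time of the job on this cluster."""
        return (self.backlog + job_size) / self.speed

class Broker:
    def __init__(self, clusters):
        self.clusters = clusters

    def feedback(self, name: str, backlog: float, available: bool = True):
        """Called by the clusters to report their current state."""
        for c in self.clusters:
            if c.name == name:
                c.backlog, c.available = backlog, available

    def submit(self, job_size: float) -> str:
        candidates = [c for c in self.clusters if c.available]
        best = min(candidates, key=lambda c: c.index(job_size))
        best.backlog += job_size   # optimistic local update until the next feedback
        return best.name

broker = Broker([Cluster("A", speed=4.0), Cluster("B", speed=1.0)])
broker.feedback("A", backlog=20.0)    # A is fast but currently heavily loaded
print(broker.submit(job_size=2.0))    # routed to the cluster with the smaller index
```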

2.4.3 Proactive vs. reactive methods

Based on the above studies, we will identify the cases in which each approach is beneficial, and to what extent. Moreover, we will investigate whether a mixed approach (a static schedule further enhanced by a reactive method) can be used and provides good performance.

2.5 Experiments to analyze, test and compare our solutions

The goal of the previous sections was to provide models and solutions to the problem of uncertainties in large-scale systems. However, such models and solutions need to be validated in order to assess their viability, usability and performance. Unfortunately, such validation is a difficult challenge. Indeed, in this case, the target environments are very complex, large-scale, dynamic, and shared. Hence, a pure analytical approach is certainly not sufficient, and not always possible: it is required to implement the solutions and to experimentally compare their real behavior. Furthermore, model validation can only be done through experiments. Performing a good experiment is a difficult task and a scientific issue by itself: naive experiments on real platforms are often not reproducible, and it is hard to reuse such experiments in other or future contexts. Moreover, as the parameter space is very large, choosing good parameter settings is not trivial.

2.5.1 Model validation

Models are abstractions of reality. We will therefore work on assessing their domain of validity and their realism. The models developed in Section 2.3 will be validated by confronting their behavior with reality. We will compare the statistical and probabilistic models of grid usage with the real workloads given by the Grid Workloads Archive. Failure models will also be compared with real recorded behavior reported in the literature, such as for desktop grids [21]. In order to validate the models, we will use the following methodology (a minimal sketch is given after the list):

1. define error metrics (e.g., precision);
2. evaluate these metrics by varying the model parameters and comparing the results with the real measures;
3. statistically define the model error and domain of validity based on these comparisons.
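As a minimal sketch of such a comparison, one can fit a candidate model to observed inter-arrival times and quantify the discrepancy. The exponential inter-arrival model, the synthetic "observed" data and the use of a Kolmogorov-Smirnov statistic as the error metric are assumptions for illustration, not the metrics the Action will finally adopt.

```python
# Sketch: comparing an observed inter-arrival sample against a fitted
# exponential model, using the Kolmogorov-Smirnov statistic as a simple
# error metric. The "observed" data are synthetic, for illustration only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Stand-in for inter-arrival times extracted from a real trace
# (here: a Weibull sample, so the exponential model is only approximate).
observed = rng.weibull(a=0.8, size=5000) * 60.0

# Fit the candidate model (exponential: a single rate parameter).
rate = 1.0 / observed.mean()

# Error metric: KS distance between the empirical data and the fitted model.
ks_stat, p_value = stats.kstest(observed, "expon", args=(0, 1.0 / rate))
print(f"fitted rate = {rate:.4f} jobs/s, KS distance = {ks_stat:.3f}, p = {p_value:.3g}")
```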

2.5.2 Solution validation

We distinguish three methodologies: simulation, emulation and in-situ experiments. For each of these methodologies, there is a class of tools: simulators, emulators and real-scale environments, respectively. Many members of this ARC have a strong background in the experimental validation of models and algorithms. Some of the members participate in the Grid'5000 project, which we will use for real-scale experiments. When, due to the large configuration-setting space, real experiments are too time-consuming, we will use simulators such as SimGrid [11] (MESCAL & Algorille) or the DGsim simulator (TU Delft). When more realism is required, or when real testbeds are not heterogeneous enough, we will use an emulator of heterogeneity developed in the Algorille team, called Wrekavoc [9].

2.5.3 Result analysis

Based on the results of the model and solution validation, we will be able to identify the limits of the work done in the previous sections. The next step will be to overcome these limitations by looping again: refining the models and providing better and improved algorithms (see Figure 1).

3 Participants list

Team        Location                     Members
ALGORILLE   INRIA Nancy Grand-Est        Louis-Claude Canon (PhD), Emmanuel Jeannot (CR INRIA) – ARC leader
GRAAL       INRIA Grenoble Rhône-Alpes   Anne Benoit (MCF ENS-Lyon), Fanny Dufossé (PhD), Matthieu Gallet (PhD), Yves Robert (Pr ENS-Lyon)
MESCAL      INRIA Grenoble Rhône-Alpes   Bruno Gaujal (DR INRIA), Derrick Kondo (CR INRIA), Jean-Marc Vincent (MCF UJF)
TU Delft    Delft, Netherlands           Dick Epema (Pr), Alexandru Iosup (PhD)

4 Team Synergies

We have built a consortium that regroups the different skills required to successfully carry out this project:

Modeling (Major: Delft, MESCAL; Minor: GRAAL). The Delft team provides us with the Grid Workloads Archive. The MESCAL INRIA Project Team (IPT) has a strong background in statistical analysis and probabilistic modeling.

Algorithm design (Delft, GRAAL, ALGORILLE). The Delft team is specialized in designing scheduling algorithms for parallel environments with co-allocation. The GRAAL IPT is specialized in algorithms for large-scale systems. The ALGORILLE IPT is working on resource management algorithms with a recent emphasis on reliability [12] and robustness [10].

Experiments (Major: MESCAL, ALGORILLE; Minor: Delft). The ALGORILLE IPT has a research axis on experiments and model validation based on different methodologies [20] (simulation, emulation and real-scale). They co-develop the SimGrid simulator [11] with the MESCAL IPT. TU Delft participates in the DAS-3 project (the Dutch large-scale experimental testbed similar to Grid'5000).

5 Team Description

5.1 Algorille IPT – INRIA Nancy Grand-Est

The Algorille IPT is the coordinator of this Action (Emmanuel Jeannot). The main interest of Algorille concerns the algorithmic foundations of large-scale distributed systems. More precisely, three research axes are covered by this team: (1) structuring of applications for scalability: the goal is to provide modeling tools to handle the size, locality and granularity of computations and data; (2) transparent resource management: designing sequential and parallel task scheduling algorithms, and working on the migration of computations, data exchange, and the distribution and redistribution of data; (3) experimental validation: providing methodologies that enable the reproducibility, extendability and applicability of simulations, emulations and in-situ (real-scale) experiments.

5.2 TU Delft

The research area of the TU Delft grid team is resource management, and in particular scheduling, in large-scale distributed systems and in grids. In this area, TU Delft focuses on the following three research directions. First, we are building and analyzing the performance of the grid scheduler called KOALA, which we have deployed on the Dutch DAS grid system. Second, we build and test a set of grid resource management research tools, among which are the Grid Workloads Archive (GWA) and the DGsim grid simulator. Finally, we design and assess methods for improving the predictability of performance in grids.

5.3 GRAAL IPT – INRIA Grenoble Rhône-Alpes

The GRAAL IPT has two main research areas: (1) environments and tools for the deployment of applications in a client-server mode; (2) algorithm design and scheduling strategies for heterogeneous platforms. The present ALEAE proposal is related to the latter topic. We investigate scheduling problems that are of practical interest in the context of large-scale distributed platforms. We assess the impact of the heterogeneity and volatility of the resources on the scheduling strategies.

5.4 MESCAL IPT – INRIA Grenoble Rhône-Alpes

MESCAL (Middleware Efficiently SCALable) is concerned with large parallel systems and their exploitation for high performance computing. Our approach can be summarized by the following program: we want to understand the dynamics of large parallel systems so that we can design middleware and exploitation mechanisms making high performance computing applications easy and efficient on such infrastructures. Therefore, our project-team gathers researchers in the fields of discrete dynamic systems, optimization theory and parallel software design.

The goal of MESCAL is to design software solutions for the efficient exploitation of large distributed architectures at metropolitan, national and international scales. Our main applications are intensive scientific computations. Our methodology is based on: (1) stochastic modeling of large discrete-event systems, (2) performance evaluation and simulation of large deterministic and probabilistic systems, (3) middleware design, and (4) distributed system methods.

6 Budget

We plan to meet three times a year for a one-day meeting in order to exchange ideas and progress on the work plan: this will cost 500 € × 8 people × 3 meetings = 12 K€/year. We will also have several one-week visits, for which we ask 10 K€/year. We will also need a post-doc to work on the most time-consuming part, which is the design and the experimental validation of models. More precisely, the post-doc will (1) work on the traces given by the GWA, provide stochastic models of the archive, and then validate these models using statistical analysis, and (2) as a complementary part, experiment on Grid'5000 with the solutions proposed by the Action and validate the models and the algorithms used to design these solutions. The post-doc will preferably be hosted by the Algorille team but could also work within the MESCAL IPT. The position will require full-time work starting in the second year in order to interact with all the teams involved (42 K€). Total for the two years of this project: 44 K€ for missions and 42 K€ for the post-doc.

References

[1] Eitan Altman, Bruno Gaujal, and Arie Hordijk. Balanced sequences and optimal routing. Journal of the ACM, 47(4):752–775, 2000.

[2] Eitan Altman, Bruno Gaujal, and Arie Hordijk. Discrete-Event Control of Stochastic Networks: Multimodularity and Regularity. Number 1829 in LNM. Springer-Verlag, 2003.

[3] Anne Benoit and Yves Robert. Mapping pipeline skeletons onto heterogeneous platforms. J. Parallel Distributed Computing, 68(6):790–808, 2008.

[4] Anne Benoit, Arnold Rosenberg, Yves Robert, and Frédéric Vivien. Static strategies for worksharing with unrecoverable interruptions. Research Report 2008-29, LIP, ENS Lyon, France, October 2008.

[5] Vandy Berten and Bruno Gaujal. Brokering strategies in computational grids using stochastic prediction models. Parallel Computing, 2007. Special Issue on Large Scale Grids.

[6] Vandy Berten and Bruno Gaujal. Grid brokering for batch allocation using indexes. In EuroFGI NET-COOP, Avignon, France, June 2007. LNCS.

[7] V. Bharadwaj, D. Ghose, V. Mani, and T.G. Robertazzi. Scheduling Divisible Loads in Parallel and Distributed Systems. IEEE Computer Society Press, 1996.

[8] S.N. Bhatt, F.R.K. Chung, F.T. Leighton, and A.L. Rosenberg. On optimal strategies for cycle-stealing in networks of workstations. IEEE Trans. Computers, 46(5):545–557, 1997.

[9] Louis-Claude Canon and Emmanuel Jeannot. Wrekavoc: a Tool for Emulating Heterogeneity. In 15th IEEE Heterogeneous Computing Workshop (HCW'06), Island of Rhodes, Greece, April 2006.

[10] Louis-Claude Canon and Emmanuel Jeannot. Scheduling Strategies for the Bicriteria Optimization of the Robustness and Makespan. In 11th International Workshop on Nature Inspired Distributed Computing (NIDISC 2008), Miami, Florida, USA, April 2008.

[11] Henri Casanova, Arnaud Legrand, and Martin Quinson. SimGrid: a Generic Framework for Large-Scale Distributed Experiments. In 10th IEEE International Conference on Computer Modeling and Simulation, March 2008.

[12] Jack J. Dongarra, Emmanuel Jeannot, Erik Saule, and Zhiao Shi. Bi-objective Scheduling Algorithms for Optimizing Makespan and Reliability on Heterogeneous Systems. In 19th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA'07), San Diego, CA, USA, June 2007.

[13] Bruno Gaujal, Arie Hordijk, and Dinard Van der Laan. On the optimal open-loop control policy for deterministic and exponential polling systems. Probability in Engineering and Informational Sciences, 21:157–187, 2007.

[14] Bruno Gaujal, Emmanuel Hyon, and Alain Jean-Marie. Optimal routing in two parallel queues with exponential service times. In WODES, pages 193–198, Reims, 2004. IFAC.

[15] Bruno Gaujal and Jean-Marc Vincent. New Trends in Scheduling, chapter Comparison of stochastic task-resource systems. Taylor and Francis, 2009. To appear.

[16] A. Iosup, H. Li, M. Jan, S. Anoep, C. Dumitrescu, L. Wolters, and D.H.J. Epema. The Grid Workloads Archive. Future Generation Comp. Syst., 24(7), February 2008.

[17] Alexandru Iosup, Catalin Dumitrescu, Dick H. J. Epema, Hui Li, and Lex Wolters. How are real grids used? The analysis of four grid traces and its implications. In GRID, pages 262–269. IEEE Computer Society, 2006.

[18] Alexandru Iosup, Mathieu Jan, Omer Ozan Sonmez, and Dick H. J. Epema. The characteristics and performance of groups of jobs in grids. In Anne-Marie Kermarrec, Luc Bougé, and Thierry Priol, editors, Euro-Par, volume 4641 of Lecture Notes in Computer Science, pages 382–393. Springer, 2007.

[19] Alexandru Iosup, Omer Ozan Sonmez, Shanny Anoep, and Dick H. J. Epema. The performance of bags-of-tasks in large-scale distributed systems. In Manish Parashar, Karsten Schwan, Jon B. Weissman, and Domenico Laforenza, editors, Proceedings of the 17th International Symposium on High-Performance Distributed Computing (HPDC-17 2008), 23-27 June 2008, Boston, MA, USA, pages 97–108. ACM, 2008.

[20] Emmanuel Jeannot. Experimental Validation of Grid Algorithms: a Comparison of Methodologies. In Fifth High-Performance Grid Computing Workshop (HPGC'08), in conjunction with IPDPS 2008, page 8, Miami, FL, USA, April 2008. Invited article.


[21] Derrick Kondo, Filipe Araujo, Paul Malecot, Patrício Domingues, Luís Moura Silva, Gilles Fedak, and Franck Cappello. Characterizing result errors in internet desktop grids. In Euro-Par, pages 361–371, 2007.
