Planning Spatial Workflows to Optimize Grid Performance

Luiz Meyer (1), James Annis (2), Marta Mattoso (1), Mike Wilde (3), Ian Foster (3,4)

(1) Federal University of Rio de Janeiro – COPPE, Department of Computer Science
(2) Fermilab – Experimental Astrophysics
(3) Argonne National Laboratory – Mathematics and Computer Science Division
(4) University of Chicago – Department of Computer Science

Abstract. In many scientific workflows, particularly those that operate on spatially oriented data, jobs that process adjacent regions of space often reference large numbers of files in common. Such workflows, when processed using workflow planning algorithms that are unaware of the application’s file reference pattern, result in a huge number of redundant file transfers between grid sites and consequently perform poorly. This work presents a generalized approach to planning spatial workflow schedules for Grid execution based on the spatial proximity of files and the spatial range of jobs. We evaluate our solution to this problem using the file access pattern of an astronomy application that performs co-addition of images from the Sloan Digital Sky Survey. We show that, in initial tests on Grids of 5 to 25 sites, our spatial clustering approach eliminates 50% to 90% of the file transfers between Grid sites relative to the next-best planning algorithms we tested that were not “spatially aware”. At moderate levels of concurrent file transfer, this reduction of redundant network I/O improves the application execution time by 30% to 70%, reduces Grid network and storage overhead and is broadly applicable to a wide range of spatially-oriented problems.

1 Introduction

In many scientific domains, typified by astronomy, earth systems science, and geographic information systems, datasets represent the characteristics of some region of the sky, the atmosphere, or the earth. Analogous spatial dataset representations are often found in domains such as materials science, computational chemistry, and fluid dynamics, where the entities being modeled have spatial characteristics. These problem spaces may be of varying dimensionality, with spaces of degree 2 and 3 being common, but higher-order spaces are often used to represent additional attributes such as spectral characteristics or time. Various coordinate systems specific to the problem domain are employed, and the datasets of these domains are indexed (and often named and cataloged) by the spatial coordinates of the region to which the data pertains.

Spatial datasets are often processed by "gridding" the problem space into uniformly sized partitions. Applications that process spatial data are frequently partitioned into jobs that perform processing in manageable units of resources such as file storage, CPU, or RAM. Jobs are typically sized to make them small enough (in memory and duration) to be practical to run, yet large enough to reduce scheduling overhead to acceptable levels. Frequently, these jobs are chained, based on data or resource dependencies, into flow graphs, forming complex scientific workflows. Running such workflows on a Grid imposes many challenges due to the large number of jobs, file transfers, and amount of disk space needed to process them at distributed sites, and to the various tradeoffs in the scheduling of Grid resources.

The thesis of this work is that spatial workflows can best be planned for Grid execution using spatially-aware scheduling algorithms; that such algorithms are simple to develop and deploy; and that such scheduling algorithms have very general applicability to this popular and important class of application workflows.

The opportunity to exploit knowledge of spatial proximity to improve the performance of a Grid workflow is manifest in many application codes used by the Sloan Digital Sky Survey (SDSS) [16], a formidable astronomical survey project that is creating a publicly accessible database of about one quarter of the sky and measuring the characteristics of over 100M celestial objects. SDSS will also measure the distances to over one million galaxies and quasars.

Our study and experiments were motivated by the exceptionally long execution times we encountered for a specific workflow, the SDSS southern-hemisphere coaddition or "Coadd" [20,21], running on the Grid3 computing facility [10,22], which currently consists of over 30 sites and 4500 CPUs. This workflow, which has to date been run only once on Grid3, required approximately 70 days to complete and was executed with largely manual workflow planning. Our initial analysis of this workflow indicated that its performance difficulties stemmed from the fact that

a) most jobs reference a large number of files that are also referenced by jobs which process nearby regions of space (i.e., there is a high degree of data reference overlap between jobs), and

b) the workflow execution mechanisms employed contained no mechanism to control the amount of storage used per site.

These problems are not specific to this application: they are shared by every spatial processing application in which jobs process overlapping regions of their problem space. This situation is common to algorithms in which the value of the result at each point in the problem space depends on neighboring conditions; the distance to which such effects apply determines the degree of overlap. Often, modeling a larger degree of influence from adjacent regions of space increases the accuracy of results.

In such spatial problems, we are typically constrained by the following objectives. First, we want to leave files processed at sites at the end of jobs, so that adjacent jobs can use these files. At the same time, we want to remove files from sites as soon as they are no longer needed, as the amount of space needed to hold the input dataset places significant storage pressure on the sites.

We describe here a simple algorithm we have developed to cluster jobs by their spatial proximity and to schedule these clusters on Grid sites such that the problem space is processed in an order that attempts to take maximum advantage of files already present at each site. Our results show that this algorithm, which we name SPCL (for "spatial clustering"), reduces the number of file transfers between grid sites and improves overall workflow execution time. Our SPCL strategy takes advantage of data locality through the use of dynamic replication and schedules jobs in a manner that reduces the number of replicas created and the number of file transfers performed when executing a workflow.

We first evaluated the effectiveness of SPCL via simulation studies, conducted in a simulation environment comprising one submit host and multiple execution sites. Using the SDSS Coadd application as a representative of a broad range of spatial data processing applications with highly overlapping file reference patterns, we evaluated the file reference trace and performance of a complete run of the survey's southern hemisphere data, comprising 44,400 jobs with approximately 5.4 million file references. The results show that the SPCL planning algorithm, in initial simulations on Grids of 5 to 25 sites, and live tests on Grids of 4 and 10 sites, eliminates 50% to 90% of the file transfers between Grid sites relative to the next-best planning algorithms we tested that were not "spatially aware". At moderate levels of concurrent file transfer, this reduction of redundant network I/O improves the application execution time by 30% to 70%, reduces Grid network and storage overhead, and is broadly applicable to a wide range of spatially-oriented problems.

The rest of this work is organized as follows. Section 2 discusses related work that deals with grid scheduling and data placement. Section 3 describes the characteristics of the SDSS Coadd workflow that we consider in our experiments and the design of our clustering-by-access-affinity algorithm. Section 4 describes the simulation framework adopted in the experiments, while in Section 5 the experimental results are analyzed. Finally, Section 6 concludes this work and points to future directions.

2 Related Work

Clustering techniques have been widely used in many areas of computer science. Database management systems use clustering to store index and data blocks in close proximity; in this way, operations such as a join of two tables can be performed faster, since less I/O is required. Data mining also exploits clustering algorithms to group data: clustering applications include characterizing customer groups based on purchasing patterns, identifying groups of genes and proteins with similar functionality, and categorizing documents on the World Wide Web.

There are numerous studies addressing the benefits of data locality when processing jobs in a grid environment. However, the works we have studied to date (summarized here) treat jobs as isolated applications and do not deal with cases where jobs process a set of shared files.

Many works in the literature address the benefits of data replication in data grid scenarios. Casanova et al. [1] propose an adaptive scheduling algorithm for parameter sweep applications where shared files are pre-staged strategically to improve reuse. Ranganathan and Foster [13,14] evaluate a set of scheduling and replication algorithms and the impact of network bandwidth, user access patterns, and data placement on the performance of job executions. The evaluation was done in a simulation environment, and the results showed that scheduling jobs to locations where the input data already exists, and asynchronously replicating popular data files across the Grid, provides good results.

Cameron et al. [5,6] also measure the effects of various job scheduling and data replication strategies in a simulation environment. Their results show benefits of scheduling that takes into account the workload on the computing resources. However, this work assumes that each job requires a single input file, and the workload is generated by different users spread over different sites. An integrated replication and scheduling strategy is proposed and evaluated by Chakrabarti et al. [2]. As in the previously mentioned works, the workload scenario consists of a set of independent jobs submitted from different sites at different times; data replication is performed asynchronously, but jobs can access multiple files. The effect of transferring input/output data on overall grid performance is studied by Shan et al. [17]. However, that work assumes that all input files are resident on the host where the job is submitted and that files are not reused by different jobs. Mohamed and Epema [11] propose an algorithm to place jobs on clusters close to the site where the input files reside; the work assumes that all jobs require the same single input file, which can be replicated.

In our work, files are accessed by jobs belonging to a single workflow. Thus, when deciding whether to replicate files, we do not take into account accesses by other users. Files are temporarily replicated because they will be used by other jobs in the context of the same application. Although some files have more accesses than others, we do not proactively replicate them. The selection of a site to run a job is done based on an expectation that subsets of the required files will be in place. These replicas may be deleted once they are no longer required by a group of jobs; this deletion is done because it may not be feasible to retain the replicas during the execution of the whole workflow. We assume that we have advance knowledge of the entire workflow for a large-scale operation such as the coaddition of an entire sky region, and that we know the spatial coordinates of every job and every referenced data file, as an n-tuple in the coordinate space of the application. We note that these coordinates need not take on integral values. Further, unlike the approach taken in [14], we do not assume that specific files will grow in average popularity over the lifetime of the run, but rather that such bursts in file popularity will be highly localized, transient, and due to file reference overlaps.

Singh et al. cluster jobs to improve workflow execution [18]. In their work, the authors discuss different configurations of the workflow engine and their impact on the performance results. They also show that restructuring a workflow can reduce the completion time in applications where jobs do not present data dependencies. Our work differs by implementing a clustering approach whose goal is to improve performance through the reduction of redundant file transfers, while scheduling work in a manner that matches the sites' storage capacity.

3 Clustering by Spatial Proximity

One possible approach to processing spatial workflows is first to transfer the input files needed by a job dynamically to a site and then, once the processing has finished, to erase the input files. This strategy requires minimal disk space for processing the workflow, but at the expense of the number of file transfers between the source and destination sites, which can be huge. Since the time needed to transfer all the input files can sometimes be greater than the time needed to run a job, this strategy, while space-efficient, typically results in long run times. Replication of all needed files in advance can also be employed, but may not be feasible if there is insufficient disk space at the sites. The approach of caching files while there is available disk space at the sites helps to reduce transfers, but as this strategy is not spatially aware, our simulations show that it does not perform as well as SPCL.

Clusters can be computed in several ways. Given a workflow definition, the set of jobs, files, and dependencies can be extracted and passed to a clustering algorithm that applies heuristics to produce the clustering definition. Figure 1 illustrates this approach. In cases where the workflow is formed by several pipelines, as exemplified in Figure 2, this design is simple and intuitive, clustering together all jobs that are pipelined in the workflow. If a job that reads a file is placed at the same site as the job that produced the file, no file transfer is needed.


Fig. 1. Clustering definition process

Fig. 2. Workflow with pipelined components
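To make the pipeline-clustering idea above concrete, the sketch below (an illustration in Python, not the authors' implementation; all job and file names are hypothetical) groups jobs that are connected by producer-consumer file dependencies using a union-find structure, so that each chain of pipelined jobs ends up in one cluster and can be scheduled to a single site.

```python
from collections import defaultdict

def cluster_pipelines(consumes, produces):
    """Group jobs connected by producer/consumer file dependencies.

    consumes: dict mapping job id -> set of input file names
    produces: dict mapping job id -> set of output file names
    Returns a list of job clusters (each a set of job ids).
    """
    jobs = set(consumes) | set(produces)
    parent = {j: j for j in jobs}

    def find(j):
        while parent[j] != j:
            parent[j] = parent[parent[j]]   # path halving
            j = parent[j]
        return j

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    # Map each file to the job that produces it.
    producer = {f: j for j, files in produces.items() for f in files}

    # A job that reads a file is merged with the job that wrote it, so whole
    # pipelines land in the same cluster (and hence at the same site).
    for job, files in consumes.items():
        for f in files:
            if f in producer:
                union(job, producer[f])

    clusters = defaultdict(set)
    for j in jobs:
        clusters[find(j)].add(j)
    return list(clusters.values())

# Hypothetical three-stage pipelines: a1 -> b1 -> c1 and a2 -> b2 -> c2.
produces = {"a1": {"f1"}, "b1": {"f2"}, "c1": set(),
            "a2": {"g1"}, "b2": {"g2"}, "c2": set()}
consumes = {"a1": {"raw1"}, "b1": {"f1"}, "c1": {"f2"},
            "a2": {"raw2"}, "b2": {"g1"}, "c2": {"g2"}}
print(cluster_pipelines(consumes, produces))  # two clusters of three jobs each
```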

The main goal of the SPCL clustering algorithm is to reduce the number of file transfers between grid sites. The idea is to create clusters of jobs by spatial proximity and, during the execution of the workflow, to assign all jobs belonging to a cluster to the same site. In spatially-oriented applications, the ability of a data-flow-driven algorithm to automatically detect file-reference clusters is more challenging, because it must discover how jobs are related to each other based on their file access pattern. This process can be computationally intensive, and it does not take advantage of the easier-to-process information present in the correlated coordinate systems of the jobs and the members of the dataset. On the other hand, if there is little or no file sharing between jobs, the SPCL approach is of little use; Figure 3 illustrates this situation. However, when there is a high degree of overlap in the dataset processed by adjacent jobs in the same region of space, as shown in Figure 4, clustering jobs based on spatial proximity can yield dramatic performance improvements.

Fig. 3. Jobs without file overlapping (J = job, f = file)

Fig. 4. Jobs with high file overlapping (J = job, f = file)

The SDSS coaddition workflow consists of 44,400 jobs that operate on 2.5 Terabytes of input data stored in 544,500 distinct files, performing 5,419,370 file references. On average, each file is referenced by approximately 10.1 different jobs. The portion of the SDSS dataset processed by the coadd consists of image data and is cataloged on a coordinate grid of three dimensions: right ascension, declination, and spectral filter. Each file in this portion of the survey is associated with (ra,dec) coordinates that are gridded based on the survey capture and filtering techniques: adjacent files along the same ra value vary only in declination, and vice-versa. These input files contain images taken from regions of the sky that span 12 survey declination grid points and 740 right ascension grid points, in 5 spectral regions (filter values). For simplicity, we treat ra, dec, and filter as if they were coordinates (X, Y, Z) in a three-dimensional space. Each job operates on a region centered at a specific (ra,dec,filter) value and processes files related to adjacent (ra,dec) grid points.

The pattern in which coadd jobs traverse their subspace within the survey to process their files determines how the clusters should be created. Such spatial patterns may obviously vary widely among different applications and algorithms in dimensionality, pattern, stride, etc. For instance, in a two-dimensional space, one application may have jobs processing files spread over only one dimension, while other applications may be characterized by jobs processing files distributed over both dimensions. Various versions of the SDSS Coadd algorithm have, in fact, utilized both of these "sweep" patterns to traverse their regions of interest in the survey. Our studies have focused on the more challenging two-dimensional sweep. Figure 5 depicts spatial cluster patterns in one and two dimensions. The Coadd jobs we studied traverse files in both the ra and dec dimensions.

The SPCL algorithm for clustering jobs and files together was trivial, performed by a simple division of the overall 3-space into approximately uniformly sized partitions. The z (filter) dimension was treated as a dependent variable of (ra,dec): within each cluster, all the filter values for jobs and data files at a given (ra,dec) coordinate were included in the same cluster. The majority of files are referenced by jobs spanning 3 declination grid points and 8 ra grid points. Thus we generated clusters with these dimensions and assigned each job to a cluster based on the value of both attributes. As a result, 372 clusters were created: 368 clusters with 120 jobs each, and 4 with 60 jobs each.

Two important metrics to evaluate the quality of the generated clusters for this kind of problem are the total number of file transfers needed when executing the workload and the size obtained for each cluster. The importance of the first aspect is obvious, since the goal of clustering in this case is to reduce the number of file transfers among the grid sites. Having a small number of clusters can lead to a small number of file transfers, but can also produce clusters of infeasible size, that is, clusters too large to fit at the sites. Therefore SPCL must generate

clusters such that the disk space required to process each cluster's jobs can be allocated at the available Grid sites. Figure 6 shows the sizes obtained for the clusters when using SPCL.

Fig. 5. The spatial cluster definition (J = job, f = file)

Fig. 6. Size of generated clusters (GB)
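As a concrete illustration of this partitioning, the minimal sketch below assumes that ra, dec, and filter grid points can be addressed by zero-based integer indices; the function and constant names are ours, not taken from the Coadd code.

```python
RA_POINTS_PER_CLUSTER = 8    # observed reference footprint: ~8 ra grid points
DEC_POINTS_PER_CLUSTER = 3   # ~3 dec grid points; filter is treated as dependent
N_RA_POINTS, N_DEC_POINTS, N_FILTERS = 740, 12, 5   # 740 * 12 * 5 = 44,400 jobs

def spcl_cluster_id(ra_index, dec_index, filter_index=None):
    """Map a job (or file) at integer grid coordinates to an SPCL cluster id.

    filter_index is ignored: all 5 filters at a given (ra, dec) share a cluster.
    """
    ra_block = ra_index // RA_POINTS_PER_CLUSTER
    dec_block = dec_index // DEC_POINTS_PER_CLUSTER
    n_dec_blocks = N_DEC_POINTS // DEC_POINTS_PER_CLUSTER   # 4 blocks in dec
    return ra_block * n_dec_blocks + dec_block

# 740 ra points in blocks of 8 give 93 ra blocks; 12 dec points in blocks of 3
# give 4 dec blocks, i.e. 93 * 4 = 372 clusters, matching the text
# (368 full clusters of 8*3*5 = 120 jobs and 4 partial clusters of 60 jobs).
n_clusters = len({spcl_cluster_id(ra, dec)
                  for ra in range(N_RA_POINTS) for dec in range(N_DEC_POINTS)})
print(n_clusters)  # 372
```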

We have, to date, performed a preliminary evaluation of the effectiveness of the SPCL algorithm using a combination of discrete event simulation and initial verification of these runs on the Grid3 computing facility. We first describe the simulator and then present our experimental results.

4 The Simulation Framework

We model the Grid environment as a set of sites, each having a fixed number of processors and a fixed amount of total disk storage space. We assume that at each site all processors have the same performance and have access to all disk files stored at that site (i.e., a shared filesystem). On a separate submit host, external to the Grid, a workflow manager controls the scheduling of the jobs to the sites. A replica location service (RLS) keeps track of the location of file replicas spread over the Grid. The model is based on that used in studies [13,14,15].

In the workflow, each file has a unique filename and is characterized by its size. Each job is characterized by a unique job identifier and the list of files it needs to process. In order to execute, a job needs its set of input files to be available at the site selected for its execution. We assume (as in practice) that all input data files required by the workflow initially reside on an SDSS archive server. In our studies to this point, we disregard the transfer back of output results, as in the specific application that motivated this work the transfer of output data was a negligible portion of the overall execution time and load on the Grid. (Of course, for general applicability, this will need to be considered in further work on this problem.)

As in Condor/DAGMan [4], a "prescript" and a "postscript" step are associated with each workflow job. The prescript and postscript steps are responsible for transferring input files and deleting output files, respectively, so we shall refer to them as "transfer" and "cleanup" steps. The number of transfer steps that can be started concurrently (across all jobs) by the submit host is controlled by the DAGMan parameter MAXPRE, which serves as a convenient workflow-wide throttle on the data transfer load that the workflow manager can impose on the Grid from the submit host. Figure 7 depicts the Grid environment modeled by our simulator.

We modeled the behavior of this Grid using discrete event simulation. Initial experiments were performed on a simulator built on the Parsec simulation framework. This framework, while powerful, proved slow, and enhancing its performance was difficult. After the initial simulation algorithm was tested in Parsec, we recoded the simulator in Perl and tailored its algorithms and data structures to the needs of the Grid elements in our simulation domain. Figure 8 details the architecture of the submit host in the simulation environment.

The execution of a job is simulated in the following stages, while a simulation clock keeps track of elapsed time and drives the event loop. First, the workflow manager takes the identification of the job and its list of required files and starts the input file transfer processing for that job. The transfer (prescript) processing is responsible for two tasks: deciding where to run the job and transferring the needed input files to that site. The decision about where to run a job is based on a selectable scheduling algorithm, while the transfer task is performed based on information from the replica location service (RLS). The job is then sent to a site, waits in a local scheduler queue for a free CPU, and then runs for a specified amount of time. After job execution, the workflow manager starts the postscript processing, which is responsible for registering or deleting the new replicas in the RLS, depending on the replica policy being enforced.
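A minimal sketch of the prescript (transfer) step as modeled above, written in Python purely for illustration (the simulator itself was written in Perl; the data structures, names, and file names here are our assumptions, not the actual simulator code):

```python
def run_prescript(job_files, rls, site_selector, transfer_rate_mb_s=1.0):
    """Illustrative prescript (transfer) step for one job.

    job_files: dict mapping input file name -> size in MB
    rls: dict mapping file name -> set of sites that hold a replica
    site_selector: callable(job_files, rls) -> chosen execution site
    Returns (site, number_of_transfers, transfer_time_seconds).
    """
    site = site_selector(job_files, rls)
    transfers, transfer_time = 0, 0.0
    for fname, size_mb in job_files.items():
        if site not in rls.get(fname, set()):
            # The file is missing at the chosen site: stage it in. For
            # simplicity we register the replica here; in the paper's model
            # the postscript registers or deletes replicas per the policy.
            transfers += 1
            transfer_time += size_mb / transfer_rate_mb_s
            rls.setdefault(fname, set()).add(site)
    return site, transfers, transfer_time

# Example: a job needing two 5 MB files, one of which is already at "site_a".
rls = {"img_001.fits": {"archive", "site_a"}, "img_002.fits": {"archive"}}
print(run_prescript({"img_001.fits": 5.0, "img_002.fits": 5.0},
                    rls, lambda files, rls: "site_a"))
# -> ('site_a', 1, 5.0): only the missing file is transferred (5 s at 1 MB/s).
```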

Metrics taken from Grid3 calibration runs were used to set various simulation parameters. Wide-area network transfer rates across Grid3 were assumed to be 1 MB/sec, which corresponded to the average observed across 10 varying-sized Grid3 sites with default GridFTP parameters and single-TCP-stream transfers.

Fig. 7. The Grid Simulation Environment

Fig. 8. Architecture of the submit host

5 Experiments

Our experimental approach was as follows. We began by studying the file reference trace of the Coadd run performed in 2004 by James Annis and Vijay Sekhri of SDSS-Fermilab, on the resources of Grid3. The parameters of this 44,400-job workflow were presented above. We then performed simulations of the expected Grid performance of this workflow on grids of various sizes, with different levels of concurrent transfers, and with four different workflow planning (site selection) algorithms. These are described in more detail below.

5.1 Methods

We modeled all of our test Grids on a large subset (25) of the 30+ sites of Grid3, using the table of site parameters listed below. Each simulated test Grid of size N (N = 5, 10, 15, 20, 25) consisted of the first N sites listed in this table. The measurements presented here were based on 60 simulation runs, representing the combinations of the 5 Grid sizes, the 3 levels (20, 40, and 60) of "prescript" (transfer and site-selection) concurrency, and the four execution strategies described below.

We then began a number of Grid runs with the same parameters, but faced several technical and engineering obstacles in obtaining these measurements. Most notably, the GridFTP data transfer service deployed in Grid3 at the time of this writing is based on a now-obsolete version of code which contained defects that interfered with data transfer at the rates we were attempting to benchmark (up to 60 x 4 concurrent file transfers from a single server to multiple destinations). This problem was remedied by implementing a user-deployed overlay of Globus Toolkit 4.0 GridFTP servers, far more robust than the version deployed in Grid3, to perform data transfers. (We note that the new server performed flawlessly in all our tests, but this overlay testbed was not completed until shortly before the submission of this paper. A final paper will contain a significantly larger set of real results to validate the simulations.)

We assume that each input file is approximately 5 Megabytes in size, is initially located at one site, and takes five seconds to be transferred. Each job runs for 15 minutes. We performed experiments with 5, 10, 15, 20, and 25 sites. The resources of the sites are described in Table 1; the values were taken from sites in Grid3 [10]. The first column indicates the number of the site, and the following columns show the amount of disk and the number of processors that exist at each site. We assume a target of obtaining 10% of the resources at each site to define our Grid simulation environment. This assumption is based on the following execution strategy: for disk space, the SPCL algorithm ensures that the space used at any given site remains under 10% of the total disk space of the site; for CPU allocation, we assume an approach where fixed-size pools of CPUs are held dedicated at each site for the duration of

the workflow, in a manner similar to that implemented by the Condor "Glide-In" facility. We modeled this glide-in approach in the simulator and implemented, for the initial live-Grid3 runs described below, a run-time simulation of glide-in execution by simply running a "mock" job on the submit host, with accurately simulated queuing. Thus, the live-Grid runs represent actual data transfer figures and a closely-simulated run time for glide-in CPU allocation, but without consuming actual CPU resources on the production Grid. A similar resource-saving shortcut was used for source files: rather than doing the initial tests on a live SDSS archive server, we created a replica catalog that pointed to a single dummy file on a testbed GridFTP server for all of the thousands of cataloged test files. This permitted measurement of the actual, live data transfer times without consuming precious SDSS archive server bandwidth.

Table 1. Simulation parameters for the grid sites

Site  Disk (GB) Grid3  CPUs Grid3  Disk (GB) Simul.  CPUs Simul.
  1        958            720            95              72
  2        865            516            86              51
  3        200            400            20              40
  4       2170            350           217              35
  5       1627            312           162              31
  6       1130            296           113              29
  7        647            272            65              27
  8       1365            158           139              16
  9        591            136            59              13
 10        976            100            97              10
 11       1712             82           171               8
 12       2172             80           217               8
 13        879             80            88               8
 14       1190             76           119               8
 15       1723             64           172               7
 16        569             60            57               6
 17       1190             55           119               6
 18        243             42            24               4
 19       1712             42           172               4
 20        109             40            11               4
 21         53             40             6               4
 22        976             32            98               3
 23       1311             20           131               2
 24         55             19             5               2
 25       1294             12           130               2

The workflow manager starts the prescript process for each job at an interval of one second. When the number of running prescripts reaches the maximum allowed, the workflow manager must wait for one of the running prescripts to finish before starting a new one. In the same way, when a job is submitted, if there are no available processors at the execution site, the job waits in the remote queue. We consider that the pool of processors at each site remains available and is used exclusively by the application during the entire workflow execution. The total time to run a job is computed by summing the time in the queue on the submit host, the time to transfer the necessary input files, the time in the queue at the execution site, and the time to run the job. We analyze in our experiments both (i) the total number of file transfers and (ii) the total workflow execution time. The number of file transfers is important from the perspective of overall resource use and also for its impact on performance. The overall completion time is important for users who want to run their workloads in a short period of time.
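Written out explicitly (our notation, summarizing the sentence above):

```latex
T_{\text{job}} = T_{\text{submit queue}} + T_{\text{transfer}} + T_{\text{site queue}} + T_{\text{execution}}
```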

5.2 Execution Strategies

We compare the results according to two external scheduling algorithms combined with two replication policies. We assume that at each site there is a local scheduling policy, which in our study is set to FIFO (first in, first out), and a file deletion policy, which is set to LRU. Thus, the workflow can be executed based on one of the following four execution/replication strategies:

1. Random. The execution site is selected randomly. Input files are replicated to the execution site but do not become available to other jobs. Once a job finishes, the transferred input files are deleted.

2. Clustered. The execution site is selected based on the job identification. Each job is assigned to a cluster as a consequence of the cluster design, and each cluster is assigned to a site. The number of clusters per site is obtained by dividing the number of clusters by the number of sites. Input files are replicated to the execution site.

3. RandomCaching. The execution site is selected randomly. Input files are replicated to the execution site and become available for other jobs.

4. DataPresent. Jobs are sent to the site holding the most of the files that they need. If more than one site qualifies, a random one among them is chosen.

Jobs are scheduled to execute at random when running with strategies 1, 3, and 4. With strategy 2, jobs from different clusters are not scheduled to run at the same time at the same site.
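A minimal sketch of two of these site-selection policies (illustrative Python; `rls` is our shorthand for the replica location service's file-to-sites map, and the function names and example data are ours):

```python
import random

def select_site_data_present(job_files, rls, sites):
    """DataPresent policy from the list above: choose the site that already
    holds the most of the job's input files; break ties at random."""
    def files_present(site):
        return sum(1 for f in job_files if site in rls.get(f, set()))

    best = max(files_present(s) for s in sites)
    return random.choice([s for s in sites if files_present(s) == best])

def select_site_random(job_files, rls, sites):
    """Random / RandomCaching policies: the site is chosen at random; the two
    strategies differ only in whether staged files are kept for reuse."""
    return random.choice(sites)

# Example with hypothetical sites and files:
rls = {"f1": {"siteA"}, "f2": {"siteA", "siteB"}, "f3": set()}
print(select_site_data_present(["f1", "f2", "f3"], rls,
                               ["siteA", "siteB", "siteC"]))
# -> 'siteA' (it already holds two of the three files)
```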

5.3 Results

We start by analyzing the impact of the four execution strategies on the total number of file transfers performed. As can be observed in Figure 9, executing the Coadd workflow according to strategies 1 and 3 is extremely expensive in terms of file transfers, requiring almost 5.5 million transfers. Allowing for reuse of input files with RandomCaching results in only a minor reduction in the number of transfers. The cluster approach reduces the number of file transfers to approximately 550,000 when running with the first five and ten sites. The assignment of clusters to sites did not take into account the resources available per site; thus, when we ran the workflow with 15, 20, and 25 sites, sites with minimal storage availability were added to the pool of running sites, causing the number of file deletions to increase and consequently increasing the number of transfers. The DataPresent strategy also caches data and takes advantage of the total amount of storage available, showing a significant reduction in the number of file transfers when running with more sites in the pool. Figures 10, 11, and 12 show the performance of the four strategies when running the simulator with 20, 40, and 60 prescript jobs, respectively.

Fig. 9. Total number of file transfers according to the execution strategies

Fig. 10. Execution time for MAXPRE=20

Fig. 11. Execution time for MAXPRE=40

Fig. 12. Execution time for MAXPRE=60

The figures show similar results for the Clustering strategy with 20, 40, and 60 prescript jobs. Clustering allows file reuse, reducing the number of file transfers and consequently the time to run the prescript step. Transferring fewer files for a job brings performance benefits only if processors are available to run the job by the time the transfers end. Increasing the number of prescripts in this strategy causes more jobs to become ready to run but then wait for free processors, because the time to run a job is higher than the time to perform its few remaining transfers. It can also be observed that performance decreases for the Clustering strategy as the number of sites grows. This result is a consequence of adding sites with fewer processors to the execution pool: since clusters are assigned to sites without considering the resources available, sites with few processors receive the same load as sites with more resources, causing jobs to wait for free processors.

Increasing the number of prescript jobs improves the performance of Random and RandomCaching. For these two strategies, the cost of transferring more files dominates the execution time when the number of prescripts is 20; when more file transfers are performed in parallel, the execution time for both strategies improves. However, running a high number of prescript jobs may not be feasible if jobs perform a large number of file transfers, since many simultaneous requests can overload the GridFTP server. Increasing the number of sites in the pool does not yield better execution times for either strategy; on the contrary, performance with 25 sites is worse than with 5, 10, 15, or 20 sites at 40 and 60 prescripts.

With MAXPRE set to 20, the DataPresent strategy shows improvements until the number of sites reaches 15. Beyond this point, adding sites brings no benefit for any number of prescripts, and increasing the number of prescripts to 60 only improves performance when the number of sites is five. For all strategies, the results show that adding sites with few resources does not improve performance when the site selection mechanism has no knowledge of the resources available at each site.

Finally, we have just begun the validation of these simulation results on live Grid3 resources, using the methodology described above. Two live comparison runs have been completed. In the first, a 5% sample of the full 44K-job coadd workflow was run on 4 Grid sites, using full file sizes and job durations. This run was performed for two of the four algorithms, SPCL (labeled "Cluster" in Figs. 9-11) and Random. Simulation results show a run completion time of 22.83 hours for the Random algorithm and 18.80 hours for SPCL; measured results were 25.00 hours for Random and 22.25 hours for SPCL. In a second test, we compared two smaller runs across a larger Grid of 10 sites. This scaled-down test used 1000 jobs of 3 minutes each and a data file size of 1 MB. Simulation results yielded 78 minutes for the Random algorithm and 37 minutes for SPCL; measured results were about 1.3 times longer: 106 minutes for Random and 48 minutes for SPCL. While highly preliminary, these initial live Grid tests offer encouraging indications that SPCL holds promising potential for speeding up spatial data processing.

Fig. 13. Reduction in data transfer operations with 4 sites, 2220 jobs x 15 min, 5 MB data files. Top: random placement; bottom: spatial clustering with SPCL. Workflow duration: SPCL, 21 hours 40 minutes; Random, 25 hours (this random run was terminated for operational needs at 22:15, after I/O completion, and extrapolated).

6 Conclusions and Future Work

We presented a clustering approach to executing workflows in Grid environments and evaluated the proposed solution with a real file access pattern in a simulation environment. The results show that clustering jobs by file affinity reduces the total number of file transfers. This reduction benefits the Grid as a whole by reducing traffic between sites, and it benefits the application by improving its performance. We believe that the same approach can be adopted by other scientific applications that share these spatial-processing characteristics.

These encouraging results lead us to plan a full validation of the clustering approach on a real Grid. We also intend to integrate this approach with other site selection algorithms that take into account the resource availability at the sites when scheduling jobs. Furthermore, we plan to apply job clustering techniques to other workflow patterns. For example, a pattern frequently observed in scientific workflows is the presence of several independent pipelines in which jobs communicate with their preceding and succeeding jobs via data files. We intend to exploit the idea of clustering by DAG transformation, grouping together the jobs that form the pipeline portions of the workflow. Such a transformation can reduce the number of file transfers and also the total number of job submissions.

References

1. Casanova, H., Obertelli, G., Berman, F., Wolski, R., The AppLeS Parameter Sweep Template: User-Level Middleware for the Grid, in SC'2000, Denver, USA, 2000.
2. Chakrabarti, A., Dheepak, R.A., Sengupta, S., Integration of Scheduling and Replication in Data Grids, in Proceedings of the 11th International Conference on High Performance Computing, Bangalore, India, December 2004.
3. Cluto, http://www-users.cs.umn.edu/~karypis/cluto/
4. DAGMan, http://www.cs.wisc.edu/condor/dagman/
5. Cameron, D.G., Carvajal-Schiaffino, R., Millar, A.P., Nicholson, C., Stockinger, K., Zini, F., Evaluating Scheduling and Replica Optimisation Strategies in OptorSim, in Proc. of the 4th International Workshop on Grid Computing (Grid2003), Phoenix, USA, November 2003.
6. Cameron, D.G., Carvajal-Schiaffino, R., Millar, A.P., Nicholson, C., Stockinger, K., Zini, F., Evaluation of an Economic-Based File Replication Strategy for a Data Grid, in Int. Workshop on Agent Based Cluster and Grid Computing at the Int. Symposium on Cluster Computing and the Grid (CCGrid2003), Tokyo, Japan, May 2003.
7. Foster, I., Kesselman, C., Chapter 2 of "The Grid: Blueprint for a New Computing Infrastructure", Morgan-Kaufman, 1999.
8. Foster, I., Voeckler, J., Wilde, M., Zhao, Y., Chimera: A Virtual Data System for Representing, Querying, and Automating Data Derivation, in 14th International Conference on Scientific and Statistical Database Management (SSDBM 2002), Edinburgh, July 2002.
9. Foster, I., Voeckler, J., Wilde, M., Zhao, Y., The Virtual Data Grid: A New Model and Architecture for Data-Intensive Collaboration, in Conference on Innovative Data Systems Research, Pacific Grove, January 2003.
10. Grid3, An Application Grid Laboratory for Science: http://www.ivdgl.org/grid3
11. Mohamed, H.H., Epema, D.H.J., An Evaluation of the Close-to-Files Processor and Data Co-Allocation Policy in Multiclusters, IEEE International Conference on Cluster Computing, San Diego, USA, September 2004.
12. Ranganathan, K., Foster, I., Identifying Dynamic Replication Strategies for a High-Performance Data Grid, in Proceedings of the International Workshop on Grid Computing, Denver, USA, November 2001.
13. Ranganathan, K., Foster, I., Decoupling Computation and Data Scheduling in Distributed Data Intensive Applications, in Proceedings of the 11th International Symposium on High Performance Distributed Computing (HPDC-11), Edinburgh, July 2002.
14. Ranganathan, K., Foster, I., Simulation Studies of Computation and Data Scheduling Algorithms for Data Grids, in Journal of Grid Computing, 1(1), 2003.
15. Ranganathan, K., Foster, I., Computation Scheduling and Data Replication Algorithms for Data Grids, in "Grid Resource Management: State of the Art and Future Trends", J. Nabrzyski, J. Schopf, and J. Weglarz, eds., Kluwer Academic Publishers, 2003.
16. SDSS Project, https://www.darkenergysurvey.org/the-project.
17. Shan, H., Oliker, L., Smith, W., Biswas, R., Scheduling in Heterogeneous Grid Environments: The Effects of Data Migration, International Conference on Advanced Computing and Communication, Gujarat, India, 2004.
18. Singh, G., Kesselman, C., Deelman, E., Optimizing Grid-Based Workflow Execution, work submitted to the 14th IEEE International Symposium on High Performance Distributed Computing, July 2005.
19. Thain, D., Bent, J., Arpaci-Dusseau, A., Arpaci-Dusseau, R., Livny, M., Pipeline and Batch Sharing in Grid Workloads, 12th Symposium on High Performance Distributed Computing, Seattle, June 2003.
20. DES Project Site: Next Gen Coadd Code – How To Run. Project notes of James Annis: https://www.darkenergysurvey.org/the-project/simulations/sdss-grid-coadd/Next-Gen-Coadd.html
21. Sekhri, V., Lessons Learned on Summer 04 Grid SDSS Coadd: https://www.darkenergysurvey.org/the-project/simulations/sdss-grid-coadd/summer-04-grid-coadd
22. Foster, I., et al., The Grid2003 Production Grid: Principles and Practice, HPDC 2004, Honolulu, Hawaii, USA, June 2004.
