Job-resource matchmaking on Grid through two-level benchmarking *

A. Clematis 1, A. Corana 2, D. D'Agostino 1, A. Galizia 1, A. Quarati 1
1 IMATI-CNR, Via De Marini 6, 16149 Genova, Italy
2 IEIIT-CNR, Via De Marini 6, 16149 Genova, Italy
* Corresponding author. Tel: +39 0106475673; Fax: +39 0106475660. E-mail address: [email protected]
Abstract Grid environments must provide effective mechanisms able to select the most adequate resources satisfying application requirements. A description of applications and resources, grounded on a common and shared basis, is crucial to favour an effective pairing. A suitable criterion to match demand with supply is to characterize resources by means of their performance evaluated through the execution of low-level and application-specific benchmarks. We present GREEN, a distributed matchmaker, based on a two-level benchmarking methodology. GREEN facilitates the ranking of Grid resources and the submission of jobs to the Grid, through the specification of both syntactic and performance requirements, independently of the underlying middleware and thus fostering Grid interoperability. Keywords: Grid matchmaking, Benchmark-driven resource selection, Middleware interoperability, Job submission languages extensions
1 Introduction
Computational Grids comprise a large number of machines, as a rule heterogeneous and belonging to different administrative domains, each providing multiple resources with a variety of properties and configuration settings [1]. Both Grid administrators and users need powerful but simple tools to set up, administer and exploit Grid platforms. To operate correctly, these tools have to rely on consistent and efficient functionalities able to discover the available resources and services, and to describe their properties and current state. This information is essential to guarantee that the job submitted by a user will be forwarded to the most appropriate resources. A matchmaking component (broker, matchmaker, ...) is responsible for accomplishing this supply-demand coupling process [2].
Essentially, matchmaking is a two-phase task over resources. First, a syntactic phase checks the correspondence between job requirements and resources, based on a (possibly) uniform and intelligible description of the two classes, and produces a set of feasible solutions. Then, semantically relevant aspects of resources (e.g. performance, QoS metrics) are examined by the broker to filter out the solutions that best fit application requirements (ranking phase). This second step, though not mandatory, is quite important to support well-informed scheduling policies, taking into due account the different (and often diverging) needs of resource owners and concurrent resource consumers. To order the set of feasible solutions on the basis of performance criteria, each resource description should include suitable information (with respect to codified and agreed metrics) about the resource behaviour when executing different kinds of well-known applications.
A widespread method to measure and evaluate the performance of computer platforms is benchmarking [3]. Though widely used, well-established benchmarks have the main drawback of a one-size-fits-all attitude: the description of system performance they provide is often partial, and therefore not completely suitable to specific application requirements. Depending on the application domain, a job may benefit from a faster CPU rather than from a larger memory size or more advanced I/O devices, or from a combination of various system components. A more accurate answer to a user's specific needs should take into account some of the proper characteristics of the application at hand. By analysing the system indicators most stressed by the application, under some typical and known conditions (e.g. input data size, degree of parallelism, average bandwidth consumption), it should be possible to figure out the behaviour of the system when executing the application in similar future scenarios. This predictability may be helpful to evaluate the fitness of a collection of Grid resources for executing that application [4,5] and plays a significant role in the matchmaking process. Nevertheless, so far application benchmarks have not been much considered in Grid environments, due to the diversified types of applications, the architectural complexity, the dynamic Grid behaviour and the heavy computational costs [6]. It should however be considered that very often the participants in a Virtual Organization (VO) have similar aims, and therefore it is possible to identify a set of most used applications. Our aim is twofold.
1) To introduce a two-level benchmarking methodology able to meet user needs with a satisfactory description of Grid resources, based on relevant performance metrics stressed both by basic and application
benchmarks. To evaluate our methodology, the former have been chosen amongst a collection of widely experimented micro-benchmarks [7-11]; for the latter, a set of application benchmarks related to some applications developed by the authors in the fields of image processing [12], isosurface extraction [13], and linear algebra have been considered. It is worth noting, however, that the proposed methodology is fully general and different low-level and application benchmarks can be used.
2) To enrich the basic descriptive set of Grid resources, supplied by the middleware Information Service (IS), e.g. the Globus Monitoring and Discovery System (MDS), with a detailed description of benchmark activity, without requiring changes to the underlying IS.
Relying on this deeper knowledge of Grid resources, we designed GREEN (GRid Environment ENabler), a Grid service addressed both to Grid administrators and users. It assists administrators in the insertion of benchmark information related to every Physical Organization (PO) composing the Grid, and provides users with features which a) facilitate the submission of job execution requests, by specifying both syntactic and performance requirements on resources; b) support the automatic discovery and selection of the most appropriate resource(s). Moreover, as a major design guideline, GREEN aims to be as far as possible independent of the middleware supporting the various POs that compose the Grid. GREEN behaves as a bridge between the application level and the middleware: leveraging standards such as JSDL (Job Submission Description Language) to describe job requirements [14] and GLUE (Grid Laboratory Uniform Environment) [15] to describe resources, GREEN relieves users and applications of the burden of managing the different versions of these languages (currently operating in the various middlewares) by handling this task (i.e. mapping and translating) internally. In this respect, GREEN accomplishes a considerable step towards Grid interoperability.
The outline of the paper is as follows. Section 2 discusses some of the main contributions in the field of job and resource characterization languages. Section 3 introduces the two-level benchmarking methodology. Section 4 describes some design issues of GREEN, and reports a detailed analysis of the extensions made to existing languages to obtain a benchmarking-aware representation of resources. Section 5 discusses some results obtained through simulation in order to assess the validity of the proposed approach. Section 6 briefly describes some related work, and Section 7 gives some concluding remarks.
2 Job and Resource characterization
To accomplish the matchmaking task, a proper description of resources is required both at the job/user side and at the owner side. To this end, from the very beginning of the Grid, a number of languages have been proposed, originating from the activity of different projects and research groups. The most remarkable of these, constituting the basis of our benchmarking-driven matchmaking solution, are the subject of the next two subsections.
2.1 Describing job submission

A Job Submission Request (JSR), in addition to stating the application-related attributes (e.g. name and location of source code, input and output files), should express syntactic requirements (e.g. number of processors, main memory size) and ranking preferences (if any) to guide and constrain the matching process on resources. The three Job Submission Languages (JSLs) most widely adopted by the Grid community are the Globus Resource Specification Language (RSL) [16] and its successor Job Description Document (JDD) [17], the EU-DataGrid Job Description Language (JDL) [18], and JSDL [14], proposed by a Working Group of the Grid Forum. In the following we highlight their major properties and differences, pointing out the degree of support for the expression of requirements on resources.

2.1.1 RSL/JDD

The Globus Alliance [19] deployed two successive versions of its JSL: RSL, present in the pre-Web Services versions of the Globus Toolkit, was replaced by JDD in the Web Services versions (GT3.2 and GT4 [20]). JDD substitutes the attribute-value pair lists, used by its predecessor to describe the jobs to submit, with a more structured and platform-neutral XML format, closer to the dialects used in the Web Services Resource Framework (WSRF) family of specifications for Web Services. A simple example of a JDD document describing a sequential job (in particular the isosurface application) submission is:

<job>
  <executable>${GLOBUS_USER_HOME}/isovalue</executable>
  <argument>inputvolume.raw</argument>
  <argument>200</argument>
  <stdout>${GLOBUS_USER_HOME}/isosurface.wrl</stdout>
  <fileStageIn>
    <transfer>
      <sourceUrl>gsiftp://mypc.ge.imati.cnr.it/home/dago/inputvolume.raw</sourceUrl>
      <destinationUrl>file:///${GLOBUS_USER_HOME}/inputvolume.raw</destinationUrl>
    </transfer>
  </fileStageIn>
  <fileStageOut>
    <transfer>
      <sourceUrl>file:///${GLOBUS_USER_HOME}/isosurface.wrl</sourceUrl>
      <destinationUrl>gsiftp://mypc.ge.imati.cnr.it/home/dago/isosurface.wrl</destinationUrl>
    </transfer>
  </fileStageOut>
  <fileCleanUp>
    <deletion>
      <file>file:///${GLOBUS_USER_HOME}/inputvolume.raw</file>
    </deletion>
    <deletion>
      <file>file:///${GLOBUS_USER_HOME}/isosurface.wrl</file>
    </deletion>
  </fileCleanUp>
</job>
The main purpose of such a document is to set the parameters for the correct execution of a job. In the JDD schema it is possible to specify a minimal set of requirements, such as the minimum amount of memory, or to specify further information such as the expected maximum amount of CPU time. Furthermore, it is possible to extend the schema with user-defined elements. An example of an extended version of RSL (namely XRSL), introducing new attributes aimed at matching complex tasks (e.g. a High Energy Physics production), is given in [21]. However, their use depends on the capabilities of the scheduler available on the target machine. Actually, Globus is a toolkit and therefore does not provide high-level solutions such as a broker.

2.1.2 JDL

The European DataGrid Project [23] introduced JDL, afterwards adopted by its extension, the EGEE project [24]. It is based on the ClassAd language [16,25] and can be used as the language substrate of distributed frameworks. A JDL document contains a flat list of argument-value pairs, specifying two classes of job properties: job-specific attributes (e.g. Executable, InputSandbox) and resource-related properties (e.g. Requirements and Rank) used to guide the matching process towards the most appropriate resource(s). As shown below, a JDL file lacks any kind of structure, as typically supplied by an XML document like a JDD one.

[
  JobType = "Normal";
  Executable = "$(USER_HOME)/isovalue";
  Arguments = "inputvolume.raw 200";
  StdOutput = "isosurface.wrl";
  InputSandbox = {"/home/dago/inputvolume.raw"};
  OutputSandbox = {"isosurface.wrl"};
  Requirements = (other.GLUEHostOperatingSystemName == "linux");
  Rank = other.GLUECEStateFreeCPUs;
]
As to the Requirements and Rank attributes, it is worth noting that their values can be arbitrary expressions which use the fields published by the resources in the MDS; they are not part of the predefined set of JDL attributes, as their naming and meaning depend on the adopted Information Service schema. In this way JDL is independent of the resource information schema adopted, allowing the discovery of resources described by different Information Systems without any changes in the JDL [18].

2.1.3 JSDL

The Job Submission Description Language, proposed by the JSDL Working Group [26] of the Global Grid Forum, aims to synthesise some of the more consolidated and common features present in other JSLs, to produce a standard language for the Grid. JSDL contains a vocabulary and a normative XML Schema facilitating the declaration of job requirements as a set of XML elements. As in JDL, job attributes may be grouped into two classes. The JobIdentification, Application and DataStaging elements describe job-related properties. The Resources element lists some of the main attributes used to constrain the selection of the most suitable resources (e.g. CPUArchitecture, FileSystem, TotalCPUTime). As reported in [14], the pseudo-schema definition of the JSDL JobDefinition element is:

<JobDefinition>
  <JobDescription>
    <JobIdentification ... />?
    <Application ... />?
    <Resources ... />?
    <DataStaging ... />*
  </JobDescription>
  <xsd:any##other/>*
</JobDefinition>
As only a rather reduced set of these elements is stated by the JSDL schema, an extension mechanism is foreseen, which provides two patterns for extension: adding attributes to any existing JSDL element, and adding elements (as in the case of JobDefinition reported above, allowing the insertion of any element from other namespaces). In [27,28] examples of JSDL extensions able to capture a finer-grain description of the degree of parallelism of jobs are presented. A detailed example of a JSDL document is given in Section 4.3, where the extensions introduced to the language are outlined.
2.2 Describing resources

At the resource side, adequate information is required to advertise static (e.g. OS, number of processors) and dynamic (e.g. number of executing tasks, amount of free memory) properties. The main efforts in the direction of a standard resource language came from the DataTAG [29], iVDGL [30], Globus [19], and DataGrid [23] projects, which collaborated to agree upon a uniform description of Grid resources. This effort resulted in the GLUE schema, a conceptual model of Grid entities, which comprises a set of information specifications for Grid resources that are expected to be discoverable and subject to monitoring [31]. The schema has major components to describe Computing and Storage Elements, and also generic Service and Site information. An implementation of this information model is provided in [32] as a concrete data model through an XML Schema. As the schema has evolved over the years, different versions have been used by the various middlewares. The schema version deployed with gLite 3 is currently 1.2, but version 1.3 is planned to be adopted shortly [33]. As to Globus, the file format produced by the WS MDS since GT 3.2 is largely based on the XML translation of the GLUE schema 1.1 [34]. To support the adoption of the different schema versions without requiring (or limiting to the minimum) explicit and invasive actions by the administrators, a tool able to transparently adapt to the different middlewares running on the POs participating in a Grid is particularly welcome. Our work goes in the direction of extending JSDL and GLUE with elements able to represent more exhaustively the performance characterization of resources based on benchmarks.

3 Benchmarking on Grid

The characterization of computational resources is usually based on their performance capacity evaluated through benchmarks. Computer benchmarking provides a commonly accepted basis for comparing the performance of different computer systems in a fair manner, and it is used to investigate the behaviour of resources by stressing particular aspects of performance. Considering traditional microprocessors and HPC systems, two main categories of benchmarks can be outlined:
1. Micro-benchmarks or low-level probes, to profile resources considering isolated low-level capabilities such as CPU, memory, and interconnection speed. In this context well-known tools are MAPS [35], STREAM [36] and SKaMPI [37], and examples of reference metrics are the number of floating-point operations per second to express CPU performance, and latency and bandwidth to evaluate interconnection performance.
2. Application-specific benchmarks, to measure the performance of resources by stressing several aspects of the system simultaneously; they correspond to the computationally demanding part of a class of real applications. In HPC, examples of widespread tools are the NAS Parallel Benchmarks (NPB) [38], used to capture the computational requirements of a class of computational fluid-dynamics problems, and the parallel High-Performance LINPACK benchmark [39], representative of a class of linear algebra problems. The LINPACK benchmark is used as the performance measure for ranking computer systems in the TOP500 project [40], aimed at detecting trends in high-performance computing by providing an updated list of the 500 most powerful computer systems.
Another possibility for this class of benchmarks is to directly use a "light" version of the application of interest, with a reasonable computational cost but still representative of the real behaviour. The importance of benchmarking in the evaluation of the computational resources of Grid environments is largely acknowledged, together with the criticality that this task implies [41,42]. In fact, due to the peculiar nature of the Grid, performance evaluation in a dynamic, heterogeneous context is more complex and less deterministic than in traditional single-machine or even cluster scenarios. Another issue is related to the kind of evaluation pursued by the benchmarking activity: the Grid has a multi-layered structure, therefore benchmarks investigating performance aspects of the different Grid layers should be considered in order to grasp the predictable behaviour of a real application run. Actually, besides the spectrum of interesting parameters to measure (CPU, memory, ...), several variables have to be taken into account when considering the execution of a benchmark suite on a Grid environment: the impact of the underlying Grid middleware, the bandwidth limitation of the interconnections linking the various POs, and the simultaneous demand by competing applications for limited resources.
To evaluate these aspects, different tools could be considered. The Grid Assessment Probes [5] test and measure the performance of basic Grid functions, such as job submission, file transfers, and Grid Information Services. The GridBench tool [43] provides a graphical interface to define, execute and administer benchmarks, considering also interconnection performance and resource workload. It supports the construction of custom graphs to analyse results, and enables the ranking of Grid resources following user-driven specifications. The NAS Grid Benchmark (NGB) suite [44], also known as the suite of ALU Intensive Grid Benchmarks (AIGB) [45], defines a set of computationally intensive, synthetic Grid benchmarks, which are representative of scientific, post-processing and visualization workloads. It uses four kinds of data-flow graphs (DFG), according to parallel paradigms extracted from real applications at NASA. In these graphs, each node is a NAS Parallel Benchmark task, and the edges represent data or control flow.
In this paper, we propose a methodology to facilitate the matchmaking process based on information about the performance capacity of computational resources. To set aside complexity aspects that may hinder a performance evaluation of Grid environments, we focused our attention on resources in isolation. Taking into account the influence of middleware layers and interconnection performance would possibly result in a more realistic picture of the Grid, but may compromise a deterministic and reproducible view of resource activity. Future developments of this work will capture some of these aspects, focusing on the bandwidth limitation of the interconnections linking the various POs.
3.1 A two-level benchmarking methodology

To describe Grid resources, we present a two-level methodology aimed at facilitating the matchmaking process. It is based on two integrable approaches: 1) the use of micro-benchmarks, to supply a "zero-degree" resource description, mainly based on the low-level performance metrics reported in Table 1; 2) the deployment of application-driven benchmarks, to get a closer insight into the behaviour of resources for a class of applications under more realistic conditions. In this case, the metric chosen to characterize resources is the execution time evaluated on a reference dataset with specific parameters. This second level of benchmarks profits from a closer knowledge of the characteristics of the application at hand. For example, depending on the application domain, a job could be more computing than data intensive, and therefore it benefits from a faster CPU rather than from greater memory capacity or more advanced I/O devices. Through application-driven benchmarks it is possible to assign an evaluation to resources on the basis of the indicators that are most stressed by the application.
The decision to supply a double level of benchmarks is a response to users with different degrees of confidence in the behaviour of their applications. In fact, a more expert user is likely to give preference to the application-driven benchmark that better expresses the requirements of his/her application. On the contrary, a user with scarce knowledge of the job to execute may, at least, choose among the metrics and related micro-benchmarks that, he/she presumes, more closely describe the job. Moreover, application-specific benchmarks are particularly advisable for frequently used applications.
According to our methodology, every resource of a PO is tagged with the results obtained through the two levels of benchmarks. This evaluation process plays a significant role, as during the matchmaking phase resources become selectable on a performance basis, allowing a better fit to the job needs. Depending on the context of use, e.g. number of POs, kind of applications and their execution time, it is critical to choose when benchmarks have to be executed. Two different approaches may be followed: 1) to execute benchmarks once for each resource, at installation time or when significant changes at the hardware level occur, and publish the results in the information system [46]; 2) to execute benchmarks periodically, or when users ask for the characterization of the resources with respect to specific values [43,47]. Though our methodology is meant to be applied in generic contexts of use, independently of any management policies or access rules, we envisage that a reasonable approach for the submission of benchmarks to the Grid and the subsequent tagging of resources is to delegate the deployment to the administrator. The administrator is supported in the task of creating and maintaining the performance description by our proposed tool. In particular, if a class of applications is often used, the administrator of the PO will be required to execute the related application-driven benchmarks on the computational resources. Benchmarks are executed on idle CPUs, i.e. the only process imposing significant load on the CPU is the benchmark itself. Then a user may avail himself of the performance-based tagging mechanism, by explicitly expressing via a JSL document, as explained in Section 4, his preferences about the expected performance of the resource on which his job has to be executed.
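Purely as an illustration of this two-level tagging (not of GREEN's actual internal format, and with hypothetical key names), a resource description enriched by the methodology can be pictured as a record holding both classes of values:

# Illustrative sketch only: a possible in-memory view of a resource tagged with
# two-level benchmark results. The layout and key names are hypothetical; the
# 478 MFLOPS figure is the one used for cluster1 in the example of Section 4.2,
# the other numbers are placeholders.
resource_tags = {
    "cluster1.ge.imati.cnr.it": {
        "micro": {              # zero-degree level: low-level metrics of Table 1
            "Flops_MFLOPS": 478.0,
            "Stream_MBps": 2410.0,          # placeholder value
        },
        "application": {        # application-specific level: execution times (s)
            "IsoSurface_Benchmark": 300.0,  # placeholder value
            "IPB": 120.0,                   # placeholder value
        },
    },
}

def benchmark_value(resource_id, level, name):
    """Return the stored value for a benchmark of a given level, or None."""
    return resource_tags.get(resource_id, {}).get(level, {}).get(name)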
To show the meaningfulness of benchmarking resources through the different metrics exploited by the two-level methodology, we present in Section 5.1 the results we gathered by executing micro and application benchmarks on a composite and heterogeneous testbed.
3.2 Zero-degree level

In order to supply a "zero-degree" resource description we consider the use of traditional micro-benchmarks. A reasonable assumption is that the performance of a Grid node mainly depends on the CPU, the memory and cache, and the interconnection speed [47]; therefore, we selected a concise number of parameters to evaluate in order to provide an easy-to-use description of resources. Table 1 shows the resource properties and related metrics measured by the micro-benchmarks we selected to characterize computational resources. These metrics are well established for evaluating the performance capacity measured by each benchmark.
Table 1 Low-level benchmarks and related metrics

Resource capability   Benchmark    Metric
CPU                   Flops        MFLOPS
Memory                Stream       bandwidth in MBps
Memory-Cache          CacheBench   bandwidth in MBps
Interconnection       Mpptest      MBps with different message sizes
I/O                   Bonnie       MBps
Flops provides an estimate of peak floating-point performance (MFLOPS) by making maximal use of register variables with minimal interaction with main memory. The execution loops are chosen so that data will fit in any cache [7].
Stream is the de facto industry-standard benchmark for measuring sustained memory bandwidth. It is specifically designed to work with datasets much larger than the available cache on any given system, so that the results are (presumably) more indicative of the performance of very large, vector-style applications [8].
CacheBench is designed to evaluate the performance of the memory hierarchy of computer systems. Its specific focus is to characterize the performance of the possibly multiple levels of cache present on the processor. Performance is expressed as raw bandwidth in megabytes per second [9].
Mpptest measures the performance of some of the basic MPI message-passing routines in a variety of situations. In addition to the classic ping-pong test, Mpptest can measure performance with many participating processes (exposing contention and scalability problems) and can adaptively choose the message sizes in order to isolate sudden changes in performance [10].
Bonnie performs a series of tests on a file of known size. If the size is not specified, Bonnie uses 100 megabytes. For each test, Bonnie reports the bytes processed per elapsed second, per CPU second, and the percentage of CPU usage (user and system) [11].
The micro-benchmarks used in this phase generally return more than one value. In order to obtain easily usable results in the matchmaking process, depending on the metrics, we considered for each benchmark either a synthetic parameter or the selection of the most significant value. Only for Mpptest did we consider different results corresponding to various message lengths (actually our system manages this benchmark as a set of benchmarks, one for each message size). Once the benchmark results are returned, these values are used to characterize resources by populating the benchmark description managed by our tool.
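The reduction of multi-valued micro-benchmark output to single figures, and the handling of Mpptest as a family of per-message-size benchmarks, could be sketched as follows (the reduction rules and names are illustrative assumptions, not the exact ones used by GREEN):

# Hedged sketch: collapse raw micro-benchmark output into single values usable
# by the matchmaker. The choice of taking the maximum is purely illustrative.
def reduce_micro_results(raw):
    """raw: dict mapping tool name -> measured values (list, or dict for Mpptest)."""
    tagged = {
        "Flops_MFLOPS": max(raw["flops"]),
        "Stream_MBps": max(raw["stream"]),
        "CacheBench_MBps": max(raw["cachebench"]),
        "Bonnie_MBps": max(raw["bonnie"]),
    }
    # Mpptest is managed as a set of benchmarks, one for each message size.
    for msg_size, mbps in raw["mpptest"].items():
        tagged[f"Mpptest_{msg_size}B_MBps"] = mbps
    return tagged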
3.3 Application-specific level

Micro-benchmarks are a good solution when the user has little information about the job she/he is submitting to the Grid, or when the job is seldom executed in the Grid environment. However, to effectively evaluate system performance, application-driven benchmarks are surely more suitable, as they better approximate the real application workload. To this end, we introduce a second level of benchmarks able to describe resources on this basis. In our opinion, application-driven benchmarks are more capable than micro-benchmarks of augmenting the characterization of Grid resources. As new applications are tested, and performance results are gathered and associated to resources, a bigger picture of the Grid potential emerges. This changing and evolving view of resources integrates the static information supplied by the Grid IS. As we discuss in Section 4.1, a tool able to submit and save application benchmarks will better exploit Grid resources through an enriched description.
We considered some applications of our interest, i.e. image processing, isosurface extraction, and linear algebra. For the first two applications, we derived a light version of the code. More precisely, we selected a reference data set and specific parameters in order to avoid long executions while still maintaining a representative run of the real application. In this way, we avoid large costs for running benchmarks, obtaining at the same time a good characterization of each application. With respect to image processing [12], we selected a compute-intensive elaboration applied to a reference image of about 1 MB. In particular, we considered an edge detection that mainly stresses CPU metrics and memory speed with a small data size (Image Processing Benchmark, IPB). The isosurface extraction application [13], besides having an impact on CPU and memory, also heavily involves I/O operations. For this reason, to limit execution times, we considered the processing of a small 3D data set of 16 MB, producing a mesh made of 4 million triangles (ISOsurface extraction benchmark, ISO). To represent the class of applications based on linear algebra, we used the well-known High-Performance Linpack benchmark (HPL) [39]. As already mentioned, for these application-driven benchmarks the metric considered to characterize resources is the execution time. Similarly to the micro-benchmarks case, the resulting values are stored in the internal memory of our proposed tool.
4 Job-resource benchmark-aware matchmaking
One of the main issues in Grid computing is the "clever" discovery and selection of resources, so that a user or an agent can find as quickly as possible the resources he/she needs. However, Grid middleware usually offer basic services for retrieving information on single resources, and are often inadequate with respect to the more detailed and specific
user requirements. A huge gap separates users and resources, and tools that allow the two parts to better come to an agreement are essential. In [48] we presented GEDA, a Grid service based on a distributed and cooperative approach for Grid resource discovery, which establishes a structured view of resources (single machines, homogeneous and heterogeneous clusters) at the PO level, and leverages an overlay network infrastructure connecting the various POs constituting a Grid. In this work, we present an advanced version of GEDA called GREEN (GRid Environment ENabler), able to provide users with a detailed performance characterization of Grid resources based on benchmark evaluations. To this end, GREEN assists administrators in the management of the benchmark information related to every PO composing the Grid. For each PO, a two-level tree of Grid Information Services (IS) is defined. The lower level IS represents clusters of homogeneous or heterogeneous machines (HoC and HeC in Figure 1). The upper one represents all the resources of the PO, which can be either clusters or single machines. A single machine, for example, may be a multi-core or a many-core system. Every GREEN instance relies on the root node of the tree to keep updated information about the state of all the PO's resources, and to exchange it with other GREEN instances when discovery operations are required.
Figure 1 An example of Grid environment with a number of interconnected POs.
4.1 GREEN as a Grid enabler

Acting as a distributed matchmaker, GREEN manages and compares the enriched view of resources with user-submitted jobs, with the goal of selecting the most appropriate resources. Operating at an intermediate level between applications (e.g. schedulers) and Grid middleware, GREEN aims to discover the set of all resources satisfying user requirements, ordered by rank, not to select any particular one amongst them. This task is left to a (meta)scheduler to which the resource set is passed, allowing it to apply the preferred scheduling policies to optimize Grid throughput or other target functions (e.g. response times, QoS, ...). Once the "best" resource is chosen, GREEN will be re-invoked to carry out the submission of the job on it, via the Execution Environment (e.g. Globus GRAM). The design of GREEN has been directed by the following guidelines: a) to allow the insertion of benchmark information by system administrators, without requiring any changes or extensions to the underlying middleware (e.g. Globus, gLite); b) to manage the benchmark information in a semi-automatic way with minimal user intervention; c) to translate the JSDL document submitted by the user into the specific JSL processed by the middleware (we considered GT4 and gLite); d) to assure maximum interoperability, i.e. to allow an easy extension to other middlewares, simply by developing the proper API.
Figure 2 Component view of GREEN.

Figure 2 shows the main components of a GREEN instance and their interactions with other middleware services, notably the Information System (IS) and the Execution Environment (EE). The Job Submission (JS) component is the main interface to GREEN functionalities; it receives requests for benchmark submission by the PO administrator, or job submissions initiated by users. Depending on the activation mode (according to the different published signatures), it behaves just like a message dispatcher or as a translator of JSL documents, carrying out their subsequent submission to the EE. The main task of the Benchmark Evaluation (BE) component is to support the administrator in the characterization of PO resources on the basis of their benchmark-measured performance. The Matchmaker performs the core feature of GREEN: the matching of Grid resources with the requirements expressed by the users through the Job Submission Request, and their subsequent ranking. As these three components are discussed in depth in the next Subsections, here we take a closer look at Resource Discovery (RD).
RD is in charge of feeding GREEN with the state of Grid resources. It operates both locally and globally by carrying out two tasks: 1) to discover and store the state of its PO resources; 2) to dispatch requests to other GREEN instances. As to the first task, RD periodically dialogues with the underlying IS (e.g. MDS, gLite IS) (step A) and produces a document (namely the PO snapshot), having an XML format largely conformant to the GLUE version adopted by the underlying middleware (step B). The PO snapshot is asynchronously "consumed" by the Matchmaker to answer the queries issued by clients (e.g. users, meta-schedulers, ...) or by other GREEN instances (steps 4.1 and 7 of Figure 3). As explained in the next subsection, for consistency reasons related to the management of the benchmark characterization of resources, a comparison is always made between the current PO snapshot and the new state image returned by the IS: if relevant divergences occur, events are triggered before the old copy is replaced. Note that the syntactic differences among the various versions of GLUE are managed by a conversion mapping at matching time. This way GREEN is able to deal with different underlying middlewares transparently to Grid users and applications. To accomplish the dispatching task, RD handles the so-called neighbours view, after the activation of the Matchmaker (step 4.2 of Figure 3). Depending on the number of POs, i.e. of running GREEN instances, the neighbours view may be limited to a short list of network addresses to be contacted individually, or deployed via more complex data structures and algorithms like those used in Super-Peer networks, such as DHT [49] or random walk [50,51]. Whatever the solution adopted, RD listens to the incoming responses, interacts with the Matchmaker to retrieve the matching resources and transfers them to the requiring component.
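The consistency check performed by RD on the PO snapshot can be sketched as follows; function and attribute names are hypothetical and the sketch only mirrors the logic described above:

# Illustrative sketch of RD's snapshot refresh: compare the current PO snapshot
# with the new state image returned by the IS and trigger re-benchmarking events
# for resources whose relevant (hardware-level) description has changed.
RELEVANT_KEYS = ("cpu_model", "n_cores", "ram_gb")   # hypothetical attribute names

def refresh_po_snapshot(current, new, trigger_benchmark_event):
    for res_id, new_desc in new.items():
        old_desc = current.get(res_id)
        if old_desc is None or any(old_desc.get(k) != new_desc.get(k) for k in RELEVANT_KEYS):
            trigger_benchmark_event(res_id)   # previously computed benchmarks are stale
    return new   # the new image replaces the old PO snapshot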
4.2 Benchmarking Grid resources

Before users are allowed to express performance preferences through JSDL documents, GREEN has to create the enriched view of Grid resources. To this end, GREEN supplies Grid administrators with the facility of submitting and executing benchmarks (both micro and application oriented) against the resources belonging to their administrative domain (PO) and storing the results. Once a benchmark is executed and its results collected, their values are stored into GREEN internal data structures. To reflect the underlying view of Grid resources offered by the GLUE 2.0 specification language and to support the matching mechanism (i.e. the comparison with the resource information contained in the XML PO snapshot), the benchmark-value copies are directly represented as GLUE entities according to the XML reference realizations of GLUE 2.0 [32] (specifically the Benchmark_Type complex type and the BenchmarkType_t open enumeration type). An example of a benchmark document related to the execution of the micro-benchmark Flops against the resource cluster1.ge.imati.cnr.it, resulting in 478 MFLOPS, is:

<Benchmark>
  <ID>cluster1.ge.imati.cnr.it</ID>
  <Type>MFLOPS</Type>
  <Value>478</Value>
  <BenchLevel>micro</BenchLevel>
  <Order>descending</Order>
</Benchmark>
As BenchmarkType_t defines an open enumeration (that is, an extensible set of values) according to GLUE 2.0, it is possible to choose for the Type element amongst a list of six values (e.g. specint2000, specfp2000, cint2006, ...); nevertheless, any other value compatible with the string type and with the recommended syntax is allowed. We can therefore deal with an increasing list of new benchmark types without having to modify the document schema.
Let us now explain in more detail how GREEN supports administrators in the task of benchmark evaluation. Initially, for any benchmark relevant to his PO, the administrator submits a JSDL document, with no indications about the target resource, to the Grid, through the Job Submission Port (JSP) of the GREEN instance associated with his PO (step 1 of Figure 2). After translating the JSDL document into the particular JSL format supported by the middleware of the PO (e.g. JDL for gLite, JDD for Globus), JSP passes it to the Benchmark Evaluator port (step 2), which calls the EE to execute the benchmark against all the alive resources/machines (reported in the PO snapshot document) (step 3). When the results are returned, an XML fragment, similar to the one reported above, is created for each resource and inserted in an XML document (that we denote as the Benchmark image), which collects all the benchmark evaluations for the PO (step 4). Since changes occurring in the PO network/grid may affect the consistency of the previously computed benchmarks (e.g. an upgrade of CPU or RAM in a cluster, ...), GREEN triggers an event to force the computation of the benchmarks for all the interested resources.
Through the use of the extension mechanism (based on the Extension class) as defined in the GLUE specification, we extended the Benchmark_Type by adding two sub-elements: BenchLevel and Order. The first reports the level of the benchmark and accepts two string values, according to our two-level methodology, i.e. micro and application. The Order sub-element is used to inform the matchmaker of the direction (ascending or descending) in which resources are to be ranked (see the next Section for a discussion of the use of the Order tag).
4.3 Extending JSDL

The counterpart of benchmarking resources is the ability, for users submitting a job, to express their preferences about the performance of the target machines. Resources are then ordered according to performance values (ranks). To this end, a mechanism is required to allow users to explicitly state these requirements inside the job submission document. As explained in Section 2, contrary to JDL, which provides the definition of a ranking element, JDD and especially JSDL do not provide any construct able to express a preferential ordering on the selected resources. However, as the JSDL mission is to provide a standard language to be used on top of existing middleware, and considering the interoperability issue motivating the design of GREEN, the natural choice has been to extend the JSDL schema to take ranking specifications into account. Therefore, we introduce an element Rank (of complex type Rank_Type) devoted to this task. To maintain a desirable, although not mandatory, uniform lexicon between the JSDL constructs at the job side and the GLUE description at the resource side, we borrowed from the GLUE extension the definition of BenchmarkType_t, which is embedded as a sub-element of Rank together with a Value element holding the required performance threshold; the resulting structure is visible in the extended JSDL example below.
Although, at present, we are mainly interested in performance-based ranking, we provide the Rank element with the possibility of dealing with new ranking mechanisms that may prove useful in the future (e.g. QoS, real-time metrics). In this context, the meaning of the Value element is to be intended as a lower or upper threshold (depending on the type of benchmark), used by the matchmaker to filter out resources before the ranking takes place. This filtering is realized through the combined use of the corresponding Value element (related to the benchmark stated by Type) along with the Order element contained in the extended Benchmark element of the modified GLUE 2.0 Schema and associated to any resource. For example, in the case of a job with a latency requirement, the declared Value is to be intended as low as possible (e.g. no more than 9 ms), i.e. as an upper bound, while in the case of MFLOPS the Value is to be as high as possible (e.g. at least 400 MFLOPS or more). In the first case, only resources with a benchmark value less than or equal to the one required are returned, as opposed to the second case. Through the Order value (ascending or descending), the matchmaker is able to decide which resources are to be returned and their related order.
As we are interested in the execution of parallel applications, we borrowed from SPMD [28] an extension to JSDL which supports users with a rich description set related to concurrency aspects (e.g. number of processes, processes per host). The following is an example of an extended JSDL document, containing the extensions relating to the parallel requirements (as proposed by SPMD) along with our extension concerning ranking on benchmark specification. In particular, the document requests nodes executing the application-level IsoSurface_Benchmark in no more than 300 seconds. Note how the Rank element has been located inside the Resources one, according to the extension mechanism provided by the JSDL schema.

<JobDefinition>
  <JobDescription>
    <JobIdentification>
      <JobName>ParIsoExtraction</JobName>
    </JobIdentification>
    <Application>
      <SPMDApplication>
        <Executable>parisoextraction</Executable>
        <Argument>inputvolume.raw</Argument>
        <Argument>200</Argument>
        <Output>isosurface.raw</Output>
        <NumberOfProcesses>4</NumberOfProcesses>
        <ProcessesPerHost>2</ProcessesPerHost>
        <SPMDVariation>http://www.ogf.org/jsdl/2007/02/jsdlspmd/MPICH2</SPMDVariation>
      </SPMDApplication>
    </Application>
    <Resources>
      <OperatingSystem>
        <OperatingSystemType>
          <OperatingSystemName>LINUX</OperatingSystemName>
        </OperatingSystemType>
      </OperatingSystem>
      <Rank>
        <Type>IsoSurface_Benchmark</Type>
        <Value>300</Value>
        <BenchLevel>application</BenchLevel>
        <Order>descending</Order>
      </Rank>
    </Resources>
    <DataStaging>
      <FileName>inputvolume.raw</FileName>
      <CreationFlag>overwrite</CreationFlag>
      <DeleteOnTermination>true</DeleteOnTermination>
      <Source>
        <URI>http://mypc.ge.imati.cnr.it/home/dago/inputvolume.raw</URI>
      </Source>
    </DataStaging>
    <DataStaging>
      <FileName>isosurface.wrl</FileName>
      <CreationFlag>overwrite</CreationFlag>
      <DeleteOnTermination>true</DeleteOnTermination>
      <Target>
        <URI>http://mypc.ge.imati.cnr.it/home/dago/isosurface.wrl</URI>
      </Target>
    </DataStaging>
  </JobDescription>
</JobDefinition>
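A minimal sketch of the threshold-plus-ordering rule described above, under the assumption that a descending Order marks metrics where higher is better (the threshold acting as a lower bound) and an ascending Order marks metrics where lower is better (the threshold acting as an upper bound); names are illustrative, not GREEN's API:

# Hedged sketch of the matchmaker's filtering and ranking on a benchmark value.
def filter_and_rank(resources, bench_type, threshold, order):
    """
    resources: iterable of (resource_id, {benchmark_type: value}) pairs.
    order: 'descending' -> higher is better, threshold is a lower bound;
           'ascending'  -> lower is better, threshold is an upper bound.
    """
    matched = []
    for res_id, benchmarks in resources:
        value = benchmarks.get(bench_type)
        if value is None:
            continue                     # resource not tagged with this benchmark
        if (order == "descending" and value >= threshold) or \
           (order == "ascending" and value <= threshold):
            matched.append((res_id, value))
    return sorted(matched, key=lambda item: item[1], reverse=(order == "descending"))

# Example: keep only resources delivering at least 400 MFLOPS, best first.
print(filter_and_rank([("cluster1", {"MFLOPS": 478}), ("nodeX", {"MFLOPS": 320})],
                      bench_type="MFLOPS", threshold=400, order="descending"))
# -> [('cluster1', 478)]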
As detailed in the next Section, the availability of a unique format to express job submission requirements, independent of the underpinning middleware while capturing the best aspects of other languages (e.g. ranking constructs from JDL, XML structuring from JDD), is an essential feature supplied by GREEN in the direction of a truly interoperable Grid matching tool.
4.4 Benchmark-driven matchmaking

While the previously discussed extensions of GLUE and JSDL are essential to develop a tool able to better capture the requirements of users and to operate as much as possible independently of the specificities of the underlying middleware layers, the core of GREEN is constituted by the Matchmaker component, in charge of accomplishing the distributed resource discovery. To describe our matchmaking strategy we consider a Grid composed of three POs, namely PO1, PO2 and PO3. Figure 3 shows the interactions occurring between the main GREEN components, along with the dataflows exchanged during the distributed matchmaking process.
More in detail: a user submits an extended JSDL document through the Grid portal (step 1). The document is managed by the Resource Selector component, which initiates the distributed matchmaking by forwarding it to the Job Submission component of a randomly selected GREEN instance (e.g. PO2) (step 2). JS activates the Matchmaker (step 3). This instance of the matchmaker will be responsible for providing the set of candidate resources to the Resource Selector, acting as a Master Matchmaker (MM) for this specific request. The Master Matchmaker checks its local memory (step 4.1) and contemporaneously activates the Resource Discovery (step 4.2), which forwards the document to all the Matchmaker components of the other known GREEN instances (the PO1 and PO3 instances) (steps 5-6) through their RDs. Note that the PO snapshots are filtered by the matchmakers in order to extract the set of PO resources satisfying the query (steps 4.1 and 7). By analysing the pre-computed Benchmark image, the satisfying resources, i.e. those having a Value element (for the chosen benchmark) that fulfils the threshold fixed in the corresponding Rank element of the JSDL document, are extracted. The specification of the adopted policy is beyond the scope of this work. The resource identifiers and their corresponding benchmark values are included in a list, called the PO list, which is returned to the Master Matchmaker through the RD components (steps 8-10). MM merges these lists with its own PO list and produces a Global List ordered on the ranking values. The Global List is passed to the JS (step 11), which returns it back to the Resource Selector (step 12). RS applies its scheduling policy to determine the resource to use, and activates the JS of the GREEN instance responsible for the PO owning the selected machine (GREEN PO1's instance in our case) by sending it the extended JSDL document along with the data identifying the selected resource (step 13). The JS translates the information regarding the job execution of the original JSDL document into the format proper to the specific middleware of the PO, stating the resource on which the computation will take place. In particular, it produces a JDD document for GT4 resources and a JDL document for the gLite ones. Finally, it activates the Execution Environment in charge of executing the job represented in the translated document (step 14).
It is worth noting that the PO lists we use are middleware-independent, in the sense that each machine is represented by a tuple of values. In the current work, a resource is represented by the triple (GREEN_id, resource_id, rank_value), where GREEN_id is the GREEN instance managing the resource identified by resource_id, as at present we support a selection strategy based only on the ranking value. A richer set of resource properties is to be provided in the resource tuple if more complex scheduling strategies are to be deployed.
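The merge performed by the Master Matchmaker on the returned PO lists can be sketched as below, with each entry being a middleware-independent (GREEN_id, resource_id, rank_value) triple; the identifiers in the example are hypothetical:

# Hedged sketch of the Master Matchmaker's merge into the Global List
# (Figure 3, steps 8-11): union of the PO lists, ordered on the ranking value.
def merge_po_lists(po_lists, higher_is_better=True):
    """po_lists: iterable of lists of (GREEN_id, resource_id, rank_value) triples."""
    global_list = [triple for po_list in po_lists for triple in po_list]
    global_list.sort(key=lambda t: t[2], reverse=higher_is_better)
    return global_list

# Example with two PO lists (rank values in MFLOPS, best first).
global_list = merge_po_lists([
    [("GREEN-PO1", "cluster1.ge.imati.cnr.it", 478)],
    [("GREEN-PO2", "nodeA.po2.example", 650), ("GREEN-PO2", "nodeB.po2.example", 320)],
])
# -> [('GREEN-PO2', 'nodeA.po2.example', 650),
#     ('GREEN-PO1', 'cluster1.ge.imati.cnr.it', 478),
#     ('GREEN-PO2', 'nodeB.po2.example', 320)]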
Figure 3 The distributed matchmaking process carried out by three GREEN instances, one for each PO.

A second aspect reinforcing interoperability with respect to different middleware environments is related to the exchange of the JSDL document forwarded among cooperating POs. The complete translation process, performed by the JS, takes place only once, after the matching process is concluded and the resource selected, just before the submission to the EE is made by the GREEN instance responsible for the machine executing the job (step 14). Until that moment each GREEN instance deals with a (standard) extended JSDL document, from which the requirement section is extracted and compared (through mapping) with the counterpart expressed via the GLUE representation adopted by the middleware, while the ranking-related portions are directly compared with the benchmark fragments, complying with GLUE 2.0, included in the Benchmark image. This also implies that a GREEN instance does not need to know the type of middleware running on the other POs.
5 Evaluating the impact of benchmarks on matchmaking
As mentioned in Section 4.1, the task of selecting the most suitable resource among those returned by the Matchmaker is usually delegated to a scheduler, which applies its optimization policies to satisfy strict requirements from the client side and to improve the overall system throughput. However, in this paper we want to highlight that, also at the matchmaker level, performance may be improved by the adoption of our multi-level benchmark description of resources. To assert the soundness of this statement we carried out a two-step experimentation.
First of all, we aimed to assess the appropriateness of the two-level benchmarking methodology. Starting from a five-resource testbed (see Table 2), we executed several benchmarks (both micro and application-specific) on it, focusing, in this paper, on four of them. Our goal has been to obtain a true picture of the actual performance offered by those resources, along different metric axes. As described in Section 5.1, quite unexpected divergences emerged, confirming the importance of carefully characterizing machines with calibrated tools. As a second step, based on the results gathered in the benchmarking phase, we simulated the matchmaker's behaviour for different operating scenarios. Indeed, as our matchmaker acts on the assumption of a precise ordering of the resources, we wanted to examine the effects of alternative orderings. To collect useful evidence, we considered three main strategies: I) random ordering; II) resources ordered with respect to a single predetermined and general-purpose value (e.g. the peak GFLOP/s), irrespective of the application type; III) resources ordered, for each application, using the corresponding application-specific benchmark.
5.1 Running the two-level benchmarks: experimental results

We tested the behaviour of each of the five resources listed in Table 2, measured with respect to the Flops micro-benchmark and the three application benchmarks introduced in Section 3.3 (IPB, ISO, HPL). Table 2 highlights the great architectural heterogeneity of our testbed, especially as regards computing capacity (N° of CPUs) and memory. Indeed, such a testbed has been chosen to reflect the technical differences normally occurring in Grid environments, and allows us to clarify how a "first sight" approach may be misleading when looking for the "best" resource in a Grid. Note that, for the sake of clarity, Table 2 omits other parameters, such as disk capacity, memory bandwidth and cache dimensions. The resources in our testbed are shared within projects and collaborative research activities. Cluster1 and paperoga are parallel machines belonging to our domain, michelangelo has been accessed within the LitBio project [52], SC1458 has been kindly provided by SiCortex [53], and the ibm blade cluster is exploited in collaboration with the DIST Department of the University of Genoa (1).

Table 2 Machines in the testbed

Machine         Proc. Type                             N° CPU/Core   Network            RAM
ibm             2 Quad Core Xeon 2.5 GHz               32            Infiniband         64 GB
michelangelo    2 AMD Opteron 275 2.2 GHz dual core    212           Gigabit Ethernet   424 GB
SC1458          proprietary                            1458          proprietary        1.9 TB
paperoga        dual 3 GHz Intel Xeon                  8             Gigabit Ethernet   16 GB
cluster1        2.66 GHz Pentium IV                    16            Gigabit Ethernet   16 GB
5.1.1 Micro-benchmark

Figure 4 depicts the results of the Flops benchmark, used to evaluate resources on the basis of the CPU metric. For each machine, Flops has been run on one CPU/core, and then the aggregate result is computed to measure the performance of the whole parallel resource [9].
Figure 4 CPU performance of the Flops benchmark (Single CPU and Aggregate values)

Figure 4 highlights that, as expected, SC1458 gives the worst single-core performance but the best aggregate performance, owing to the architectural design of the machine, built with a high number (1458) of low-power cores. Michelangelo ranks second on both the aggregate and the single-CPU values, following ibm (635 MFLOPS) in the latter measure. Analogous evaluations arise from the analysis of the other machines. The first, quite obvious, observation is that the choice of the most suitable resource should depend on the degree of parallelism of the application at hand and, as a consequence, there is no single best machine for all situations. When submitting a sequential application to our testbed, the ibm node is the most suitable, while a fine-grained parallel application may exploit the overall SC1458 aggregate power. Thus it seems clear that, even limiting ourselves to a simple benchmark like Flops, things change dramatically depending on whether resource performance is described in terms of a unique "weight" (the single-core or the aggregate power) or both values are taken into account.
(1) The resource has been granted within the IBM Shared University Research (SUR) Awards [54].
5.1.2 Application benchmarks

The final observation of the previous section is further confirmed after the execution of the three application benchmarks introduced in Section 3.3: image processing (IPB), isosurface extraction (ISO) and High-Performance Linpack (HPL). To evaluate their performance, the execution Wall Clock Time (WCT) was chosen as the common metric. Moreover, to simplify the comparison of the results, the execution time of each benchmark is normalized with respect to a base value, specifically the one related to our reference resource cluster1. We remind here that IPB and ISO are deployed as sequential code while HPL is parallel in nature.
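Assuming cluster1 as the reference resource, the normalization used in Tables 3-5 amounts to

WCT_norm(resource, benchmark) = WCT(resource, benchmark) / WCT(cluster1, benchmark)

so that cluster1 always scores 1 and values below 1 denote resources faster than cluster1 on that benchmark.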
Table 3 Normalized WCT obtained with IPB

ibm     michelangelo   SC1458   paperoga   cluster1
0.64    1.11           6.1      0.85       1
Considering our deep experience with the image processing application, we expect IPB to benefit mainly from fast single-CPU and cache memory performance. The ranking of Table 3 reflects this expectation. Indeed, combining the figures obtained with Flops and CacheBench (the latter not reported in this paper), we can observe consistently better results from ibm, paperoga, and cluster1. Note that michelangelo's high CPU ranking (see Figure 4) degrades owing to its poor memory performance. As expected, SC1458 achieved the worst performance due to its slow single cores.
Table 4 Normalized WCT obtained with ISO

ibm     michelangelo   SC1458   paperoga   cluster1
0.8     1.14           8.2      0.45       1
Table 4 reports the values related to the isosurface extraction benchmark, which mainly stresses the CPU, memory and I/O system components. For this reason, similarly to the IPB case, the single cores of SC1458 performed poorly. Considerably better results came from paperoga, which, thanks to its high disk bandwidth, outperformed cluster1 and michelangelo.
Table 5 Normalized WCT obtained with HPL

ibm            michelangelo    SC1458          paperoga       cluster1
16 p    32 p   32 p    64 p    32 p    128 p   4 p     8 p    8 p     16 p
0.54    0.34   0.41    0.29    0.24    0.07    1.43    0.92   1.89    1
Table 5 reports the values obtained with the HPL benchmark. In this case, due to the parallel nature of the benchmark, we tested the performance considering a different number of CPUs/cores for each resource, where each specific choice depended on the access policies granted to us. To normalize the results we used the 16 p configuration of our reference resource (cluster1). SC1458 considerably outperforms the other machines when increasing the parallelism degree. This agrees with the nature of HPL, which tests the entire system and benefits from a high number of processes linked by fast connections. Furthermore, SC1458 exploits a customized implementation of the BLAS library [55], whose routines are the core of the HPL benchmark. Table 6 summarizes the results reported in Figure 4 and in Tables 3-5, and shows the absolute ranking obtained for each of the five considered benchmarks.
Table 6 Ranking of testbed resources with respect to the five benchmarks (rows ordered from more performant, top, to less performant, bottom)

Flops Single CPU/core   Flops Aggregated CPU/core   HPL            ISO            IPB
IBM                     SC1458                      SC1458         Paperoga       IBM
Michelangelo            Michelangelo                IBM            IBM            Paperoga
Paperoga                IBM                         Michelangelo   Cluster1       Cluster1
Cluster1                Cluster1                    Paperoga       Michelangelo   Michelangelo
SC1458                  Paperoga                    Cluster1       SC1458         SC1458
By analysing these results we can conclude that a fine description of resources, obtained by executing different kinds of benchmarks, allows the best coupling of applications to resources. So, the additional effort of benchmarking Grid resources is worthwhile, as we will show in the next Subsection.
5.2 Running the simulation

Reassured by the consistency of our approach, as a further step in our experimentation we simulated the behaviour of the matchmaker according to the three different ranking strategies mentioned above.
As we did not have a significant workload sample related to our testbed at our disposal, we preferred to adapt an existing workload. Note that original workload traces are often modified (e.g. changing job arrival times, job type and requirements) in order to best address specific analysis needs [56,57].

5.2.1 The settings

The implemented simulator acquires, as input, a workload log and outputs three relevant metrics: the average execution time, the maximum number of waiting jobs, and the most loaded resource. The first two measures express the performance of each strategy, whereas the last gives hints about its coupling effectiveness. As regards the input data, we ran several simulations by varying the workload on the basis of the logs provided by the "Parallel Workloads Archive" [58]. This allowed us to cover a quite wide spectrum of real scenarios, which, though presenting some differences in the numerical values, led to very similar trends. Here, for brevity, we describe the results obtained using the CTC SP2 [59] log file (2). This workload is related to 11 months (July 1996 - May 1997) of accounting records for the 512-node IBM SP2 located at the Cornell Theory Center (CTC). The log contains 77,222 jobs, but only 60,553 were considered, because 17 jobs have an execution time of 0 seconds and 16,652 jobs failed during the execution. Among the considered ones, 26,657 are sequential and 33,896 parallel. The maximum execution time is 64,834 seconds, the minimum 38 seconds, and the average 9,906 seconds. For each job the log file specifies its arrival time, the amount of memory used and the parallelism degree, shown in Figure 5.
We adapted the workload to our testbed and considered the execution time of each logged job as if it was obtained on one of our testbed resources, namely the cluster1 node, which we chose as the reference machine. A second adaptation regarded the type of jobs: we decided to substitute the original job types with one of the three applications introduced in Section 3.3. The assignment of these types has been made using an even round-robin partitioning. Although such modifications have to be carefully carried out, the resulting data are in any case preferable to completely synthesised ones, as they are closer to a real workload.
Independently of the ordering strategy adopted, upon job arrival the matchmaker looks for a resource with a sufficient amount of free CPUs and memory. If several suitable resources are found, the one that ranks better with respect to the current strategy will have the job assigned. If no resource is found, the job is put into a FIFO queue. The choice of a simple queuing policy is justified by the fact that we are focusing here on matchmaking, and therefore we can disregard all aspects connected with the adoption of smarter queuing schemes, typical of a scheduler. The waiting time of a queued job is updated until it is possible to execute it.
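A sketch of this per-job assignment policy is reported below; data layout and names are hypothetical, and the queue handling is deliberately reduced to the FIFO behaviour described in the text:

# Hedged sketch of the simulated matchmaking policy: on job arrival, pick among
# the resources with enough free CPUs and memory the one ranking best under the
# current strategy; otherwise enqueue the job in a FIFO queue.
from collections import deque

waiting_jobs = deque()

def on_job_arrival(job, resources, rank_of):
    """
    job: dict with 'cpus' and 'memory' requirements.
    resources: list of dicts with 'free_cpus' and 'free_memory'.
    rank_of(resource, job): smaller value means better under the current strategy.
    """
    candidates = [r for r in resources
                  if r["free_cpus"] >= job["cpus"] and r["free_memory"] >= job["memory"]]
    if not candidates:
        waiting_jobs.append(job)        # FIFO queue; its waiting time keeps growing
        return None
    best = min(candidates, key=lambda r: rank_of(r, job))
    best["free_cpus"] -= job["cpus"]    # the job starts on the selected resource
    best["free_memory"] -= job["memory"]
    return best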
Figure 5 Degree of parallelism of jobs in the CTC SP2 workload (number of jobs per requested CPUs: 1: 26,657; 2-8: 18,104; 9-16: 7,631; 17-32: 4,984; 33-212: 3,052; >212: 125).

To experiment with the various strategies we used the different ranks of the resources of our testbed, obtained from the analysis of Section 5.1. The actual execution time of a job on a resource is computed using the values reported in the table corresponding to its type (Tables 3, 4 and 5). For example, an isosurface extraction job that requires 100 seconds on cluster1 will be executed in 45 seconds on paperoga and in 820 seconds on SC1458.
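The following sketch illustrates this scaling and the round-robin assignment of job types. The slowdown factors are hypothetical values chosen only to reproduce the isosurface example above; the real values are derived from Tables 3, 4 and 5, and the function names are illustrative.

# Hypothetical slowdown factors relative to the reference machine cluster1; the
# isosurface entries reproduce the example in the text (100 s -> 45 s on
# paperoga, 820 s on SC1458).
SLOWDOWN = {
    "isosurface": {"cluster1": 1.0, "paperoga": 0.45, "SC1458": 8.2},
    # "linear_algebra": {...}, "image_processing": {...}  (omitted here)
}

APP_TYPES = ["linear_algebra", "image_processing", "isosurface"]

def assign_type(job_index):
    # Even round-robin partitioning of the three application types.
    return APP_TYPES[job_index % len(APP_TYPES)]

def runtime_on(app_type, seconds_on_cluster1, resource_name):
    # Scale the logged execution time from cluster1 to the target resource.
    return seconds_on_cluster1 * SLOWDOWN[app_type][resource_name]

# Example: runtime_on("isosurface", 100, "SC1458") -> 820.0 seconds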
2 The log was graciously provided by Dan Dwyer ([email protected]) from the Cornell Theory Center, Cornell University, Ithaca, New York, USA.
In particular, for linear algebra applications we considered the values corresponding to 32 processes for IBM, Michelangelo and SC1458, to allow a fair comparison among the most powerful resources we used, and the values corresponding to the maximum parallelism degree for Cluster1 (16 processes) and Paperoga (8 processes).

5.2.2. Simulation results

We simulated the performance of the testbed considering the three aforementioned strategies: I) random ordering; II) a single rank associated to a general purpose benchmark; III) a different rank associated to each application-specific benchmark. In particular, for strategy II we used the Flops benchmark, considering both the ranking based on single CPU/core performance and the ranking based on the overall aggregated power. According to strategy I, in the absence of any order, a submitted job is assigned randomly to one of the free resources satisfying its requirements. The second scenario assigns a job to the free resource which ranks best with respect to the Flops micro-benchmark, irrespective of the job type. This strategy is followed, for example, in gLite, which (possibly) ranks resources only on the basis of general purpose benchmarks such as the SPEC ones [60]. These strategies are contrasted with the solution we propose, which associates to each job the free resource which performed best with regard to the associated benchmark type (i.e. having the same value for the corresponding type element in the related GLUE and JSDL documents).

Table 7 summarises the simulation results. Two explanatory notes are in order. To make the random ordering significant, we executed the simulation 200 times and averaged the collected results (with the obvious exception of the most loaded resource metric). As regards the single rank, we wanted to explore the impact of considering the Flops value of the single CPU/core as well as the machine aggregated power.

Table 7 Performance results varying the matchmaker ordering strategy

Strategy                    Max n. of waiting jobs  Average execution time (s)  Most loaded resource
Random                      7                       19,045                      ---------
Single rank - core          3                       8,842                       IBM
Single rank - aggregated    64                      47,588                      SC1458
Application-specific ranks  1                       6,637                       SC1458

A quick analysis of the presented simulation results (together with the ones obtained from other simulations using different workloads, omitted here) confirms the appropriateness of focusing on the "real" performance to match jobs with resources. Indeed the best results are obtained with the application-specific rank strategy. Considering the average execution time metric, this strategy registered 6,637 seconds, followed by the 8,842 seconds of the single rank - core strategy and the 19,045 seconds of the random strategy. Not surprisingly, the single rank - aggregated strategy gives the worst performance, as in this case 98% of the jobs were executed on SC1458, mainly for two reasons: a) it yields the best aggregated performance; b) it provides a high number of cores able to host a large number of jobs. Unfortunately SC1458 is the best choice only for linear algebra jobs, while it provides the worst performance for image processing and isosurface extraction ones, so the majority of the roughly 59,000 jobs executed on it took a long time to complete. SC1458 is however the most loaded resource also in the best scenario. In this case nearly all, and only, the linear algebra jobs, which represent one third of the workload, were executed on it, while the remaining two thirds were subdivided among the other four resources. The different loads of the resources depend heavily on the ranking strategy.
For example, in the single rank - aggregated scenario paperoga was almost completely disregarded, as it executed only 6 jobs, while in the best case it was the second most loaded resource, being the best choice for the image processing jobs. As to the maximum number of waiting jobs metric, it is worth noting that, at least for the selected workload, the length of the queue is negligible except in the worst case. This is due to the fact that about half of the jobs are sequential, and it is therefore easier to quickly find a resource providing a free CPU and the required amount of memory than it is for a parallel job.
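As an illustration, the three strategies can be seen as three different ranking functions plugged into the same dispatch logic sketched earlier. The encoding below is hypothetical; the orderings are taken from Table 6 and lower scores mean better-ranked resources.

import random

# Orderings from Table 6, from best to worst; names are illustrative.
FLOPS_SINGLE_CORE = ["IBM", "Michelangelo", "Paperoga", "Cluster1", "SC1458"]
FLOPS_AGGREGATED  = ["SC1458", "Michelangelo", "IBM", "Cluster1", "Paperoga"]
APP_SPECIFIC = {
    "linear_algebra":   ["SC1458", "IBM", "Michelangelo", "Paperoga", "Cluster1"],  # HPL
    "image_processing": ["Paperoga", "IBM", "Cluster1", "Michelangelo", "SC1458"],  # IPB
    "isosurface":       ["IBM", "Paperoga", "Cluster1", "Michelangelo", "SC1458"],  # ISO
}

def rank_random(job, resource):
    # Strategy I: random ordering among the feasible resources.
    return random.random()

def rank_single(job, resource, order=FLOPS_AGGREGATED):
    # Strategy II: one general purpose rank (single-core or aggregated Flops),
    # applied irrespective of the job type.
    return order.index(resource.name)

def rank_app_specific(job, resource):
    # Strategy III: the rank associated to the job's application-specific benchmark.
    return APP_SPECIFIC[job.app_type].index(resource.name)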
6
Related Work
The implementation of an effective and automatic mechanism for the efficient discovery of the resources that best suit a user job is one of the major problems in present Grids. The Globus toolkit does not provide resource matchmaking/brokering as a core service, but the GridWay metascheduler [61] has been included as an optional high-level service since June 2007. The main drawback of GridWay is that it allows users to specify only a fixed and limited set of resource requirements, most of them related to the queue policies of the underlying batch job systems. This choice limits the ranking of resources, and benchmarks are not considered at all. On the contrary, gLite has a native matchmaking/brokering service that takes into consideration more requirements and also benchmark values, although these are fixed and the service is based on a semi-centralized approach [62]. Even more critical in today's brokering systems is the support for interoperability. To take advantage of the availability of multiple interconnected domains (POs) based on different middleware, users are often constrained to submit alternative versions of the same job description, one for each JSL accepted by the middleware. The birth of JSDL has
been motivated mainly by this need. Unfortunately, however, its adoption in production Grids is still in its infancy. For example, EGEE acknowledges the usefulness of JSDL, but CREAM-BES [63], a web-service-based execution service developed in EGEE as part of the gLite middleware and currently in beta version, makes limited use of JSDL, in particular ignoring the Resource element. In [57] an approach aimed at assuring interoperability among different Grid systems is presented. Resource information is exchanged in an aggregated form, with a good compromise between communication efficiency and accuracy of resource description. However, the system does not employ the GLUE standard and the aggregated resource description does not include benchmark-related information.

Another approach to supplying some kind of interoperability is provided by Grid portals, which operate above the middleware and hide some middleware specificity from Grid users. P-GRADE [64] is a portal providing workflow management features which leverages the brokering functions of the GTbroker and LCG2 brokers, allowing users to develop and execute multi-grid workflows. Though a GUI helps users express job descriptions and resource requirements without the need to tackle JSL notions, the system is not completely transparent to users, who have to know in advance the broker (and hence the JSL) which manages the submission. Another solution is proposed in [65], which presents a JSDL-based portal for heterogeneous Grids. The GUI portal facilitates viewing, authoring and validating JSDL documents; however, the authors state that in its current version the only supported middleware is Globus 2.x, while support for JDL scripts is foreseen.

A way to improve the efficiency of resource discovery is to drive the search towards resources that have shown good performance in the execution of jobs with similar or known behaviour. As explained in Section 3, the characterization of Grid resources based on pre-computed benchmarks seems a valid strategy to follow. In [46] a brokering mechanism based on benchmarking Grid resources is proposed. The scope of the broker is currently focused on the ARC middleware and the NorduGrid and SweGrid production environments, and it adopts XRSL [21], an extension of RSL, to submit user jobs. The authors recognize the importance of improving the portability and interoperability of the job submission service, in order to facilitate cross-middleware job submission.
7
Conclusions
A very important issue in Grid computing is the discovery and selection of resources, so that a user can find the resources providing a good match with his/her application's needs. Grid middleware usually offers only basic services for retrieving information on resources, which are inadequate with respect to more detailed and specific user requirements. We proposed a methodology to improve the matchmaking process based on information about the performance of computational resources. Our aim is to integrate the information available via Grid information and monitoring services by annotating resources with both low-level and application-specific performance metrics.

The use of a double level of benchmarks improves the generality and usefulness of our tool. A more expert user can choose the application-driven benchmark that best expresses the requirements of his/her applications, whereas a user with little knowledge of the job to execute may prefer some general micro-benchmark. Moreover, the use of application-specific benchmarks is particularly advisable when an application is often executed on the Grid.

On these bases, we developed GREEN, a Grid service addressed both to Grid administrators and users. It assists administrators in the insertion of the benchmark information related to every PO composing the Grid, and provides users with features which a) facilitate the submission of job execution requests, by specifying both syntactic and performance requirements on resources; b) support the automatic discovery and selection of the most appropriate resources. The aim of GREEN is to discover the set of all resources satisfying the user requirements, ordered by rank. The selection phase is left to a (meta)scheduler, allowing the application of the preferred scheduling policies to satisfy specific goals.

A very important point in the design of our tool is interoperability: GREEN is able to deal with different middleware transparently to Grid users. To this end a unique standard format (JSDL with suitable extensions) is used to express job submission requirements, independently of the underlying Grid middleware. GREEN internally performs the task of translating this format into the different JSLs used by the various middleware. Since we are interested in the execution of parallel applications, we borrowed from the JSDL SPMD extension some elements related to concurrency aspects (e.g. number of processes, processes per host).

The present work mainly concerns the architectural aspects of the proposed tool, with a particular focus on the two-level benchmarking methodology, on a standard description of both user requirements (JSDL) and resources (GLUE), and on interoperability issues and the proposed solutions. We also report some experimental results, obtained using a simple ad-hoc simulator, that confirm the usefulness of the proposed approach. As shown in Section 5, the careful use of application-level benchmark information in selecting resources significantly reduces job execution times. As future work we plan to improve the simulation by taking into account the delays introduced by the overlay network connecting the various GREEN instances, one for each PO, and by comparing different strategies for neighbour search depending on the number of POs constituting the Grid platform.
Acknowledgement This work has been partially supported by the Project “Grid and High Performance Computing” of ICT Department of CNR, Research Unit ICT.P09.004 “Methodologies, Algorithms and Applications for Collaboration Grids”.
Bibliography
[1] I. Foster, C. Kesselman, The GRID 2: Blueprint for a new computer infrastructure, Elsevier, 2005.
[2] X. Bai, H. Yu, Y. Ji, D.C. Marinescu, Resource matching and a matchmaking service for an intelligent Grid, Int. Journal of Computational Intelligence 1 (3) (2004) 163-171.
[3] R.W. Hockney, The science of computer benchmarking (software, environments, tools), SIAM, Philadelphia, 1996.
[4] G. Tsouloupas, M.D. Dikaiakos, Gridbench: A tool for benchmarking Grids, in: Proceedings of the 4th Int. Workshop on Grid Computing (Grid2003), Phoenix, USA, IEEE Computer Society, 2003, pp. 60-67.
[5] G. Chun, H. Dail, H. Casanova, A. Snavely, Benchmark probes for Grid assessment, in: 18th Int. Parallel and Distributed Processing Symposium (IPDPS 2004), Santa Fe, USA, CD-ROM / Abstracts Proceedings, IEEE Computer Society, 2004.
[6] F. Nadeem, R. Prodan, T. Fahringer, A. Iosup, Benchmarking Grid applications, in: D. Talia, R. Yahyapour, W. Ziegler (Eds.), Grid middleware and services: Challenges and solutions, CoreGRID Series, Springer, 2008, pp. 19-37.
[7] J.L. Henning, SPEC CPU2000: Measuring CPU performance in the new millennium, Computer 35 (2000) 28-35 (SPEC – Standard Performance Evaluation Corporation, http://www.spec.org/).
[8] J.D. McCalpin, Memory bandwidth and machine balance in current high performance computers, IEEE Computer Society Technical Committee on Computer Architecture (TCCA) Newsletter, December 1995.
[9] P.J. Mucci, K. London, J. Thurman, The CacheBench Report, University of Tennessee, v. 19, 1998 (CacheBench Home Page, http://icl.cs.utk.edu/projects/llcbench/cachebench.html).
[10] W. Gropp, E. Lusk, Reproducible measurements of MPI performance characteristics, in: J. Dongarra, E. Luque, T. Margalef (Eds.), Recent Advances in Parallel Virtual Machine and Message Passing Interface (Proc. 6th European PVM/MPI Users' Group Meeting), Lecture Notes in Computer Science, vol. 1697, Springer, 1999, pp. 11-18 (MPPTEST - Measuring MPI Performance, http://www-unix.mcs.anl.gov/mpi/mpptest/).
[11] T. Bray, Bonnie benchmark, http://www.textuality.com/bonnie, 1988.
[12] A. Galizia, D. D'Agostino, A. Clematis, The use of PIMA(GE)2 Library for efficient image processing in a Grid environment, in: F. Davoli, N. Meyer, R. Pugliese, S. Zappatore (Eds.), Grid Enabled Remote Instrumentation, Series: Signals and Communication Technology (INGRID 2007, S. Margherita Ligure (Italy)), Springer, 2008, ISBN: 978-0-387-09662-9, pp. 511-526.
[13] A. Clematis, D. D'Agostino, V. Gianuzzi, An online parallel algorithm for remote visualization of isosurfaces, in: J. Dongarra, D. Laforenza, S. Orlando (Eds.), Recent Advances in Parallel Virtual Machine and Message Passing Interface (Proc. 10th European PVM/MPI User's Group Meeting), Lecture Notes in Computer Science, vol. 2840, Springer, 2003, pp. 160-169.
[14] A. Anjomshoaa, F. Brisard, M. Drescher, D. Fellows, A. Ly, S. McGough, D. Pulsipher, A. Savva, Job Submission Description Language (JSDL) Specification v1.0, Grid Forum Document GFD-R.056, November 2005 (http://www.ggf.org/documents/GFD.56.pdf).
[15] S. Burke, S. Andreozzi, L. Field, Experiences with the GLUE information schema in the LCG/EGEE production Grid, Journal of Physics: Conference Series 119 (CHEP'07), IOP Publishing, 2008.
[16] K. Czajkowski, I. Foster, C. Kesselman, Agreement-based resource management, Proceedings of the IEEE 93 (3) (2005) 631-643 (http://www.globus.org/toolkit/docs/2.4/gram/rsl_spec1.html).
[17] M. Feller, I. Foster, S. Martin, GT4 GRAM: A functionality and performance study, in: Proceedings of the 2007 TeraGrid Conference, 2007, Madison (USA) (http://www.globus.org/toolkit/docs/4.2/4.2.0/user/gtuser-execution.html).
[18] Job Description Language Attributes Specification for the gLite Middleware, Doc. EGEE-JRA1-TEC-555796-JDL-Attributes-v0-8, 3/5/2006 (https://edms.cern.ch/file/555796/1/EGEE-JRA1-TEC-555796-JDL-Attributes-v0-8.pdf).
[19] I. Foster, C. Kesselman, Globus: a metacomputing infrastructure toolkit, Int. Journal of High Performance Computing Applications 11 (2) (1997) 115 (www.globus.org).
[20] I. Foster, Globus Toolkit version 4: Software for service-oriented systems, in: IFIP Int. Conference on Network and Parallel Computing, Lecture Notes in Computer Science, vol. 3779, Springer, 2005, pp. 2-13 (http://www.globus.org/toolkit/docs/4.0/).
[21] Extended Resource Specification Language Reference Manual, NORDUGRID-MANUAL-4, 2010 (http://www.nordugrid.org/documents/xrsl.pdf).
[22] M. Ellert et al., Advanced resource connector middleware for lightweight computational Grids, Future Generation Computer Systems 23 (2007) 219–240.
[23] F. Gagliardi, B. Jones, M. Reale, S. Burke, European DataGrid Project: Experiences of deploying a large scale testbed for e-Science applications, in: Lecture Notes in Computer Science, vol. 2459, Springer, 2002, pp. 255-264 (The DataGrid Project: http://eu-datagrid.web.cern.ch/eu-datagrid/).
[24] F. Gagliardi, B. Jones, F. Grey, M.E. Begin, M. Heikkurinen, Building an infrastructure for scientific Grid computing: Status and goals of the EGEE project, Philosophical Transactions of the Royal Society A (Mathematical, Physical & Engineering Sciences) 363 (2005) 1729-1742.
[25] R. Raman, M. Livny, M. Solomon, Matchmaking: Distributed resource management for high throughput computing, in: Proceedings of the Seventh IEEE Int. Symposium on High Performance Distributed Computing (HPDC7), 1998 (http://www.cs.wisc.edu/condor/classad/).
[26] Job Submission Description Language WG (JSDL-WG), http://forge.gridforum.org/projects/jsdl-wg.
[27] I. Rodero, F. Guim, J. Corbal, J. Labarta, How the JSDL can exploit the parallelism?, in: Sixth IEEE Int. Symposium on Cluster Computing and the Grid (CCGRID'06, Singapore), IEEE Computer Society, 2006, pp. 275-282.
[28] A. Savva (Ed.), JSDL SPMD Application Extension, Version 1.0, Grid Forum Document GFD-R-P.115, Open Grid Forum (OGF), August 2007 (http://www.ogf.org/documents/GFD.115.pdf).
[29] J.P. Martin-Flatin, P.V.B. Primet, High-speed networks and services for data-intensive Grids: The DataTAG Project, Future Generation Computer Systems 21 (2005) 439-442.
[30] P. Avery et al., An international Virtual-Data Grid Laboratory for data intensive science, NSF Information and Technology Research Program, Proposal #0122557, April 2001 (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.9893&rep=rep1&type=pdf).
[31] S. Andreozzi, GLUE Specification v. 2.0 (rev. 3), 2009 (http://forge.gridforum.org/sf/docman/do/downloadDocument/projects.glue-wg/docman.root.drafts/doc15023).
[32] GLUE v. 2.0 – Reference realizations to concrete data models, 2008 (http://forge.gridforum.org/sf/go/doc15221?nav=1).
[33] gLite 3.1 User Guide, Doc. CERN-LCG-GDEIS-722398, 7 January 2009 (https://edms.cern.ch/file/722398/1.2/gLite-3-UserGuide.html).
[34] http://www.globus.org/toolkit/mds/glueschemalink.html.
[35] L. Carrington, A. Snavely, N. Wolter, A performance prediction framework for scientific applications, Future Generation Computer Systems 22 (2006) 336-346 (PMaC: Performance Modeling and Characterization, http://www.sdsc.edu/PmaC).
[36] The STREAM Benchmark: Computer Memory Bandwidth (www.streambench.org).
[37] R. Reussner, P. Sanders, J.L. Traff, SKaMPI: a comprehensive benchmark for public benchmarking of MPI, Scientific Programming 10 (1) (2002) 55-65 (http://liinwww.ira.uka.de/~skampi/index.html).
[38] NAS Parallel Benchmarks, http://www.nas.nasa.gov/Software/NPB.
[39] J.J. Dongarra, P. Luszczek, A. Petitet, The LINPACK benchmark: Past, present, and future, Concurrency and Computation: Practice and Experience 15 (2003) 1-18.
[40] Top500 Supercomputer Site, http://www.top500.org/.
[41] M.D. Dikaiakos, Grid benchmarking: vision, challenges, and current status, Concurrency and Computation: Practice & Experience 19 (2007) 89-105.
[42] A. Snavely, G. Chun, H. Casanova, R.F. Van der Wijngaart, M.A. Frumkin, Benchmarks for grid computing: a review of ongoing efforts and future directions, ACM SIGMETRICS Performance Evaluation Review 30 (2003) 27-32.
[43] G. Tsouloupas, M.D. Dikaiakos, GridBench: A tool for the interactive performance exploration of Grid infrastructures, Journal of Parallel and Distributed Computing 67 (2007) 1029-1045.
[44] M. Frumkin, R.F. Van der Wijngaart, NAS Grid Benchmarks: A tool for Grid space exploration, Cluster Computing 5 (2002) 315-324.
[45] R.F. Van der Wijngaart, ALU Intensive Grid Benchmarks, Global Grid Forum, Research Group in Grid Benchmarking, GWD-I, February 2004.
[46] E. Elmroth, J. Tordsson, Grid resource brokering algorithms enabling advance reservations and resource selection based on performance predictions, Future Generation Computer Systems 24 (2008) 585-593.
[47] G. Tsouloupas, M. Dikaiakos, Characterization of computational Grid resources using low-level benchmarks, in: Proceedings of the 2nd IEEE Int. Conference on e-Science and Grid Computing (e-Science'06), IEEE Computer Society, 2006, p. 70.
[48] A. Clematis, A. Corana, D. D'Agostino, V. Gianuzzi, A. Merlo, A. Quarati, A distributed approach for structured resource discovery on Grid, in: Proc. Int. Conference on Complex, Intelligent and Software Intensive Systems (CISIS 2008, Barcelona), IEEE Computer Society, 2008, pp. 117-125.
[49] M. Cai, M. Frank, J. Chen, P. Szekely, MAAN: A Multi-Attribute Addressable Network for Grid Information Services, in: Proc. 4th Int. Workshop on Grid Computing (Grid 2003, Phoenix, USA), 2003, pp. 184-191.
[50] C. Rabat, A. Bui, O. Flauzac, A Random Walk topology management solution for Grid, in: Lecture Notes in Computer Science, vol. 3908, Springer, 2006, pp. 91-104.
[51] Z. Jia, J. You, R. Rao, M. Li, Random walk search in unstructured P2P, Journal of Systems Engineering and Electronics 17 (2006) 648-653.
[52] Michelangelo Hardware, LITBIO Project (http://www.supercomputing.it/At_Cilea/michelangelo_eng.htm).
[53] http://sicortex.com.
[54] IBM Shared University Research Awards (http://www.ibm.com/developerworks/university/sur/).
[55] L.S. Blackford, J. Demmel, J. Dongarra, I. Duff, S. Hammarling, G. Henry, M. Heroux, L. Kaufman, A. Lumsdaine, A. Petitet, R. Pozo, K. Remington, R.C. Whaley, An updated set of Basic Linear Algebra Subprograms (BLAS), ACM Transactions on Mathematical Software 28 (2002) 135-151.
[56] K. Aida, H. Casanova, Scheduling mixed-parallel applications with advance reservations, Cluster Computing Journal 12 (2009) 205-220.
[57] I. Rodero, F. Guim, J. Corbalan, L. Fong, S.M. Sadjadi, Grid broker selection strategies using aggregated resource information, Future Generation Computer Systems 26 (2010) 72-86.
[58] Parallel Workloads Archive (http://www.cs.huji.ac.il/labs/parallel/workload/), 2005.
[59] S. Hotovy, Workload evolution on the Cornell Theory Center IBM SP2, in: D.G. Feitelson, L. Rudolph (Eds.), Job Scheduling Strategies for Parallel Processing, Lecture Notes in Computer Science, vol. 1162, Springer, 1996, pp. 27-40.
[60] A. Kretsis, P.C. Kokkinos, E.A. Varvarigos, Developing scheduling policies in gLite Middleware, in: Proc. 9th IEEE/ACM Int. Symposium on Cluster Computing and the Grid (CCGRID 2009, Shanghai, China), IEEE Computer Society, 2009, pp. 20-27.
[61] E. Huedo, R.S. Montero, I.M. Llorente, A modular meta-scheduling architecture for interfacing with pre-WS and WS Grid resource management services, Future Generation Computer Systems 23 (2007) 252–261.
[62] T. Glatard, D. Lingrand, J. Montagnat, M. Riveill, Impact of the execution context on Grid job performances, in: Proc. Int. Workshop on Context-Awareness and Mobility in Grid Computing (WCAMG'07, CCGrid'07, Rio de Janeiro, Brazil), IEEE, 2007, pp. 713-718.
[63] P. Andreetto, S. Andreozzi et al., Standards-based job management in Grid systems, Journal of Grid Computing 8 (2010) 19–45.
[64] P. Kacsuk, T. Kiss, G. Sipos, Solving the Grid interoperability problem by P-GRADE portal at workflow level, Future Generation Computer Systems 24 (2008) 744-751.
[65] D. Meredith, M. Maniopoulou, A. Richards, M. Mineter, A JSDL application repository and artefact sharing portal for heterogeneous Grids and the NGS, in: Proceedings of the UK e-Science All Hands Meeting, Nottingham, UK, 2007, pp. 110-118, ISBN 978-0-9553988-3-4.