Programming Grid Applications with GRID superscalar

Rosa M. Badia, Jesús Labarta, Raül Sirvent, Josep M. Pérez, José M. Cela and Rogeli Grima
CEPBA-IBM Research Institute, UPC, Spain

January 9, 2004

Abstract. The aim of GRID superscalar is to reduce the development complexity of Grid applications to the minimum, so that writing an application for a computational Grid can be as easy as writing a sequential application. Our assumption is that Grid applications will in many cases be composed of tasks, most of them repetitive. The granularity of these tasks will be at the level of simulations or programs, and the data objects will be files. GRID superscalar allows application developers to write their applications in a sequential fashion. The requirements to run such a sequential application on a computational Grid are the specification of the interface of the tasks that should run on the Grid, calls to the GRID superscalar interface functions at some points, and linking with the run-time library. GRID superscalar provides an underlying run-time that detects the inherent parallelism of the sequential application and performs concurrent task submission. In addition to a data-dependence analysis based on the input/output task parameters that are files, techniques such as file renaming and file locality are applied to increase application performance. This paper presents the current GRID superscalar prototype based on Globus Toolkit 2.x, together with examples and a performance evaluation of some benchmarks.

Keywords: Grid programming models, Grid middleware

1. Introduction

Grid computing is becoming a very important research and development area in this decade. However, one of the important concerns of the Grid community is whether or not a killer application will appear. This concern comes partially from the difficulty of writing applications for a computational Grid. Although skilled programmers may be willing and able to write applications with complex programming models, scientists usually expect easy programming methodologies that allow them to develop their applications with both flexibility and ease of use. Furthermore, different scientific communities (high-energy physics, gravitational-wave physics, geophysics, astronomy, bioinformatics and others) deal with applications with large data sets which are built not as monolithic codes but as compositions of standalone application components that can be combined in different ways.

(This work has been funded by the Ministry of Science and Technology of Spain under CICYT TIC2001-0995-CO2-01.)

Examples of this kind of application can be found in


the field of astronomy, where thousands of tasks need to be executed during the identification of galaxy clusters [22]. These kinds of applications can be described as workflows. Different tools for the development of workflow-based applications for Grid environments have recently been presented in the literature [22, 23]. However, in all of them the user has to specify the task dependence graph in a non-imperative language. The goal of this paper is to present GRID superscalar, a programming paradigm that eases the development of Grid applications to the point that writing such an application can be as simple as writing a sequential program for a single processor, while the hardware resources remain totally transparent to the programmer. GRID superscalar takes advantage of the way in which superscalar processors execute assembler codes [1]. Even though the assembler codes are sequential, the implicit parallelism of the code is exploited in order to take advantage of the functional units of the processor. The processor explores the concurrency of the instructions and assigns them to the functional units. Indeed, the execution order defined in the sequential assembler code may not be followed; the processor establishes the mechanisms that guarantee that the results of the program remain identical. As long as the result of the application is the same, and all the more so if the achieved performance is better than what would otherwise have been obtained, programmers are freed of any concern with the matter. Another mechanism exploited by processors is the forwarding of data generated by one instruction that is needed by subsequent ones, which reduces the number of stall cycles. All these ideas are exportable to the Grid application level. What changes is the level of granularity: in processors we have instructions lasting on the order of nanoseconds, while in computational Grids we have functions or programs that may last from a few seconds to hours. Also, what changes are the objects: in assembler the objects are registers or memory positions, while in GRID superscalar we deal with files, much as in scripting languages. In GRID superscalar, applications are described in an imperative language (currently C/C++ or Perl), and the inherent parallelism of the tasks specified in the sequential code is exploited by the run-time, which is totally transparent to the application programmer. This paper presents these ideas and a prototype that has been developed over Globus Toolkit 2.x [2].

1.1. GRID SUPERSCALAR BEHAVIOR AND STRUCTURE

GRID superscalar is a new programming paradigm for Grid-enabling applications, composed of an interface and a run-time. With GRID superscalar, a sequential application composed of tasks of a certain granularity is automatically run on a computational Grid: the run-time detects the data dependencies between tasks and submits independent tasks concurrently to the available servers.
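As a minimal illustration (not code from the prototype: the Simulate task, the file names, the generated headers and the zero-argument forms of GS_On/GS_Off are all assumptions made here), a GRID superscalar master program looks like ordinary sequential C, with the Filter stub of Figure 6 called in a plain loop:

/*
 * Hypothetical sketch of a GRID superscalar master program. Filter() is the
 * stub shown in Figure 6; Simulate(), the file names and both headers are
 * illustrative assumptions, not taken from the paper's sources.
 */
#include <stdio.h>
#include "app.h"            /* assumed: stubs generated from the IDL file   */
#include "GS_master.h"      /* assumed: declarations of GS_On()/GS_Off()    */

int main(void)
{
    char cfg[32], out[32];
    int i;

    GS_On();                            /* start the GRID superscalar run-time */
    for (i = 0; i < 12; i++) {
        sprintf(cfg, "cfg_%d.txt", i);
        sprintf(out, "out_%d.txt", i);
        /* Each call looks sequential; the run-time sees that the 12
           iterations only depend on different files and can therefore
           submit them concurrently to the available Grid servers. */
        Filter("reference.cfg", 1.0e-6, 100.0, cfg);
        Simulate(cfg, out);
    }
    GS_Off();                           /* wait for all tasks and shut down */
    return 0;
}

Because the run-time, and not the programmer, decides where and when each task instance runs, the hardware resources stay transparent, exactly as in the superscalar-processor analogy of the Introduction.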


Instead of using a pipe, a socket is opened when a file is opened for writing inside a task. When the same file is opened for reading in another task, the other side of the socket is opened. The write/read operations do not need to be substituted, since the writing task writes into the socket and the reading task reads from the socket. However, in the implementation the writing task also writes the data into the file, since this forwarding mechanism is intended to be transparent to the programmer. The mechanism has been implemented by means of dynamic interception of the open, close, read and write operations using Dyninst [6]. The scheduling scheme is slightly modified by this mechanism: a task (T2) that has a RaW data dependence with a running task (T1) is now started when T1 opens the file responsible for the data dependence, and the two tasks then run concurrently. Although this increases the degree of concurrency of the application, care has to be taken since deadlock situations may arise. The forwarding mechanism is currently under development and has not been used in the experiments detailed in the next section. However, initial experiments show that the instrumentation adds a lot of


overhead, reducing the expected performance increase. Consequently, we are studying other ways of implementing the file forwarding mechanism.
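For concreteness, the following is a minimal sketch of the forwarding semantics described above (not the Dyninst-based interception and not code from the prototype; fwd_open, fwd_write, the port argument and the loopback endpoint are hypothetical): every byte written by the producer goes both to the real output file, which keeps the mechanism transparent, and to a socket on which the consumer task is already reading, so the consumer can start before the file is closed.

/*
 * Sketch of the forwarding semantics only. Error handling, host/port
 * negotiation and the read side are omitted; the helper names are
 * hypothetical.
 */
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

static int fwd_sock = -1;   /* socket towards the consumer task */

/* Open the output file and, in parallel, the forwarding socket. */
int fwd_open(const char *path, int port)
{
    struct sockaddr_in addr;
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

    fwd_sock = socket(AF_INET, SOCK_STREAM, 0);
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);  /* consumer endpoint assumed local */
    connect(fwd_sock, (struct sockaddr *)&addr, sizeof(addr));
    return fd;
}

/* Duplicate each write: the file keeps forwarding transparent, the socket
   lets the dependent task consume the data early. */
ssize_t fwd_write(int fd, const void *buf, size_t count)
{
    if (fwd_sock >= 0)
        write(fwd_sock, buf, count);
    return write(fd, buf, count);
}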

4. Results and performance analysis

Several examples have already been implemented with GRID superscalar, although the current version can still be considered a prototype. We have selected two examples for this paper to analyze performance: first, a very simple example that allows us to show the details of an application written with the GRID superscalar paradigm, and second, the NAS Grid Benchmarks. In this section we present some of the results obtained. We have instrumented the run-time to obtain Paraver tracefiles [29], and a performance analysis has also been done. We present the results of the performance analysis for two of the cases, and details of an interesting bioinformatics application are given.

4.1. SIMPLE OPTIMIZATION EXAMPLE

The simple optimization example was described above in section 2. Some results of this example are shown in Table I. The results were obtained by setting the MAX_ITERS parameter of the application to 5 and the ITERS parameter to 12, giving the application a maximum parallelism of 12 and a total of 180 generated remote tasks. Two different server machines were used: Khafre, an IBM xSeries 250 with 4 Intel Pentium III processors, and Kadesh8, a node of an IBM Power4 system with 4 processors. As client machine, a Linux-based PC was used. In each case, a maximum number of tasks that could be sent to each machine was set. Column Machine describes the server or servers used in each case, column # max tasks the maximum number of concurrent processes allowed in each server, and column Elapsed time the measured execution time. The number of tasks per server was set to a maximum of 4, since the nodes we were using have 4 processors each. For the single-machine executions, the execution time scales with the number of processes, although it scales better on Khafre than on Kadesh8. When using both servers, we obtained execution times between the time obtained in Khafre and the time obtained in Kadesh8 with the same number of tasks. For example, when using 2 tasks in each server, the elapsed time is between the elapsed times obtained in Khafre and Kadesh8 with 4 tasks. In this case, of the 180 tasks executed in the benchmark, 134 were scheduled on the faster server (Khafre) and 46 on the slower one (Kadesh8).


Table I. Execution times for the simple optimization example

Machine             # max tasks    Elapsed time
Khafre              4              11 min 53 s
Khafre              3              14 min 21 s
Khafre              2              20 min 37 s
Khafre              1              39 min 47 s
Kadesh8             4              27 min 37 s
Kadesh8             3              28 min 27 s
Kadesh8             2              34 min 51 s
Kadesh8             1              48 min 31 s
Khafre + Kadesh8    4+4            8 min 45 s
Khafre + Kadesh8    2+2            15 min 11 s
Khafre + Kadesh8    1+1            24 min 33 s
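As a quick check on the scaling claim (our own arithmetic from Table I, not a figure reported by the authors), the speedups with 4 concurrent tasks relative to 1 are roughly

\[
S_{\text{Khafre}} = \frac{2387\ \text{s}}{713\ \text{s}} \approx 3.3,
\qquad
S_{\text{Kadesh8}} = \frac{2911\ \text{s}}{1657\ \text{s}} \approx 1.8,
\]

which is consistent with the observation that the example scales considerably better on Khafre than on Kadesh8, and that adding the second server (8 min 45 s with 4+4 tasks) still pays off with respect to the best single-server run.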

4.2. NAS GRID BENCHMARKS

The NAS Grid Benchmarks (NGB, [3]), which are based on the NAS Parallel Benchmarks (NPB), have recently been specified. Each NGB is a Data Flow Graph (DFG) where each node is a slightly modified NPB instance (BT, SP, LU, MG or FT), each defined on a rectilinear discretization mesh. Like the NPB, an NGB data flow graph is available for different problem sizes, called classes. Even within the same class there are different mesh sizes for each NPB solver involved in the DFG. In order to use the output of one NPB solver as input for another, an interpolation filter is required; this filter is called MF. Four DFGs are defined, named Embarrassingly Distributed (ED), Helical Chain (HC), Visualization Pipe (VP) and Mixed Bag (MB). Each of these DFGs represents an important class of Grid applications.

1. ED represents a parameter study, formed by a set of independent runs of the same program with different input parameters. In this case there are no data dependencies between NPB solvers.

2. HC represents a repeating process, such as a set of flow computations that are executed one after another. In this case an NPB solver cannot start before the previous one ends.

3. VP represents a mix of flow simulation, data postprocessing and data visualization. There are dependencies between successive iterations of


Table II. Execution times for the NAS Grid Benchmarks MB and VP on an IBM Power4 node (Kadesh8).

Benchmark    1 task      2 tasks     3 tasks     4 tasks
MB.S         294.14 s    166.57 s    152.05 s    154.15 s
MB.W         543.31 s    298.14 s    223.60 s    225.15 s
VP.S         310.16 s    280.52 s    252.89 s    248.05 s
VP.W         529.65 s    336.96 s    339.55 s    339.08 s

the flow solver and the visualization module. Moreover, there is a dependence between the solver, the postprocessor and the visualization module within the same iteration. BT acts as flow solver, MG as postprocessor and FT as visualization module.

4. MB is similar to VP, but introduces asymmetry in the data dependencies.

Figure 10 shows the DFG of the four benchmarks for class S. A paper-and-pencil specification is provided for each benchmark. The specification is based on a script file that executes the DFG in sequence on the local host. For each benchmark a verification mechanism for the final data is provided. The developer has the freedom to select the implementation mechanism. We have implemented the benchmarks using the GRID superscalar prototype. However, a modification of the NPB instances was needed to allow GRID superscalar to exploit all its functionalities. We modified the code of the NPB instances in such a way that the names of the input/output files are passed as input parameters. In the original code, each NPB instance generates these names internally, in such a way that they are different in each execution of the same NPB program. With our modification we can reuse the same file name in different iterations, and the GRID superscalar prototype guarantees the proper execution using the renaming feature. In this way, the NGB main program is much simpler (see the sketch below). We ran all the benchmarks in the same testbed as the previous example, which allows us to validate GRID superscalar as an operational system for developing Grid applications. The maximum parallelism of ED is 9 for all of these classes; HC is totally sequential; and for MB and VP the maximum task parallelism is 3. Tables II and III show the results for the VP and MB benchmarks when run with classes S and W. The benchmarks were run assigning from 1 to 4 tasks to each server.
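Before looking at the measurements, the following minimal sketch shows how such a chain could be expressed once the file names are parameters (the SolveBT/SolveSP/SolveLU wrappers, the headers and the file names are hypothetical, not the actual NGB sources): the same two file names are reused in every iteration, and the run-time's renaming keeps the executions correct.

/*
 * Hypothetical Helical Chain-style sequence under GRID superscalar. Each
 * task reads the file written by the previous one, so the run-time derives
 * the chain of dependencies automatically; reusing "in.data"/"out.data"
 * across iterations is handled by the renaming feature.
 */
#include "ngb.h"            /* assumed: generated stubs for the solver tasks */
#include "GS_master.h"      /* assumed: declarations of GS_On()/GS_Off()     */

int main(void)
{
    int iter;

    GS_On();
    for (iter = 0; iter < 3; iter++) {
        SolveBT("in.data", "out.data");
        SolveSP("out.data", "in.data");
        SolveLU("in.data", "out.data");
    }
    GS_Off();
    return 0;
}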


Table III. Execution times for the NAS Grid Benchmarks MB and VP. Machine Khafre.

Benchmark    1 task       2 tasks      3 tasks      4 tasks
MB.S         243.70 s     133.25 s     96.46 s      110.69 s
MB.W         591.88 s     321.55 s     222.18 s     225.98 s
VP.S         330.56 s     248.07 s     252.77 s     251.87 s
VP.W         518.40 s     310.39 s     320.21 s     320.32 s
VP.A         1663.21 s    1243.04 s    1267.77 s    1128.61 s

Table IV. Execution times for the NAS Grid Benchmarks. Machines Kadesh8 and Khafre used simultaneously.

Benchmark    1+1 task    1+2 tasks    2+1 tasks
MB.S         278.55 s    172.85 s     209.92 s
MB.W         348.72 s    314.83 s     365.99 s
VP.S         317.88 s    314.54 s     314.96 s
VP.W         373.85 s    249.11 s     232.14 s

The times reported in the tables are averages over several executions, since the total execution time can vary by more than 10% from one execution to another. MB scales as expected with the number of tasks on both servers when each is used alone. VP does not scale as nicely with the number of tasks; a performance analysis of this benchmark is reported later in this section. When both servers are used, the times do not scale as expected either; the reason for this behavior was analyzed and is explained at the end of this section.

4.3. BIOINFORMATICS EXAMPLE

Another example that has been programmed with the GRID superscalar prototype is a bioinformatics application currently under development. The application compares the DNA of the mouse with the DNA of humans. In order to perform this comparison, both DNAs must be split into several files, and then each file of the mouse set has to be compared with each file of the human set. The BLAST application is used to compare the DNAs.


A previous version of the application was based on Perl and used LoadLeveler-specific functionalities. This application was ported to GRID superscalar using the C/C++ interface. The use of GRID superscalar has simplified the programming of the application, and we plan to use the GRID superscalar version of the application for production in our systems. The numbers regarding the porting of this application to GRID superscalar are impressive: the number of lines was reduced to 10% of the original Perl version, and the development time was cut in half, including the time spent learning GRID superscalar. This experience also gave us a lot of feedback from the users and motivated the implementation of the Perl interface.

4.4. PERFORMANCE ANALYSIS

In order to perform a performance analysis of the benchmarks, the GRID superscalar run-time was instrumented to generate Paraver tracefiles. Paraver [29] is a performance analysis and visualization tool which has been developed at CEPBA for more than 10 years. It is a very flexible tool that can be used to analyze a wide variety of applications, from traditional parallel applications (MPI, OpenMP or mixed) to web applications. The instrumentation of GRID superscalar is at a preliminary stage, but it already facilitates the performance analysis process. The traces generated for the GRID superscalar applications cover only the client side; we are considering tracing the whole application in the future. However, to take the overhead of Globus into account, time measurements of the duration of the server tasks were also performed. To generate the traces for the GRID superscalar applications, two kinds of elements (the basic building blocks of Paraver tracefiles) were used: states and events. The state of GRID superscalar can be, for example, user (when running user code), Execute (when running run-time primitives), and so on. Additionally, events were inserted to indicate different situations: beginning/end of a callback function, task state changes (job request, active, done...), file open/close, and others.

4.4.1. NAS Grid Benchmark VP, size W
Some of the results for the NGB benchmarks presented in Tables II, III and IV seem unreasonable at first glance, for example those of the VP benchmark, class W, when run on the IBM Power4 node. As the maximum parallelism of this benchmark is 3, it is not surprising that no benefit is obtained with 4 tasks. However, we expected to get better performance with 3 tasks than with 2, therefore these two cases were re-run and Paraver tracefiles were obtained. Table V shows the time the application spends in each state for the different runs:


Table V. General performance information for the NAS Grid Benchmark VP, class W

Task #     User       Execute      GS_On/GS_Off    Barrier    TOTAL
4 tasks    0.002 s    325.579 s    13.494 s        11 s       339.08 s
3 tasks    0.002 s    347.159 s    23.652 s        12 s       370.81 s
2 tasks    0.002 s    318.360 s    28.583 s        11 s       341.946 s

It is observed that the main part of the execution of the application is spent in the Execute function. A more detailed analysis shows that there are 16 different Execute bursts in the tracefiles, one for each of the tasks of the graph (see Figure 10) plus one for a GS_FOpen performed at the end of the benchmark. The next step was to identify, for each task, the time invested in each step. Figure 11 shows, for each task, the elapsed time of the different steps (except for the last GS_FOpen task, which is executed locally and therefore inserts no Globus events in the tracefile). The figure shows for each task:

Request to Active: the time elapsed between the client issuing the job request and the callback notifying that the job is in the active state.

Active to Done: the time elapsed between the callback notifying that the job is in the active state and the callback notifying that the job has ended.

Task duration: the elapsed time of the task measured on the server side (this time is included in the Active to Done time, but it is indicated separately to outline the difference).

It is observed that, for each task, the Request to Active time is on average 3.86 seconds and the Active to Done time 32.06 seconds, whereas the average elapsed time of the tasks on the server is only 1.03 seconds. The Active to Done time averages around 30 seconds, which matches the GRAM Job Manager polling interval; this has been reported before in other works [35]. The polling interval can be changed by editing the GRAM sources; however, if the granularity of the tasks is large enough, this polling interval is reasonable.
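As a back-of-the-envelope quantification (our own arithmetic on the averages above, not a figure taken from the measurements), the Globus-related overhead per task is roughly

\[
t_{\text{overhead}} \approx \underbrace{3.86\ \text{s}}_{\text{Request to Active}} + \underbrace{32.06\ \text{s}}_{\text{Active to Done}} - \underbrace{1.03\ \text{s}}_{\text{task duration}} \approx 34.9\ \text{s},
\]

more than thirty times the roughly one second of useful computation per task, so for tasks of this granularity the elapsed time is dominated by job submission and polling rather than by the NPB solvers themselves.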


Regarding the performance difference between the 2-task and 3-task cases, the corresponding schedules were examined in some detail. Although the VP data flow graph has a maximum parallelism of 3, this maximum parallelism is only achievable in part of the graph. With a correct schedule, the same performance can be achieved with 2 tasks as with 3. The reason why GRID superscalar does not scale when using 3 tasks is that the schedule with 2 tasks is already good enough to obtain the maximum performance.

4.4.2. NAS Grid Benchmark VP, size S with 2 servers
In this section we describe the results of the analysis of NAS VP, class S, when run with two servers. The results shown in Table IV when one task is assigned to each server are worse than the case when one task is assigned to server Khafre. In this case, we directly analyzed the task schedule. Figure 12 shows the assignment of each VP task to a server. The tasks in light color have been assigned to Khafre, the faster server, and the tasks in dark color have been assigned to Kadesh8. A dashed line between tasks assigned to different servers indicates that a file transfer is required between them. Regarding the assignment, we consider it correct, since GRID superscalar has assigned more tasks (and the tasks on the critical path) to the faster server, and the number of file transfers between the servers is very low (only 3). Figure 13 shows the elapsed time for each task on the two servers. In this case, the Request to Active time differs depending on the server the task is assigned to: in Khafre the average is 2.4 seconds and in Kadesh8 it is 4.1 seconds. The overhead of the transfer from Kadesh8 to Khafre for tasks 9 and 14 puts the Request to Active time of these two tasks above the Khafre average (6 seconds and 5.8 seconds, respectively). Also, for task 6, which receives a file from task 0, the Request to Active time is above the Kadesh8 average (4.6 seconds). The Active to Done time is again around 30 seconds in almost all cases, except for two tasks for which it is around 1.5 seconds; in those two cases the poll occurred much earlier than in the other cases and the end of the task was detected after a much shorter time. Finally, Figure 14 shows the task schedule for the tasks on each server. The plots are Paraver windows: the window at the top shows the tasks assigned to Khafre and the one at the bottom the tasks assigned to Kadesh8. The segments in dark color represent the Request to Active state, during which the file transfers (if necessary) are performed. The segments in white represent the Active to Done state. The file transfers between the servers have been highlighted with bright light-colored lines. Since the overhead of the file transfers is no more than 15 seconds and the schedule is appropriate, it is at first difficult to understand the low performance achieved with this benchmark when using two servers. A comparison with the one-server version finally reveals the reason: to allow correct execution across the two


servers, the benchmarks are run in ASCII mode. For example, VP.S, when run alone in Khafre with the two-server (ASCII mode) configuration, takes 312.14 seconds, which is above the 280.52 s obtained in binary mode. Likewise, MB.S, when run alone in Khafre with the two-server (ASCII mode) configuration, takes 223.46 seconds, again above the 166.57 seconds obtained in binary mode.

5. Related work

Some of the ideas presented in this paper are related to previous work developed by the group in the EU projects PEMPAR, PARMAT and ASRA-HPC. In those projects the PERMAS code was parallelized by means of PTM, a tool that totally hides parallelization from the higher-level algorithms [8, 9, 10]. With PTM, an operation graph was asynchronously built and executed on top of blocked submatrix operations. A clustering algorithm distributed the work, performing dynamic load balancing and exploiting data locality so that the communication on the network was kept to a minimum. Furthermore, a distributed data management system allowed free data access from each node. Above PTM, the sequential and parallel code was identical: when an application runs on a parallel machine, PTM performs an automatic run-time parallelization.

Some similarities can be found with the commercial tool ST-ORM [13], although the latter is mainly oriented to stochastic studies. In ST-ORM one can define a task graph with dependences, where each task can be anything from a script to a crash simulation; ST-ORM handles job submission and result collection on heterogeneous Grids. Again, the difference is that the graph and the dependences must be defined by the user. The work presented here also has similarities with the workflow language BPEL4WS [11] and other similar ones, as proposed in the Web Services Choreography Working Group [12]. However, in these languages what can be defined is the graph, with the dependences already described. Also, these languages are oriented to medium-size graphs, while our system may handle really huge task graphs that are automatically generated.

An approach to the automatic generation of job workflows for the Grid is presented in [22]. Two workflow generators are proposed: CWG (Concrete Workflow Generator), which maps an abstract workflow defined in terms of application-level components to the set of available resources, and ACWG (Abstract and Concrete Workflow Generator), which not only performs an abstract-to-concrete mapping but also explores the different application components based on the application metadata attributes. The current implementation of the workflow generators is deployed in the Pegasus system, integrated with Chimera [26]. Chimera uses an abstract description of components as input. Pegasus receives an abstract workflow from Chimera and uses CWG (or ACWG) to


produce a concrete workflow. Pegasus then generates an input to DAGMan [24] for execution and monitors the jobs. Another approach based on job workflow descriptions is Triana [23]. Triana is a visual programming environment that allows the user to describe applications by dragging and dropping their components and connecting them together to build a workflow graph. In the EC project GridLab, Triana is being extended to exploit Grids for many different workflow applications, through extensions to the Triana data-flow programming environment.

All the previous workflow-based systems have in common that the workflow application (components and dependencies) must be specified by the user. Also, to the authors' knowledge, GRID superscalar is the only system that uses an imperative language as the means of describing the implicit job workflow. Together with the automatic detection of task concurrency, this makes the GRID superscalar approach very innovative.

Recently the GGF Grid Remote Procedure Call Working Group (GridRPC-WG) [19] has specified a Grid RPC API. Examples of implementations of that API are Ninf-G [15] and GridSolve [30]. Although these systems help to gridify applications and parallelism can be expressed, concurrency is not automatically detected: the user has to insert explicit asynchronous calls and synchronization points. Also, the server in these systems is assumed to be a remote library, which can be already installed or installed by the user. As with the workflow-based systems, GRID superscalar solves a higher-level problem, and in all cases the referenced systems could be used as underlying technology for specific versions of GRID superscalar.

Other Grid programming environments can be found in the literature. Satin [25] is a Java-based programming model for the Grid which allows divide-and-conquer parallelism to be expressed. Satin uses marker interfaces to indicate that certain method invocations need to be considered for potentially parallel (spawned) execution. Moreover, synchronization is also explicitly marked whenever it is required to wait for the results of parallel method invocations. Satin implements a very efficient load balancing algorithm which automatically adapts both to heterogeneous processor speeds and to varying network performance. Javelin 2.0 presents a branch-and-bound computational model where the user partitions the total amount of work [31]. ATLAS provides a global computing model [34], based on Java and on the Cilk programming model, that is best suited to tree-based computations. ATLAS ensures scalability using a hierarchy of managers; its current implementation uses native libraries, which may raise some portability problems. AppLeS is a framework for adaptively scheduling applications on the Grid [33]. It also provides a template for master-worker applications, and was also considered as a possible basic middleware for GRID superscalar.


Although Globus has initially been used as the basic middleware for GRID superscalar, it can be adapted to other middlewares: Javelin [32] (Java based), which focuses its global computing effort on ease of participation; Condor [17], which was used in an earlier prototype of GRID superscalar; or the Grid Application Toolkit [18], which will provide an API for upper layers.

6. Open issues, future work and some conclusions

This paper presents the ideas of GRID superscalar, a programming paradigm for Grid applications. We have demonstrated that there is a viable way to ease the programming of Grid applications. GRID superscalar allows us to make use of the resources in the Grid by exploiting the existing parallelism of applications at the program level. However, this paper does not present a finished piece of work; rather, it introduces the GRID superscalar ideas and an initial prototype based on Globus. The current version of GRID superscalar is statically installed in the hosts by the application programmer. Part of the ongoing work is devoted to a graphical interface that allows an automatic static deployment of the client and server binaries; it will also help the programmer to generate the configuration files for the broker. As future work, we foresee new language bindings, for example shell script. Enhancements in the run-time performance must also be implemented; these enhancements will be guided by the performance analysis of the current and new benchmarks. The broker must be improved as well, allowing it to interface with the GRIS/GIIS services. Another part of the ongoing work is the implementation of an OGSA-oriented resource broker, based on Globus Toolkit 3.x. Finally, we foresee GRID superscalar as something not bound to a single underlying middleware, as it currently is to Globus 2.x, but as something that can interact with workers specified in different underlying systems: services specified with Globus Toolkit 3, BPEL4WS, CORBA components [14] or others. In such cases, the behavior of the tasks would be described by the user in a given language, and the worker would be a service in WSDL or another language automatically generated by GRID superscalar.

References

1. J.E. Smith and G.S. Sohi, The microarchitecture of superscalar processors, Proc. of the IEEE, Vol. 83, No. 12, pp. 1609–1624, 1995.
2. The Globus project, www.globus.org
3. R. F. Van der Wijngaart, M. Frumkin, NAS Grid Benchmarks: A Tool for Grid Space Exploration, Proc. of the 10th IEEE International Symposium on High Performance Distributed Computing (HPDC-10'01), San Francisco, USA, August 7-9, 2001, pp. 315–324.


4. J. L. Hennessy, D. A. Patterson, D. Goldberg, Computer Architecture: A Quantitative Approach, Morgan Kaufmann Publishers, San Mateo, CA, 2002.
5. Resource Management Client API 3.0, www-unix.globus.org/api/c-globus2.2/globus_gram_documentation/html/index.html
6. Dyninst: An Application Program Interface (API) for Runtime Code Generation, http://www.dyninst.org/
7. Dimemas, http://www.cepba.upc.es/dimemas/
8. Markus Ast, Hartmut Manz, Jesús Labarta, A. Perez, J. Solé, Uwe Schulz, A general approach for an automatic parallelization applied to the finite element code PERMAS, HPCN Europe 1995, Milan, Italy, May 3-5, 1995, pp. 844–849.
9. Uwe Schulz, Markus Ast, Jesús Labarta, Hartmut Manz, A. Perez, J. Sole, Experiences and Achievements with the Parallelization of a Large Finite Element System, HPCN Europe 1996, Brussels, Belgium, April 15-19, 1996, pp. 82–89.
10. Markus Ast, Cristina Barrado, José M. Cela, Rolf Fischer, Jesús Labarta, Óscar Laborda, Hartmut Manz, Uwe Schulz, Sparse Matrix Structure for Dynamic Parallelisation Efficiency, Euro-Par 2000, Munich, Germany, August 29 - September 1, 2000, pp. 519–526.
11. Business Process Execution Language for Web Services version 1.1, http://www-106.ibm.com/developerworks/webservices/library/ws-bpel/
12. Web Services Choreography Working Group, http://www.w3.org/2003/01/wscwgcharter
13. ST-ORM, Stochastic optimization and robustness management, http://www.easi.de/storm/
14. CORBA Components Model, http://www.j2eeolympus.com/J2EE/CCM/CCM.html
15. Ninf-G, http://ninf.apgrid.org
16. "MW Homepage", http://www.cs.wisc.edu/condor/mw/
17. "Condor Project Homepage", http://www.cs.wisc.edu/condor/
18. GridLab: A Grid Application Toolkit and Testbed, http://www.gridlab.org
19. GGF Grid Remote Procedure Call Working Group, http://graal.ens-lyon.fr/GridRPC
20. David M. Beazley, David Fletcher, Dominique Dumont, "Perl Extension Building with SWIG", O'Reilly Perl Conference 2.0, San Jose, California, August 17-20, 1998.
21. S. Shepler, B. Callaghan, D. Robinson, R. Thurlow, C. Beame, M. Eisler, and D. Noveck, Network File System (NFS) version 4 Protocol, Request for Comments 3530, April 2003.
22. E. Deelman, J. Blythe, Y. Gil, C. Kesselman, G. Mehta, K. Vahi, K. Blackburn, A. Lazzarini, A. Arbree, R. Cavanaugh, and S. Koranda, "Mapping Abstract Complex Workflows onto Grid Environments", Journal of Grid Computing 1: 25–39, 2003.
23. I. Taylor, M. Shields, I. Wang, R. Philp, Distributed P2P Computing within Triana: A Galaxy Visualization Test Case, in Proceedings of IPDPS 2003, Nice, France, April 22-26, 2003.
24. J. Frey, T. Tannenbaum et al., "Condor-G: A Computation Management Agent for Multi-Institutional Grids", Cluster Computing, Vol. 5, pp. 237–246, 2002.
25. R. van Nieuwpoort, J. Maassen, T. Kielmann, H. E. Bal, "Satin: Simple and Efficient Java-based Grid Programming", AGridM 2003, Workshop on Adaptive Grid Middleware, New Orleans, Louisiana, USA, September 28, 2003, pp. 38–48.
26. I. Foster, J. Voeckler, et al., "Chimera: A Virtual Data System for Representing, Querying, and Automating Data Derivation", 14th International Conference on Scientific and Statistical Database Management (SSDBM 2002), Edinburgh, Scotland, July 24-26, 2002.
27. N. Freed, N. Borenstein, "Multipurpose Internet Mail Extensions (MIME) Part One: Format of Internet Message Bodies", RFC 2045, http://www.faqs.org/rfcs/rfc2045.html
28. "OMG IDL Syntax and Semantics", http://www.omg.org/cgi-bin/doc?formal/02-06-39
29. Paraver, http://www.cepba.upc.es/paraver


30. GridSolve, http://icl.cs.utk.edu/netsolve/
31. Michael O. Neary, Alan Phipps, Steven Richman, and Peter Cappello, "Javelin 2.0: Java-Based Parallel Computing on the Internet", Euro-Par 2000, Munich, Germany, August 29 - September 1, 2000.
32. Peter Cappello, Bernd Christiansen, Mihai F. Ionescu, Michael O. Neary, Klaus E. Schauser, and Daniel Wu, "Javelin: Internet-Based Parallel Computing Using Java", ACM Workshop on Java for Science and Engineering Computation, Las Vegas, NV, June 21, 1997.
33. F. Berman, R. Wolski, S. Figueira, J. Schopf, and G. Shao, "Application-level Scheduling on Distributed Heterogeneous Networks", in Proceedings of the ACM/IEEE Conference on Supercomputing (SC'96), Pittsburgh, PA, November 1996.
34. E. J. Baldeschwieler, R. Blumofe, and E. Brewer, "ATLAS: An Infrastructure for Global Computing", in Proceedings of the Seventh ACM SIGOPS European Workshop on System Support for Worldwide Applications, Connemara, Ireland, September 1996, pp. 165–172.
35. H. Takemiya, "Constructing Grid Applications on the World-Wide Testbed - Climate Simulation and Molecular Simulation", 6th HLRS Metacomputing and GRID Workshop, Stuttgart, Germany, May 2003.


#include ...
int gs_result;

void Filter(file referenceCFG, double latency, double bandwidth, file newCFG)
{
    char buff_latency[GS_GENLENGTH];
    char buff_bandwidth[GS_GENLENGTH];

    /* Parameter marshalling: scalar parameters are converted to strings */
    sprintf(buff_latency, "%.20g", latency);
    sprintf(buff_bandwidth, "%.20g", bandwidth);

    /* Hand the task over to the GRID superscalar run-time */
    Execute(FilterOp, 1, 2, 1, 0,
            referenceCFG, buff_latency, buff_bandwidth, newCFG);
}
...

Figure 6. Example of stubs generated for the user functions


#include ...
int main(int argc, char **argv)
{
    enum operationCode opCod = (enum operationCode)atoi(argv[2]);

    IniWorker(argc, argv);
    switch(opCod) {
        case FilterOp:
        {
            double latency;
            double bandwidth;
            latency = strtod(argv[4], NULL);
            bandwidth = strtod(argv[5], NULL);
            Filter(argv[3], latency, bandwidth, argv[6]);
        }
        break;
        ...
    }
    EndWorker(gs_result, argc, argv);
    return 0;
}

Figure 7. Example of skeleton generated for the server code


Figure 8. GRID superscalar application files organization

Figure 9. Automatic code generation for Perl applications (diagram nodes: app.idl, stubgen, swig, the C compiler and GRIDsuperscalar.so; worker-side files app-stubs.c, app.i, app-worker.pl and app-functions.pm; master-side files app.pl, app.so and app.pm; plus the generated app_wrapper.c).



Figure 10. Data Flow Graphs of the NAS Grid Benchmarks, class S, built from Launch and Report nodes, NPB solver nodes (SP, LU, MG, FT) and MF filter tasks.



Figure 11. Task elapsed time composition for the NAS Grid Benchmark VP, size W (per-task bar chart, time in seconds, showing the Request to Active, Active to Done and task duration components, i.e. the Globus overhead per task).


Figure 12. NAS Grid Benchmark VP: task assignment to servers. In dark color, tasks assigned to server Kadesh; in light color, tasks assigned to server Khafre. Dashed lines represent file transfers between servers.


Figure 13. Task elapsed time composition when two servers are used, NAS Grid Benchmark VP, size S (per-task bar chart for the 1+1 configuration, time in seconds, showing the Request to Active, Active to Done and task duration components).


Figure 14. NAS Grid Benchmark VP, size S: task scheduling when two servers are used
