The GRID superscalar project: current status

Rosa M. Badia, Pieter Bellens, Vasilis Dialinos, Jorge Ejarque, Jesús Labarta, Josep M. Pérez and Raül Sirvent

Abstract— GRID superscalar is a Grid programming environment that makes it easy to write applications that run efficiently on a computational Grid. It parallelises a sequential application at runtime and at the task level, and executes it on the Grid. This approach benefits applications composed of coarse-grained tasks, where each task can be the size of a simulation, a program or a solver. Such applications are very common in bioinformatics, computational chemistry and other scientific fields. From the very first prototype in Condor, GRID superscalar has evolved into a robust framework based on Globus and other middlewares. The effort of the GRID superscalar project now goes beyond the Grid computing field, tackling the newer field of programming multi-core chip platforms. This paper describes the currently available versions of GRID superscalar.

Keywords— Grid computing, Grid programming models, Grid workflows, Cell BE processor, multicore programming models

I. Motivation

GRID computing is a buzzword coined in the 1990s by Ian Foster to describe a collection of geographically distributed resources connected through a wide area network [1]. The resources are not only computers, but also storage, sensors, PDAs, etc. In such an organisation, the Grid appears to an end user as one large virtual system. The objective is not only to provide computing power but also to allow access to data or enable collaboration. Among the most well-known tools that have contributed to enabling the Grid are the Globus Toolkit [2], Condor [3] and Unicore [4].

GRID superscalar is a programming framework for Grid environments. The first prototype of GRID superscalar was developed in 2002 as the final-year project (PFC) of one of the authors and was based on Condor. The motivation to create such a system was clear: the tools available for programming the Grid at that time were very low level. The application developer had to take care of every aspect of executing his or her programs on the Grid: selecting the resource, transferring the data required for the computation to the resource, submitting the computation, transferring the results back from the resource, etc.

One of the objectives of GRID superscalar is to provide a programming environment where the applications are run on the Grid but where the Grid is transparent to the programmer. GRID superscalar takes care of all the activities related to the Grid: resource selection, scheduling, file transfer, job submission, etc.

Barcelona Supercomputing Center - Centro Nacional de Supercomputación (BSC-CNS) and Dept. Arquitectura de Computadors (UPC). E-mail: {rosa.m.badia, pieter.bellens, vasilis.dialinos, jorge.ejarque, jesus.labarta, josep.m.perez, raul.sirvent}@bsc.es

Besides that, the other goal pursued by GRID superscalar is to increase the performance of the application executed on the Grid whenever possible. The paradigm used to implement this optimisation has been imported from the processor design field. The analogy is easy to understand: in a superscalar processor there are several functional units, but assembly code is written as a sequential flow and, at runtime, the dependencies between the different instructions are analysed. The parallelism is increased with techniques such as register renaming; instructions are then executed concurrently, even out of order, but finally they are reordered and committed in such a way that, for the programmer, the sequential flow appears to have been respected. In the analogy between a processor and GRID superscalar, the processor functional units are the computational resources distributed in the Grid, the registers are files, and the time scale is different, since processor instructions last a few nanoseconds while the operations run on a Grid resource last minutes or hours.

When designing GRID superscalar we applied many of the ideas described above to parallelise a sequential-flow application. The main requirement is that the application should be composed of coarse-grained functions, so that GRID superscalar is able to find at runtime the data dependencies between these functions (tasks) and to exploit the inherent parallelism at the task level.

From the first prototype in Condor, we have implemented versions for Globus Toolkit version 2, Globus Toolkit version 4, Ninf-G [6], and SSH/SCP (see section II). In section III we describe the second version of GRID superscalar, which is based on a source to source compiler (whereas stub code generation was used in version 1) and is able to exploit the parallelism in the local client with threads. The compiler and runtime of GRID superscalar version 2 are very generic, and it has been possible to adapt them to create a programming model for multi-core chips: Cell Superscalar (CellSs) (see section IV). Finally, section V concludes the paper and presents the ongoing work in the GRID superscalar project.

II. GRID superscalar version 1

The process of generating and executing an application with GRID superscalar [5] is composed of a static phase and a dynamic phase. In the static phase, a set of tools for automatic code generation, deployment and configuration checking are used.

In the dynamic phase, the GRID superscalar runtime library and basic Globus Toolkit [2] or other Grid middleware services are used (more precisely, ssh/scp and Ninf-G [6]).

The application developer provides a sequential program, composed of a main program and application tasks, together with an IDL file that identifies the coarse-grained tasks. For each task, the list of parameters is specified, indicating the type and direction of each parameter (whether it is an input, an output, or both at the same time). The main code may also require minor adjustments, so that all data required by the tasks is passed as parameters, the parameters comply with the types supported by the IDL, and files used outside of the tasks are handled appropriately.

In the static phase, two sets of files are automatically generated by the GRID superscalar tools (gsstubgen) that can be used to build an application ready to run on the Grid. With these two sets of files a client-server application is built, which has the same functional behaviour as the initial user application. The client binary runs on the localhost and submits calls to the server binaries on remote hosts of the computational Grid. The server workers only execute the functionality of the tasks listed in the IDL file; the main program (client), executed on the local host, executes the rest of the application code. Still in the static phase, a graphical interface is provided to help users set up the Grid configuration and deploy the application (Deployment Center). After this step the application is ready to be run on a computational Grid.

In the dynamic phase, the application is started on the localhost the same way it would have been started originally. While the functional behaviour of the application remains the same, the GRID superscalar library exploits the inherent parallelism of the application at the coarse-grained task level and executes the tasks independently on remote hosts of the computational Grid. To exploit this parallelism, the GRID superscalar runtime builds a data dependence graph where each node represents one of the tasks of the application. The edges between the nodes of the graph represent file dependences between those tasks, which are due to files that are read or written by the tasks. In this sense, a task that writes to a given file should be executed before another that reads the same file; the data dependence is therefore represented by an edge from the first task to the second in the graph. The edges define a relative execution order that must be respected. From this task graph, the GRID superscalar runtime is able to exploit the parallelism by sending tasks that do not have any dependencies between them to remote hosts on the Grid. Figure 1 represents the behaviour described above. In each case, the GRID superscalar broker selects, from the set of available hosts, the one that is best suited.
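As an illustration of these inputs, consider a small invented application (the task names, file names and IDL details below are ours, purely for illustration; the exact syntax and API are described in [5]). The IDL file declares the coarse-grained tasks and the direction of their parameters, while the main program remains an ordinary sequential C program whose task calls are intercepted by the generated stubs:

    /* Contents of the (illustrative) IDL file: each task lists its parameters
       with a type and a direction (in, out or inout).

       interface EXAMPLE {
           void filter(in File original, in double factor, out File filtered);
           void merge(in File partA, in File partB, out File result);
       };
    */

    /* app.c -- illustrative sequential main program */
    #include <stdio.h>
    #include "example.h"   /* hypothetical header generated by gsstubgen */

    int main(void)
    {
        /* Two independent filter tasks: they share no files, so the runtime
           may run them concurrently on different hosts. */
        filter("imageA.dat", 0.5, "A_filtered.dat");
        filter("imageB.dat", 0.5, "B_filtered.dat");

        /* merge reads both filtered files: the runtime adds an edge from each
           filter task to this one and delays it until both have finished. */
        merge("A_filtered.dat", "B_filtered.dat", "result.dat");

        printf("result left in result.dat\n");
        return 0;
    }

At runtime the two filter calls become independent nodes of the data dependence graph, while merge depends on both of them; this is exactly the task-level parallelism the runtime exploits.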

Fig. 1. Overview of the dynamic behaviour of GRID superscalar

This selection favours reducing the total execution time, which is computed not only as the estimated execution time of the task on the host but also taking into account the time that will be spent transferring all the files required by the task to that host. This allocation policy exploits file locality, reducing the total number of file transfers.

All Grid-related actions (file transfers, job submission, end-of-task detection, result collection) are done transparently to the user. For each task, the GRID superscalar runtime transfers the required files from their current locations to the selected host, submits the task for execution, and detects when the task has finished. At the end of the execution of a task, the data dependence graph is updated and the used resource is deallocated. As a result, any ready task may become eligible for submission to that resource, and tasks that depended on the finished task may become ready for execution. When the application has finished, the application output files are collected and copied to their final locations on the localhost, all working directories on the remote hosts are cleaned, and everything is left as if the original application had been run locally.

In general, techniques such as file renaming, file locality, disk sharing, checkpointing or ClassAds [7] constraint specification are applied to increase the application performance, to save computation already performed, or to select resources in the Grid. A set of tools is also offered to assist the user in working with GRID superscalar: gsstubgen, gsbuilder, the GRID superscalar monitor and the Deployment Center. Summarising, from the user's point of view, the execution of a GRID superscalar application looks like an application that has been run locally, except that the execution has ideally been much faster thanks to the Grid resources.

A. Tailored version for clusters

The ssh GRID superscalar is a version tailored and optimised for applications that will be executed on clusters. The development of this version started off from GRID superscalar version 1.5.2 and followed its own development branch. The objective of this version was to substitute the calls to the Globus middleware [2] (typically not found in clusters) with calls to ssh and scp, providing the same functionality.
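The kind of substitution involved can be sketched as follows; this is only an illustration of the idea (the real ssh GRID superscalar delegates these calls to helper shell scripts, and the function and flag names below are ours):

    #include <stdio.h>
    #include <stdlib.h>

    /* Illustrative file-transfer helper: with a globally shared file system a
       plain cp (or no copy at all) suffices; otherwise fall back to scp. */
    static int transfer_file(const char *src, const char *dest_host,
                             const char *dest_path, int shared_filesystem)
    {
        char cmd[1024];

        if (shared_filesystem)
            snprintf(cmd, sizeof(cmd), "cp %s %s", src, dest_path);
        else
            snprintf(cmd, sizeof(cmd), "scp %s %s:%s", src, dest_host, dest_path);

        return system(cmd);
    }

    int main(void)
    {
        /* Example: stage an input file on node01 before submitting a task there. */
        return transfer_file("input.dat", "node01", "/tmp/gs_work/input.dat", 0);
    }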

While ssh GRID superscalar was intended as a version tailored for clusters, during the development and debugging period it proved to be efficient and robust also for executions on wide-area Grids. In order to apply further optimisations to the runtime while maintaining the generality that a Grid version requires, efforts were made to keep the library configurable for both execution environments. Compile-time flags can switch off certain optimisations that are only applicable to a cluster environment. Assuming a global shared file system and no need for encrypted communication, we can reduce the file transfers to a minimum and adopt a faster and simpler TCP/IP socket mechanism for end-of-task notification.

The runtime is composed of a master and a worker library, with which the master and worker code of a GRID superscalar application are dynamically linked, respectively. Since, to our knowledge, there was no C ssh/scp API at that time, shell scripts cooperating with the library make the actual calls to the ssh and scp (or simple cp) commands. While all tools available in GRID superscalar version 1 have been ported to work with the ssh GRID superscalar version, we saw no need to do the same with the Deployment Center. Indeed, only part of the functionality provided by the Deployment Center is necessary for the ssh version. For that reason, we found it more convenient to provide users with a simple configuration script to set up the execution environment.

Work has also been done on task scheduling. Newer versions of the ssh GRID superscalar incorporate a new scheduling algorithm that scans the dependency graph for critical paths and promotes them for execution, resulting in significant speedups for some applications. In conclusion, GRID superscalar, relieved from the overhead that the Globus middleware was introducing, appears to be a perfectly valid paradigm for cluster applications, and in some cases it can even outperform its message-passing counterparts [12].

B. Using Ninf-G as Grid middleware

Ninf-G implements the GridRPC API [13], which offers Grid primitives for task submission and session control. GRID superscalar and GridRPC both adhere to the master-worker paradigm and relieve the programmer from the low-level intricacies of a Grid environment. The Ninf-G runtime (Grid Technology Research Center, National Institute of Advanced Industrial Science and Technology, Japan) basically provides its users with the GridRPC interface. The GridRPC API tries to standardise and implement a portable and simple remote procedure call (RPC) mechanism for Grid computing, as pursued by the Global Grid Forum Research Group on Programming Models. Early implementations of Grid middleware offering client-level access to Grid resources, like NetSolve [14] and Ninf, came equipped with a comparable interface. The culmination of these similarities is the GridRPC API.
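For reference, a client written directly against this interface looks roughly as follows. The call names are taken from the GridRPC recommendation implemented by Ninf-G, but the configuration file, server name, function name and argument list are placeholders and should be checked against [13]; GRID superscalar users never write such code themselves, since the runtime issues the calls on their behalf:

    #include <grpc.h>   /* GridRPC client header shipped with Ninf-G */

    int main(void)
    {
        grpc_function_handle_t handle;
        double factor = 0.5;

        /* Initialise the GridRPC system from a client configuration file. */
        grpc_initialize("client.conf");

        /* Bind a handle to a remote executable registered on a server. */
        grpc_function_handle_init(&handle, "server.example.org", "example/filter");

        /* Synchronous remote call; the arguments depend on the remote interface. */
        grpc_call(&handle, "imageA.dat", factor, "A_filtered.dat");

        grpc_function_handle_destruct(&handle);
        grpc_finalize();
        return 0;
    }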

We have ported GRID superscalar to Ninf-G with the intention of speeding up our runtime. Both the advanced Ninf-G file transfer protocol and the ability to create persistent worker processes seemed to justify this decision. Another motivation was the desire to clean up the GRID superscalar code, getting rid of the tedious RSL mechanisms of the Globus Toolkit version. The latter goal was achieved easily: the Ninf-G interface is very clean and functional compared to the Globus Toolkit one. Experimentally, the ported runtime proved to be faster than the original one, although one should be cautious when interpreting this result. The aforementioned properties of the Ninf-G runtime improve the speed of GRID superscalar in the case of small-grained tasks: when the communication cost is significant, speeding up this aspect improves the overall performance. However, for larger-grained tasks the communication cost is insignificant compared to the computation cost. Amdahl's law then points out that the speedup we can expect from improving the communication properties of the code is small to non-existent: if communication accounts for a fraction p of the execution time, eliminating it entirely yields a speedup of at most 1/(1-p), which tends to 1 as p becomes small. Furthermore, this port introduces a worker-daemon approach to enable worker-worker communication in the Ninf-G runtime, as well as a handle-management mechanism. These aspects not only improve GRID superscalar efficiency, but also increase the functionality of Ninf-G as a whole.

III. GRID superscalar version 2

During the development of GRID superscalar we realised that, although it had many features, we wanted to enhance it beyond what its architecture allowed. Hence, we started a new version of GRID superscalar from scratch [11], taking into account our experience with the first version. There are two leading ideas behind the new version: providing an enhanced programming model and structuring the code into a component-based model that allows us to enhance it further. The new codebase has also allowed us to add other important features such as tracing and threading support.

A. Component based implementation

The new GRID superscalar version has been designed using a loosely coupled component architecture. This design allows us to develop several parts of the library in parallel with little interference. Figure 2 shows the component organisation of this version.

Fig. 2. Component organisation of GRID superscalar. The components shown are: GRID superscalar API, Core Component, Task Management, Task Scheduler & Resource Selector, Resource Directory, File Manager, Remote Execution, Data Serialization, Local Execution, Middleware Abstraction, Threads Abstraction, Globus 2.4, Pthreads, and Tracing.
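One way to picture the Middleware Abstraction layer of Figure 2 is as a table of operations with one implementation per supported toolkit, selected when the library is built. The structure and names below are ours, purely to illustrate the design, and do not reproduce the actual GRID superscalar interfaces:

    #include <stdio.h>

    /* Illustrative middleware abstraction: every backend provides the same set
       of operations, and the build selects which backend is compiled in. */
    typedef struct {
        int (*transfer_file)(const char *src, const char *host, const char *dest);
        int (*submit_task)(const char *host, const char *task);
    } middleware_ops;

    /* Stand-in backend that only prints what a Globus or ssh/scp backend would do. */
    static int demo_transfer(const char *src, const char *host, const char *dest)
    {
        printf("transfer %s to %s:%s\n", src, host, dest);
        return 0;
    }

    static int demo_submit(const char *host, const char *task)
    {
        printf("submit task %s on %s\n", task, host);
        return 0;
    }

    /* In the real library a compile-time flag would select the Globus 2.4 or
       ssh/scp implementation here; higher-level components only see the table. */
    static const middleware_ops middleware = { demo_transfer, demo_submit };

    int main(void)
    {
        middleware.transfer_file("input.dat", "node01", "/tmp/input.dat");
        return middleware.submit_task("node01", "filter");
    }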

Additionally, it provides a good framework for investigating several approaches in different areas of the library. For example, it allows us to have several implementations of the middleware component in the same codebase and to decide at compilation time which Grid toolkit will be used.

B. Extended support for scalar types with full renaming

One of the main aspects of the programming model that we wanted to enhance was type support. In the first version, whenever scalars were used as output parameters of a task, the execution had to stop until the task had finished. This limitation was a consequence of the master library's limited type support, which internally only handled files and strings. As a result, the stubs generated by the stubgen tool had to perform data type conversions to interface the user code with the master library, which required waiting until the task had finished before executing the code that followed it.

In the new version, this limitation is overcome by fully supporting scalars in the master library. First, the API of the master library has been enhanced to take data types into account. In the new API, each parameter is passed together with a description of its type. Furthermore, variables are passed by reference so that the runtime can handle them. Second, the master library has been enhanced to support full renaming of scalars. This is accomplished by identifying the data storage behind a parameter: whereas files are identified by their name, variables are identified by their addresses. All renaming mechanisms that were applied to files in the previous version are now also applied to scalar variables.

C. Support for structs and arrays

A new feature of GRID superscalar is support for struct and array parameters. This support is currently limited to multidimensional arrays and to structs that only contain scalars. Nevertheless, the renaming mechanism is applied to them just as it is applied to scalars.

D. Annotation based programming model

The programming syntax has changed greatly. Whereas the previous version used an interface description file that indicated which functions were tasks and specified their parameters, the new version uses annotations in the source code, in a similar fashion to OpenMP [8]. The new annotations allow specifying all the information that the IDL files did in the previous version, plus array lengths and array access ranges. Additionally, the old barriers have been replaced by a new synchronisation mechanism that allows stopping the execution until a given set of variables contain their correct values.
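The exact annotation syntax is beyond the scope of this paper; purely as an illustration of the idea, a task declaration in this OpenMP-like style could look as follows (the pragma spelling and clause names are invented for this example and do not reproduce the actual GRID superscalar version 2 syntax):

    /* Hypothetical annotation marking a function as a task and declaring the
       direction and length of each parameter, in the spirit of OpenMP pragmas. */
    #pragma gss task input(factor, original[n]) output(filtered[n])
    void filter(double factor, int n, double original[n], double filtered[n])
    {
        for (int i = 0; i < n; i++)
            filtered[i] = factor * original[i];
    }

    int main(void)
    {
        double a[1024], b[1024];
        for (int i = 0; i < 1024; i++)
            a[i] = (double) i;

        /* An ordinary call: the source to source compiler replaces it with a
           call into the runtime, which records the task and its dependences. */
        filter(0.5, 1024, a, b);

        /* Hypothetical synchronisation point: block until b holds its correct
           values, replacing the old global barriers. */
        #pragma gss wait(b)
        return 0;
    }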

Code annotations are handled by compiling the source code with a new source to source compiler based on the Mercurium compiler [9].

E. Threading support

Performance is an important aspect of GRID superscalar. In the past, when we ran an application on an SMP host and wanted to take advantage of the local CPUs, we had to submit tasks locally through the middleware. The overhead of using the middleware, in this case Globus 2, has been shown to be significant for small tasks [5]. To improve this aspect, we have added a mechanism to GRID superscalar that allows it to run tasks in the main process. This mechanism uses additional threads in the main process to execute tasks concurrently. Since the tasks run in the same process address space, no data encoding or decoding to files is required and no expensive middleware calls are needed, considerably reducing the total overhead.

F. Tracing

Tracing is a new feature of the new version. It allows recording the evolution of a GRID superscalar application execution for post-mortem analysis. Our implementation generates traces suitable for the Paraver visualisation tool [10]. The recorded data includes information about the events that happened during the application execution and the times at which they happened. The events include: calls to the GRID superscalar API (adding a task, waiting on data, ...), scheduling decisions (starting a task), file transfers, task execution on the resources and internal locking mechanisms. All these data can be used to extract valuable information such as resource usage, task duration on the resources or time spent transferring files, which allows the programmer to make informed decisions about how to improve the performance of his or her GRID superscalar application.

IV. Cell Superscalar

The design of processors has reached a technological limitation in recent years: designing better-performing processors is increasingly difficult, mainly due to power usage and heat generation. Manufacturers are instead building chips with multiple processors [15]. With the appearance of these multi-core architectures, developers face the challenge of adapting their applications to use threads that can exploit all the hardware possibilities.

The first generation of the Cell Broadband Engine (BE) [16] includes a 64-bit multithreaded PowerPC processor element (PPE) and eight synergistic processor elements (SPEs), connected by an internal high-bandwidth Element Interconnect Bus (EIB).

The PPE has two levels of on-chip cache and also supports IBM's VMX to accelerate multimedia applications using its SIMD units. However, the main computing power of the Cell BE is provided by the eight SPEs. The SPE is a processor designed to accelerate media and streaming workloads. The local memory of the SPEs is not coherent with the PPE main memory, and data transfers to and from the SPE local memories must be explicitly managed using a DMA engine. Most SPE instructions operate in a SIMD fashion on 128 bits representing, for example, two 64-bit double-precision floating-point numbers or long integers, or four 32-bit single-precision floating-point numbers or integers. The 128-bit operands are stored in a 128-bit register file. Memory instructions also address 128-bit operands, which must be aligned at addresses that are multiples of 16 bytes. Data is transferred by DMA to the SPE local memory in units of 128 bytes, with up to 16 concurrent DMA requests of up to 16 KB of data each.

Cell Superscalar (CellSs) is an adaptation of GRID superscalar version 2 to this architecture. In this case, the main code runs on the PPE while the tasks run on the SPEs. As in version 2, the user functions that have to be run on the SPEs are annotated, and the source to source compiler generates both the code that runs on the PPE and the code that runs on the SPEs. The annotations are identical to those of GRID superscalar v2. Also, the code generated by the source to source compiler is very similar to the one generated for the Grid, since the interfaces between the code and the library are almost identical and most of the changes only affect the runtime library internally.

The GRID superscalar runtime has been modified by adding a specific Cell Execution component that lies side by side with the Remote Execution and Local Execution components; in fact, it has characteristics of both. While it uses threads (SPE threads) to execute the tasks on the SPEs, the SPE threads do not share memory among themselves nor with the PPE. Therefore, similarly to the Grid tasks, the runtime has to prepare and send the parameters to the SPE thread before invoking the execution of a task. The size and location of the parameters, which have to be read by the SPE before executing a task, are encoded in a data structure called the task control buffer. This structure also indicates which type of function should be executed on the SPE, the size and location of the output parameters (to be written at the end of the execution of the task), and other control information.

The code executed on the SPE is an independent program that is started at the beginning of the application by the main program. Each SPE program waits for a task to execute by reading a mailbox. When a task is ready for execution on an SPE, the PPE program writes into the mailbox the size and location of the task control buffer.

The SPE program then fetches the task control buffer from main memory, gets the parameters, executes the task, sends the output parameters back to main memory and waits again for a new task or for a termination indication. Figure 3 illustrates this behaviour.
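A rough sketch of this protocol, seen from the SPE side, is shown below. The structure layout and the helpers read_mailbox, dma_get, dma_put and run_task are placeholders standing in for the Cell mailbox and DMA primitives and for the generated task code; the real CellSs data structures and protocol details differ (for instance, the sketch uses only the buffer location and a fixed-size structure):

    #include <stdint.h>

    /* Illustrative task control buffer: which function to run and where its
       inputs and outputs live in main memory (addresses and sizes). */
    typedef struct {
        uint32_t func_id;        /* which annotated function to execute      */
        uint32_t num_in;         /* number of input parameters               */
        uint32_t num_out;        /* number of output parameters              */
        uint64_t in_addr[8];     /* main-memory addresses of the inputs      */
        uint32_t in_size[8];
        uint64_t out_addr[8];    /* where to write the outputs when finished */
        uint32_t out_size[8];
    } task_control_buffer;

    /* Placeholder primitives standing in for the Cell mailbox and DMA engine. */
    extern uint64_t read_mailbox(void);            /* blocks until the PPE writes */
    extern void dma_get(void *local, uint64_t ea, uint32_t size);
    extern void dma_put(const void *local, uint64_t ea, uint32_t size);
    extern void run_task(const task_control_buffer *tcb, char *local_store);

    #define TERMINATE 0   /* sentinel message meaning "no more tasks" */

    int spe_main_loop(char *local_store)
    {
        task_control_buffer tcb;

        for (;;) {
            uint64_t msg = read_mailbox();    /* location of the control buffer */
            if (msg == TERMINATE)
                break;

            dma_get(&tcb, msg, sizeof(tcb));  /* fetch the task control buffer  */

            /* Fetch the input parameters into local store, one after another. */
            uint32_t off = 0;
            for (uint32_t i = 0; i < tcb.num_in; i++) {
                dma_get(local_store + off, tcb.in_addr[i], tcb.in_size[i]);
                off += tcb.in_size[i];
            }

            /* Execute the annotated function; it reads the inputs from local
               store and writes its outputs back to local store from offset 0. */
            run_task(&tcb, local_store);

            /* Send the output parameters back to main memory. */
            off = 0;
            for (uint32_t i = 0; i < tcb.num_out; i++) {
                dma_put(local_store + off, tcb.out_addr[i], tcb.out_size[i]);
                off += tcb.out_size[i];
            }
        }
        return 0;
    }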

Fig. 3. Communication between the PPE and SPE in Cell Superscalar

Since there is no shared address space between any of the Cell processing elements, the Cell version has a specific Task Scheduler and Resource Selector component that not only takes data locality into account, but also applies a specific scheduling policy which divides the pending tasks into groups. Its objective is to reduce the amount of data transferred between the PPE and the SPEs and between the SPEs themselves. The current CellSs implementation is able to keep the results of a computation in the SPE instead of transferring them back to main memory. Then, if another task needs a previous result that is stored in the SPE on which the task will be executed, the stored value is used, avoiding the data transfer.

During the application execution, the scheduling policy considers the subgraph composed of the tasks in the ready list and the subsets of nodes in the task dependency graph that are connected to the ready nodes up to a given depth. This subgraph is then partitioned into as many groups as there are available SPEs, guided by the parental relations between the nodes. In this sense, the scheduling policy tries to cluster source nodes together with their sink nodes, trying to reduce the number of transfers while preserving the concurrency of the execution. Each partition is initially assigned to one SPE; the tasks in a partition are sent for execution independently (not as a whole). The static schedule can be changed dynamically to solve workload imbalance, at the cost of some data transfers in some cases. This happens when an SPE is detected to be idle with no tasks left in its corresponding partition, while other SPEs still have ready tasks waiting for execution in their partitions; some of those tasks are then reassigned dynamically to other partitions.
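A much simplified sketch of the locality-driven part of this choice is shown below; it is not the actual CellSs policy, only an illustration of the idea that a ready task is preferably assigned to the SPE that already holds most of its input data in its local store:

    #include <stddef.h>

    #define NUM_SPES 8

    /* Illustrative bookkeeping: for each ready task, which SPE (if any) already
       holds each of its inputs as the cached result of a previous task. */
    typedef struct {
        int num_inputs;
        int input_location[16];   /* SPE index holding the input, or -1 if in main memory */
        size_t input_size[16];
    } ready_task;

    /* Pick the SPE that already holds the largest volume of the task's inputs,
       so that the fewest bytes have to be transferred before execution. */
    int choose_spe(const ready_task *t)
    {
        size_t local_bytes[NUM_SPES] = {0};
        int best = 0;

        for (int i = 0; i < t->num_inputs; i++)
            if (t->input_location[i] >= 0)
                local_bytes[t->input_location[i]] += t->input_size[i];

        for (int s = 1; s < NUM_SPES; s++)
            if (local_bytes[s] > local_bytes[best])
                best = s;

        return best;   /* a real scheduler would also balance the load across SPEs */
    }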

The tracing features of version 2 have been adapted for the Cell BE architecture. In this case, the tracing directly records the behaviour of the tasks on the PPE and the SPEs: when a task is issued for execution, when its data is DMAed in or out, when the functions are executed, etc. This tracing is done directly on the PPE and SPE processors. The current prototype of CellSs has been tested and the preliminary results are promising.

V. Conclusions and ongoing work

This paper has presented the current status of the GRID superscalar project. Currently, GRID superscalar binaries are distributed at no cost and are used by external users as well as by internal users at BSC. GRID superscalar will also be extended in the FP6 project BEinGRID and used as input for the projects Brein and XtreemOS. In our opinion, these facts demonstrate the validity of the environment and ensure the continuity of the project, at least for some years. Besides the enhancements of the developments described in the previous sections, the ongoing work in the project includes a fault-tolerant version of GRID superscalar, an interface with the GridBUS broker [17] and the development of a web-based version of the Deployment Center.

Acknowledgements

This work has been partially funded by the Ministry of Science and Technology of Spain (TIN2004-07739-C02-01) and the CoreGRID European Network of Excellence (FP6-004265).

References

[1] I. Foster and C. Kesselman (eds.), The Grid: Blueprint for a New Computing Infrastructure, Morgan Kaufmann, 1999.
[2] I. Foster and C. Kesselman, Globus: A Metacomputing Infrastructure Toolkit, Int. Journal of Supercomputer Applications, 11(2):115-128, 1997.
[3] Douglas Thain, Todd Tannenbaum and Miron Livny, Distributed computing in practice: the Condor experience, Concurrency - Practice and Experience, Vol. 17, No. 2-4, pp. 323-356, 2005.
[4] Dietmar W. Erwin and David F. Snelling, UNICORE: A Grid Computing Environment, Lecture Notes in Computer Science, Vol. 2150, pp. 825-834, 2001.
[5] R. M. Badia, J. Labarta, R. Sirvent, J. M. Pérez, J. M. Cela and R. Grima, Programming Grid Applications with GRID Superscalar, Journal of Grid Computing, 1(2):151-170, 2003.
[6] Y. Tanaka, H. Nakada, S. Sekiguchi, T. Suzumura and S. Matsuoka, Ninf-G: A Reference Implementation of RPC-based Programming Middleware for Grid Computing, Journal of Grid Computing, 1(1):41-51, 2003.
[7] R. Raman, M. Livny and M. Solomon, Matchmaking: Distributed Resource Management for High Throughput Computing, Proceedings of the Seventh IEEE International Symposium on High Performance Distributed Computing, July 28-31, 1998, Chicago, IL.
[8] OpenMP Application Program Interface v2.5, OpenMP Architecture Review Board, May 2005.
[9] M. Gonzalez, J. Balart, A. Duran, X. Martorell and E. Ayguadé, Nanos Mercurium: a Research Compiler for OpenMP, Proceedings of the European Workshop on OpenMP 2004, October 2004.
[10] Jesús Labarta, Sergi Girona, Vincent Pillet, Toni Cortes and Luis Gregoris, DiP: A Parallel Program Development Environment, 2nd International Euro-Par Conference (Euro-Par 96), Lyon, France, August 1996.
[11] J. M. Perez, R. M. Badia and J. Labarta, Scalar-aware GRID superscalar, Technical report UPC-DAC-RR-2006-12, UPC-DAC, April 2006.
[12] Vasilis Dialinos, Rosa M. Badia, Raül Sirvent, Josep M. Pérez and Jesús Labarta, Implementing Phylogenetic Inference with GRID superscalar, Cluster Computing and Grid 2005 (CCGRID 2005), Cardiff, UK, 2005.

[13] GridRPC Working Group, https://forge.gridforum.org/projects/gridrpc-wg/
[14] D. Arnold, S. Agrawal, S. Blackford, J. Dongarra, M. Miller, K. Seymour, K. Sagi, Z. Shi and S. Vadhiyar, Users' Guide to NetSolve V1.4.1, Innovative Computing Dept. Technical Report ICL-UT-02-05, University of Tennessee, 2002.
[15] D. Geer, Chip Makers Turn to Multicore Processors, IEEE Computer, May 2005.
[16] D. Pham et al., The Design and Implementation of a First-Generation Cell Processor, Proceedings of the 2005 IEEE International Solid-State Circuits Conference (ISSCC), 2005.
[17] The Gridbus project, http://www.gridbus.org/middleware/