Language Extensions and Compilation Techniques for Data Intensive Computations

Gagan Agrawal
Department of Computer and Information Sciences
University of Delaware, Newark DE 19716
[email protected]

Renato Ferreira    Joel Saltz
Department of Computer Science
University of Maryland, College Park MD 20742
{renato,[email protected]

Abstract

Processing and analyzing large volumes of data plays an increasingly important role in many domains of scientific research. Typical examples of very large scientific datasets include long running simulations of time-dependent phenomena that periodically generate snapshots of their state, archives of raw and processed remote sensing data, and archives of medical images. High-level language and compiler support for developing applications that analyze and process such datasets has, however, been lacking so far. We are developing language extensions and a compilation framework for expressing applications that process large multidimensional datasets in a high-level data-parallel fashion. We have chosen a dialect of Java for expressing these applications. Our dialect of Java includes data-parallel extensions for specifying collections of objects, a parallel for loop, and reduction variables. Our compiler will analyze parallel loops and optimize the processing of datasets through the use of an existing runtime system, called the Active Data Repository (ADR), developed at the University of Maryland. We present the design of a compiler/runtime interface which allows the compiler to effectively utilize the existing runtime system. We show how interprocedural static program slicing can be used by the compiler to extract relevant information for the runtime system. Implementation of these compiler techniques is currently underway using the Titanium infrastructure.

1 Introduction

Processing and analyzing large volumes of data plays an increasingly important role in many domains of scientific research. Typical examples of very large scientific datasets include long running simulations of time-dependent phenomena that periodically generate snapshots of their state [14, 48, 43], archives of raw and processed remote sensing data (e.g., AVHRR [40]), and archives of medical images [20]. These datasets are usually multi-dimensional. The data dimensions can be spatial coordinates, time, or varying experimental conditions such as temperature, velocity or magnetic field. The increasing importance of such datasets has been widely recognized. A number of systems have been designed to target queries into such large scale multi-dimensional datasets. These systems, however, are not optimized to carry out the kinds of combined data subsetting and preprocessing required for such datasets. As a result, applications that process large datasets are usually decoupled from data storage and management, resulting in inefficiency due to copying and loss of locality. Furthermore, every application developer has to implement complex support for managing and scheduling the processing.

We are developing methods that make it possible to produce efficient programs for carrying out multi-dimensional dataset processing and analysis using a high-level data parallel language. This is being accomplished by developing appropriate language extensions and a compilation framework, and by utilizing an existing runtime system called the Active Data Repository [8, 9, 11]. We have chosen a dialect of Java for expressing this class of computations. We have chosen Java because the computations we target can be easily expressed using the notion of objects and methods on objects, and because a number of projects are already underway for expressing parallel computations in Java and obtaining good performance on scientific applications [4, 7, 15, 16, 22, 21, 37, 54]. Our chosen dialect of Java includes data-parallel extensions for specifying collections of objects, a parallel for loop, distribution functions and reduction variables. However, the approach and the techniques developed are not intended to be language specific. Our overall thesis is that a data parallel framework will provide a convenient interface to large multi-dimensional datasets resident on persistent storage.

Our compiler extensively uses the existing runtime support ADR for optimizing resource usage while compiling data intensive applications. ADR integrates storage, retrieval and processing of multidimensional datasets on a parallel machine. While a number of applications have been developed using ADR and high performance has been demonstrated [8, 9, 10, 11, 47], developing applications in this style requires detailed knowledge of the design of the system. In comparison, our proposed data-parallel extensions to Java enable programming of data intensive applications at a much higher level. It becomes the responsibility of the compiler to utilize the services of ADR for memory management, data retrieval and scheduling of processing. In this paper, we describe the design of the compiler/runtime interface, which enables the compiler to effectively utilize this existing runtime system. We then describe how the technique of interprocedural program slicing is used by the compiler to extract the range, subscripting and aggregation functions from the input code, which are then passed to the runtime system.

The rest of the paper is organized as follows. In Section 2, we further describe the characteristics of the class of data intensive applications we target. Our chosen language extensions are described in Section 3. The design of the compiler/runtime interface is described in Section 4. The compiler techniques for extracting relevant functions to be passed as arguments to the runtime system are described in Section 5. We compare our work with existing related research efforts in Section 6 and conclude in Section 7.

* Author Agrawal was supported by NSF CAREER award ACI-9733520.

2 Data Intensive Applications

In this section, we first describe some of the scientific domains which require large datasets. Then, we describe some of the common characteristics of the applications we target. Data intensive applications from three scientific areas are currently being studied as part of our project.

Satellite data processing. Earth scientists study the earth by processing remotely-sensed data continuously acquired from satellite-based sensors, since a significant amount of earth science research is devoted to developing correlations between sensor radiometry and various properties of the surface of the earth. A typical analysis processes satellite data for ten days to a year and generates one or more composite images of the area under study. Generating a composite image requires projection of the globe onto a two dimensional grid; each pixel in the composite image is computed by selecting the "best" sensor value that maps to the associated grid point.

Virtual Microscope and Analysis of Microscopy Data. The Virtual Microscope [20] is an application to support the need to interactively view and process digitized data arising from tissue specimens. The Virtual Microscope provides a realistic digital emulation of a high power light microscope. The raw data for such a system can be captured by digitally scanning collections of full microscope slides under high power. At the basic level, it can emulate the usual behavior of a physical microscope, including continuously moving the stage and changing magnification and focus. Used in this manner, the Virtual Microscope can support completely digital dynamic telepathology. In addition, it enables new modes of behavior that cannot be achieved with a physical microscope, such as simultaneous viewing and manipulation of a single slide by multiple users, and three dimensional image reconstruction and registration from multiple microscope slides marked by various special stains.

Water contamination studies. Environmental scientists study the water quality of bays and estuaries using long running hydrodynamics and chemical transport simulations. The hydrodynamics simulation imposes an unstructured grid on the area of interest and determines circulation patterns and fluid velocities over time. The chemical transport simulation models reactions and transport of contaminants, using the fluid velocity data generated by the hydrodynamics simulation. This simulation is performed on a different unstructured spatial grid, and often uses significantly coarser time steps. This is achieved by mapping the fluid velocity information from the circulation grid, averaged over multiple fine-grain time steps, to the chemical transport grid and computing smoothed fluid velocities for the points in the chemical transport grid.

Data intensive applications in these and related scientific areas share many common characteristics. Access to data items is described by a range query, namely a multi-dimensional bounding box in the underlying multi-dimensional space of the dataset. Only the data items whose associated coordinates fall within the multi-dimensional box are retrieved. The basic computation consists of (1) mapping the coordinates of the retrieved input items to the corresponding output items, and (2) aggregating, in some way, all the retrieved input items mapped to the same output data items. The computation of a particular output element is a reduction operation, i.e., the correctness of the output usually does not depend on the order in which input data items are aggregated.

Another common characteristic of these applications is their extremely high storage and computational requirements. For example, ten years of global coverage satellite data at a resolution of four kilometers for our satellite data processing application Titan consists of over 1.4TB of data [11]. For our Virtual Microscope application, one focal plane of a single slide requires over 7GB (uncompressed) at high power, and a hospital such as Johns Hopkins produces hundreds of thousands of slides per year [1]. Similarly, the computation for one ten day composite Titan query for the entire world takes about 100 seconds per processor on the Maryland sixteen node IBM SP2. Efficient processing for these data intensive applications clearly requires use of multiple processors, with an associated very large disk farm.
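The structure just described can be summarized by the following schematic loop, written in the data-parallel dialect introduced in Section 3. Here queryBox, Map, Input, Output and Aggregate are placeholders standing for an application's range query, mapping function, input and output collections, and aggregation operation; they are not names from any particular application.

    foreach (p in queryBox) {
        Point[2] q = Map(p);               // map input coordinates to output coordinates
        Output[q].Aggregate(Input[p]);     // commutative, associative update of a reduction variable
    }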

3 Java Extensions for Data Intensive Computing

In this section we describe a dialect of Java that we have chosen for expressing data intensive computations. We have chosen Java as our base language for at least two important reasons:

* The targeted applications can be conveniently expressed as objects and operations on objects. Specifically, Fortran-like languages are not suitable for programming these applications.

* A number of projects are currently in progress for expressing parallel computations in Java and obtaining good performance on scientific applications [4, 7, 15, 16, 22, 21, 37, 54]. While our project includes several new optimization techniques not addressed by these projects, we also believe that leveraging the tools and techniques developed by these projects is critical for the success of our project.

Though we propose to use a dialect of Java as the source language for the compiler, the techniques we will be developing will be largely independent of Java and will also be applicable to suitable extensions of other languages, such as C or C++.


Interface Reducinterface {
  // Any object of any class implementing
  // this interface is a reduction variable
}

public class VMPixel {
  char[] colors;
  void Initialize() {
    colors[0] = 0;
    colors[1] = 0;
    colors[2] = 0;
  }
  void Accum(VMPixel Apixel, int avgf) {
    colors[0] += Apixel.colors[0]/avgf;
    colors[1] += Apixel.colors[1]/avgf;
    colors[2] += Apixel.colors[2]/avgf;
  }
}

public class VMPixelOut extends VMPixel implements Reducinterface;

public class VMScope {
  static int Xdimen = ... ;
  static int Ydimen = ... ;
  static Point[2] lowpoint = [0,0];
  static Point[2] hipoint = [Xdimen-1,Ydimen-1];
  static RectDomain[2] VMSlide = [lowpoint : hipoint];
  static VMPixel[2d] VScope = new VMPixel[VMSlide];

  public static void main(String[] args) {
    Point[2] lowend = [args[0],args[1]];
    Point[2] hiend = [args[2],args[3]];
    int subsamp = args[4];
    RectDomain[2] Outputdomain = [[0,0] : (hiend - lowend)/subsamp];
    VMPixelOut[2d] Output = new VMPixelOut[Outputdomain];
    RectDomain[2] querybox;
    Point[2] p;
    foreach (p in Outputdomain) {
      Output[p].Initialize();
    }
    querybox = [lowend : hiend];
    foreach (p in querybox) {
      Point[2] q = (p - lowend)/subsamp;
      Output[q].Accum(VScope[p], subsamp*subsamp);
    }
  }
}

Figure 1: Example Code

3.1 Data Parallel Constructs

We borrow two concepts from object-oriented parallel systems like Titanium [54], HPC++ [5] and Concurrent Aggregates [13].

* Domains and Rectdomains are collections of objects of the same type. Rectdomains have a stricter definition, in the sense that each object belonging to such a collection has a coordinate associated with it that belongs to a pre-specified rectilinear section of the domain.

* The foreach loop, which iterates over objects in a domain or rectdomain, and has the property that the order of iterations does not influence the result of the associated computations. We further extend the semantics of foreach to include the possibility of updates to reduction variables, as we explain later.

We introduce a Java interface called Reducinterface. Any object of any class implementing this interface acts as a reduction variable [27]. The semantics of a reduction variable are analogous to those used in version 2.0 of High Performance Fortran (HPF-2) [27] and in HPC++ [5]. A reduction variable has the property that it can only be updated inside a foreach loop by a series of operations that are associative and commutative. Furthermore, the intermediate value of the reduction variable may not be used within the loop, except for self-updates. Though a number of recent projects have focused on automatic detection and parallelization of reduction operations [23, 24, 25, 45], we prefer that reductions be explicitly annotated by the programmer.

3.2 Example Code

Figure 1 outlines an example code with our chosen extensions. This code shows the essential computation in a virtual microscope application [1, 20]. A large digital image is stored on disks. This image can be thought of as a two dimensional array or collection of objects. Each element in this collection denotes a pixel in the image. Each pixel comprises three color values, which denote the colors at that point in the image.

The interactive user supplies two important pieces of information. The first is a bounding box within this two dimensional image, which specifies the area of the original image that the user is interested in viewing. We assume that the bounding box is rectangular, and can be specified by providing the x and y coordinates of two points. The first four arguments provided to main are integers which together specify the points lowend and hiend. The second piece of information provided by the user is the subsampling factor, an integer denoted by subsamp. The subsampling factor specifies the granularity at which the user is interested in viewing the image. A subsampling factor of 1 means that all pixels of the original image must be displayed. A subsampling factor of n means that n^2 pixels are averaged to compute each output pixel.

The computation in this kernel is very simple. A querybox is created using the specified points lowend and hiend. Each pixel in the original image which falls within the querybox is read and then used to increment the value of the corresponding output pixel.

There are several advantages associated with specifying the analysis and processing of multidimensional datasets in this fashion. The above model assumes a single processor and infinite memory. It also assumes that the data is available in arrays of object references, and is not on persistent storage. It is the responsibility of the compiler to locate individual elements of the arrays on disks. Performance enhancing transformations like tiling, loop interchange, etc. are also the responsibility of the compiler and are hidden from the programmer.

3.3 Restrictions on the Loops

The primary goal of our compiler will be to analyze and optimize (by performing both compile-time transformations and inserting calls to ADR runtime library routines) nested foreach loops that satisfy certain properties. We now elaborate on the restrictions on the loop nests that our compiler will target for optimizations. We require that no Java threads be spawned within such loop nests, and no memory locations read or written to inside the loop nests may be touched by another concurrent thread. Our compiler will also assume that no Java exceptions are raised in the loop nest and that the iterations of the loop can be reordered without changing any of the language semantics.

For parallelization and for carrying out query planning, the compiler needs to determine the set of objects that will be read and written to in the loop nest. The semantics of the foreach loop imply that object references in a particular iteration are independent of the computations performed in other iterations. The semantics of reduction objects further imply that updates to them do not influence what objects are read or written in the loop nest. With these two properties, the compiler can determine the set of iteration, input, and output objects that will be read and written in the loop. The analysis for doing this is similar to the analysis required in analyzing Fortran codes with irregular data access patterns for compiling to distributed memory machines [2, 31, 44, 53].

4 Compiler-Runtime Interface

In this section, we describe the compiler/runtime interface and explain how the compiler and runtime system can work jointly towards performing data intensive computations. Initially, we give some background information on the existing runtime system.

4.1 Overview of the Runtime System

We give a quick overview of the runtime infrastructure, called the Active Data Repository (ADR) [8, 9, 10], that integrates storage, retrieval and processing of multi-dimensional datasets on a parallel machine. Processing of a data intensive data parallel loop is carried out by ADR in two phases: loop planning and loop execution. The objective of loop planning is to determine a schedule to efficiently process a range query based on the amount of available resources in the parallel machine. A loop plan specifies how parts of the final output are computed. The loop execution service manages all the resources in the system and carries out the loop plan generated by the loop planning service.

The primary feature of the loop execution service is its ability to integrate data retrieval and processing for a wide variety of applications. This is achieved by pushing processing operations into the storage manager and allowing processing operations to access the buffer used to hold data arriving from disk. As a result, the system avoids one or more levels of copying that would be needed in a layered architecture where the storage manager and the processing belong to different layers.

A dataset in ADR is partitioned into a set of chunks to achieve high bandwidth data retrieval. A chunk is usually the size of a disk block, or a small number of disk blocks. A chunk consists of one or more objects, and is the unit of I/O and communication. In particular, choosing this unit to be a disk block or a small set of consecutive disk blocks allows very efficient disk access. The processing of a loop on a processor progresses through the following three phases: (1) Initialization: output chunks (possibly replicated on all processors) are allocated space in memory and initialized; (2) Local Reduction: input data chunks on the local disks of each processor are retrieved and aggregated into the output chunks; (3) Global Combine: if necessary, results computed on each processor in phase 2 are combined across all processors to compute final results for the output chunks.

ADR runtime support has been developed as a set of modular services implemented in C++. ADR allows customization for application specific processing (i.e., mapping and aggregation functions), while leveraging the commonalities between the applications to provide support for common operations such as memory management, data retrieval, and scheduling of processing across a parallel machine. Customization in ADR is currently achieved through C++ class inheritance. That is, for each of the customizable services, ADR provides a set of C++ base classes with virtual functions that are expected to be implemented by derived classes. Adding an application-specific entry into a modular service requires the definition of a class derived from an ADR base class for that service and providing the appropriate implementations of the virtual functions. Current examples of data intensive applications implemented with ADR include Titan [11, 12, 47], for satellite data processing, the Virtual Microscope [1, 20], for visualization and analysis of microscopy data, and the coupling of multiple simulations for water contamination studies [33].
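As a rough illustration of this customization model, the fragment below sketches a Java analogue of an ADR-style base class and one application-specific derivation for the virtual microscope example of Figure 1. The class and method names here are invented for illustration, and the actual ADR services are C++ classes with their own interfaces, so this is only an analogy to the real API; it also assumes ordinary Java renderings of the VMPixel class of Figure 1.

    // Hypothetical Java analogue of ADR-style customization through inheritance.
    abstract class AggregationBase {
        // Map an input element's coordinates to the output element it contributes to.
        abstract int[] map(int[] inputCoords);
        // Initialize an output element before any contributions are folded in.
        abstract void initialize(VMPixel out);
        // Fold one input element into the current value of an output element.
        abstract void aggregate(VMPixel out, VMPixel in);
    }

    // An application-specific entry is added by deriving from the base class and
    // implementing its abstract methods, mirroring ADR's virtual-function scheme.
    class SubsampleAggregation extends AggregationBase {
        private final int[] lowend;
        private final int subsamp;

        SubsampleAggregation(int[] lowend, int subsamp) {
            this.lowend = lowend;
            this.subsamp = subsamp;
        }

        int[] map(int[] p) {
            return new int[] { (p[0] - lowend[0]) / subsamp,
                               (p[1] - lowend[1]) / subsamp };
        }

        void initialize(VMPixel out) {
            out.Initialize();
        }

        void aggregate(VMPixel out, VMPixel in) {
            out.Accum(in, subsamp * subsamp);
        }
    }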

4.2 Compiler and Runtime Interface for Loop Processing

We now describe how the ADR internals use the various functions provided through customization to generate code for the runtime system. Customization can be performed by hand or can be generated directly by the compiler. For the discussion in this subsection, we assume that the foreach loop iterates over the elements of the input array, i.e., the input elements are accessed using the loop index. A function of the loop index is used to access the output collection in any given iteration. As we mentioned previously, we are specifically targeting applications in which each input element is mapped to an output element and its value is used to increment the value of the output element. Therefore, while our compiler is being designed to generate correct code for all possible foreach loops as described earlier, it focuses on optimizing performance for such reduction type loops.

We, therefore, introduce the concept of a canonical loop. Our canonical loop has the following property: the loop processes at most two collections of objects, at most one of which is modified in the loop. The collection of objects whose elements are modified is referred to as the left hand side or lhs collection, and the other collection referred to in the loop is considered the right hand side or rhs collection. Loops which do not have this property can be replaced by a series of loops which do have this property by using the technique of loop fission [52]. Two examples of such transformations are shown in Figure 2. Clearly, if a single loop is replaced by several consecutive loops, the resulting locality may be poor. However, we do not find this to be a problem for the applications we target.


(a)

foreach (p in box) {
  A[f1(p)] = C[p]
  B[f2(p)] = C[p]
}

        ↓

foreach (p in box) {
  A[f1(p)] = C[p]
}
foreach (p in box) {
  B[f2(p)] = C[p]
}

(b)

foreach (p in box) {
  A[f(p)] += B[p] + B[g(p)]
}

        ↓

foreach (p in box) {
  A[f(p)] += B[p]
}
foreach (p in g^-1(box)) {
  A[f(g^-1(p))] += B[p]
}

Figure 2: Examples of Transformations of Loops to Canonical Forms

4.2.1 Initial Processing of the Input

The first piece of information to be given to the runtime system is the name of the rhs collection of objects over which we iterate in the loop, and the domain of the objects over which processing is done. We denote this by the function R, which stands for the range of the processing. For the purpose of our discussion, we denote the rhs collection of objects for this loop by A and the domain which is analyzed in this loop by I. The system stores information about how the input collection of objects A is stored across disks. The compiler inserts calls to appropriate functions from the runtime system to analyze the meta-data about A and the input domain I, and compute the list of disk blocks of A that intersect with the domain I and, therefore, are accessed in this loop. It is important to note that if a disk block is included in this list, it is not necessarily the case that all elements in this disk block are accessed during the loop. However, for the initial query planning phase, we focus on the list of disk blocks.

We assume a model of parallel data intensive computation in which a set of disks is associated with each node of the parallel machine. This is consistent with systems like the IBM SP-2/3 and clusters of workstations. Let the set P = {p_1, p_2, ..., p_n} denote the list of processors in the system, and let the sets D_i = {d_1i, d_2i, ..., d_mi}, i = 1, ..., n, denote the list of disks associated with processor i. Then, the information computed by the runtime system after analyzing the domain I and the meta-data about the collection A is

    for all i, j:   B_ij = { b | b is resident on disk d_ij and b intersects with I }.

Further, for each disk block b_ijk belonging to the set B_ij, we compute the domain ID(b_ijk), which stands for Input Domain and denotes the subset of the original domain I which is resident on the disk block b_ijk. Clearly,

    the union of ID(b_ijk) over all i, j, k  =  I.
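A minimal sketch of this planning step is shown below, written in plain Java with invented class names (Box, DiskBlock, QueryPlanner); the actual computation is carried out inside the ADR planning service, so this only illustrates the information the compiler must supply. The same hypothetical types are reused by the later fragments in this section.

    import java.util.ArrayList;
    import java.util.List;

    // A rectangular region of the underlying two-dimensional attribute space.
    class Box {
        int xlo, ylo, xhi, yhi;
        Box(int xlo, int ylo, int xhi, int yhi) {
            this.xlo = xlo; this.ylo = ylo; this.xhi = xhi; this.yhi = yhi;
        }
        boolean intersects(Box o) {
            return xlo <= o.xhi && o.xlo <= xhi && ylo <= o.yhi && o.ylo <= yhi;
        }
        // Intersection of two boxes; only meaningful when intersects() is true.
        Box intersect(Box o) {
            return new Box(Math.max(xlo, o.xlo), Math.max(ylo, o.ylo),
                           Math.min(xhi, o.xhi), Math.min(yhi, o.yhi));
        }
    }

    // Meta-data kept for one disk block: where it lives and what it covers.
    class DiskBlock {
        int processor, disk, blockId;
        Box extent;           // bounding box of the elements stored in the block
        Box inputDomain;      // ID(b): the part of the query domain on this block
    }

    class QueryPlanner {
        // For every disk block of the input collection, test whether it intersects
        // the query domain I, and if so record ID(b) = extent intersected with I.
        static List<DiskBlock> blocksTouchedBy(List<DiskBlock> allBlocks, Box I) {
            List<DiskBlock> touched = new ArrayList<>();
            for (DiskBlock b : allBlocks) {
                if (b.extent.intersects(I)) {
                    b.inputDomain = b.extent.intersect(I);
                    touched.add(b);
                }
            }
            return touched;
        }
    }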

4.2.2 Work Partitioning

One of the issues in processing any loop in parallel is work or iteration partitioning, i.e., deciding which iterations are executed on which processor.

As we mentioned previously, in our target applications the size of the output data is much smaller than the size of the input data. Under the assumed characteristics of the target loops, the work distribution policy we use is that each iteration is performed on the owner of the element read in that iteration. This policy is the opposite of the owner-computes policy [28] commonly used in distributed memory compilers, in which the owner of the lhs element works on the iteration. Since only reduction computations are performed and the size of the output created is much smaller than the size of the input (which is disk resident), we prefer to avoid any communication before the start of the loop. Only the reduction operations over the temporary collections of objects created on each processor need to be performed. Note that the restrictions we have placed on the nature of the loops may require replacing an initial loop by a sequence of canonical loops, which may also increase the net communication between processors. However, we do not consider this to be a problem for the set of applications we target.
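Because each iteration reads exactly one input element, and input elements are grouped into disk blocks with known locations, this policy amounts to grouping the blocks selected during planning by the processor whose disks hold them. A sketch, reusing the hypothetical DiskBlock type from the planning fragment above:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    class WorkPartitioner {
        // Each iteration is executed by the owner of the element it reads, so the
        // blocks selected by the planner are simply grouped by the processor whose
        // local disks hold them; no input data moves before the loop starts.
        static Map<Integer, List<DiskBlock>> partitionByOwner(List<DiskBlock> touched) {
            Map<Integer, List<DiskBlock>> perProcessor = new HashMap<>();
            for (DiskBlock b : touched) {
                perProcessor.computeIfAbsent(b.processor, k -> new ArrayList<>()).add(b);
            }
            return perProcessor;
        }
    }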

4.2.3 Allocating Output Buffers and Stripmining

The distribution of the rhs collection is decided before performing the processing, and we have decided to partition the iterations accordingly. We now need to allocate buffers to accumulate the local contributions to the final lhs objects. Our focus is on reduction operations in which the size of the output is much smaller than the size of the input. Therefore, we logically replicate the output space on each processor and collect local contributions to each output element on each processor. The output elements also need to be appropriately initialized on each processor.

The memory requirements of the replicated output space are typically higher than the available memory on each processor. Therefore, we need to divide the replicated output buffer into chunks that are allocatable in each processor's main memory. This is the same issue as stripmining or tiling used for improving cache performance. We have so far used only a very simple stripmining strategy. We query the runtime system to determine the available memory that can be allocated on a given processor. Then, we divide the lhs space into blocks of that size. Formally, we divide the lhs domain O into a set of s smaller domains (called strips) {O_1, O_2, ..., O_s}. In performing the computations on each processor, we iterate over the set of strips, allocating a strip and computing local contributions to it before allocating the next strip. To facilitate this, we compute the set of rhs disk blocks that will contribute to each strip of the lhs.
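The following sketch shows one way such a strategy could be written down, again in plain Java using the hypothetical Box type introduced earlier; splitting the output domain along one dimension into groups of rows is an assumption made here for simplicity, not a detail taken from ADR.

    import java.util.ArrayList;
    import java.util.List;

    class StripMiner {
        // Divide the lhs (output) domain O into strips O_1, ..., O_s such that the
        // replicated buffer for one strip fits within the memory reported as
        // available on a processor. bytesPerElement is the size of one accumulator.
        static List<Box> stripmine(Box O, long availableBytes, int bytesPerElement) {
            long width = O.xhi - O.xlo + 1;
            long rowBytes = width * (long) bytesPerElement;
            // Number of output rows that fit in memory at once (at least one).
            int rowsPerStrip = (int) Math.max(1, availableBytes / rowBytes);
            List<Box> strips = new ArrayList<>();
            for (int y = O.ylo; y <= O.yhi; y += rowsPerStrip) {
                strips.add(new Box(O.xlo, y, O.xhi, Math.min(y + rowsPerStrip - 1, O.yhi)));
            }
            return strips;
        }
    }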

4.2.4 Mapping Input to the Output

One key piece of information that the compiler needs to pass to the runtime system is the subscripting function, which denotes the lhs element that is updated in a particular iteration of the loop. We denote this function by S. We now discuss how this information is used by the runtime query planning routine.

The first use of S is for computing the set of rhs disk blocks that will contribute to each strip of the lhs, as indicated above. To do this, we apply the subscripting function to each disk block's input domain ID(b_ijk) to obtain the corresponding domain in the lhs region. The output domains that each disk block can contribute towards are denoted OD(b_ijk). If ID(b_ijk) is a rectangular domain and the subscripting function is monotonic, OD(b_ijk) will be a rectangular domain and can easily be computed by applying the subscripting function to the two extreme corners. If this is not the case, the subscripting function needs to be applied to each element of ID(b_ijk), and the resulting OD(b_ijk) will just be a domain and not a rectangular domain. Formally, we compute the sets L_il, for each processor i and each output strip l, such that

    L_il = { b_ijk | OD(b_ijk) ∩ O_l ≠ ∅ }.

The subscripting function S is then used during the execution phase for its original purpose.
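The corner-based case might look roughly as follows (a sketch using the hypothetical Box and DiskBlock types from the earlier fragments; the general, non-monotonic case would have to enumerate every element of ID(b) instead). For the virtual microscope loop of Figure 1, S(p) = (p - lowend)/subsamp, which is monotonic, so the corner shortcut applies.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.function.Function;

    class OutputMapper {
        // For a monotonic subscripting function S, the output domain OD(b) of a
        // rectangular input domain ID(b) is obtained by applying S to the two
        // extreme corners of ID(b).
        static Box outputDomain(Box id, Function<int[], int[]> S) {
            int[] lo = S.apply(new int[] { id.xlo, id.ylo });
            int[] hi = S.apply(new int[] { id.xhi, id.yhi });
            return new Box(lo[0], lo[1], hi[0], hi[1]);
        }

        // The blocks whose output domains intersect a given output strip O_l;
        // computed per processor, this yields the set L_il.
        static List<DiskBlock> blocksForStrip(List<DiskBlock> touched, Box strip,
                                              Function<int[], int[]> S) {
            List<DiskBlock> contributing = new ArrayList<>();
            for (DiskBlock b : touched) {
                if (outputDomain(b.inputDomain, S).intersects(strip)) {
                    contributing.add(b);
                }
            }
            return contributing;
        }
    }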


For each output strip O_l
    Execute on each processor P_i:
        Allocate and initialize the output strip in main memory
        Foreach input block b_ijk in the set L_il
            Read the block from disk j
            Foreach element e of ID(b_ijk)
                Apply the accumulation function A to S(e)
        Global reduction to finalize the values for O_l

Figure 3: Loop Execution on Each Processor

4.2.5 Actual Execution

The computation of the sets L_il marks the end of the query planning phase of the runtime system. Using this information, the actual computations are now performed on each processor. The structure of the computation is shown in Figure 3. The only other information required for the execution of this loop is the accumulation function A, which takes one element of the input collection and increments the value of an output element. In practice, the computation associated with each rhs disk block and the retrieval of disk blocks are overlapped by using asynchronous I/O operations.

5 Compiler Analysis

In the previous section, we described what kind of information the compiler can provide to the runtime system for executing data parallel loops. In this section, we describe how the compiler extracts this information from the source code. We are primarily concerned with extracting three functions: the range function R, the subscripting function S and the aggregation function A. Similar information is often extracted by various data-parallel Fortran compilers. One main difference in our work is that we are working with an object-oriented language (Java), which is significantly more difficult to analyze. This is for two main reasons:

* Unlike Fortran, Java has a number of language features like object references, polymorphism, exceptions, and aliases, which make program analysis harder.

* The object-oriented programming methodology frequently leads to small procedures and frequent procedure calls. As a result, analysis across multiple procedures may be required in order to extract the range, subscripting and aggregation functions.

We now show how the notion of program slicing can be used for extracting these three functions. Initially, we give background information on program slicing and mention how program slicing can be performed across procedure boundaries, and in the presence of language features like polymorphism, aliases, and exceptions.

5.1 Background: Program Slicing

The basic definition of a program slice is as follows. Given a slicing criterion (p, x), where p is a point in the program and x is a variable, the program slice is the subset of statements in the program such that these statements, when executed on any input, will produce the same value of the variable x at the program point p as the original program. Note that in the standard slicing literature, this is the definition of executable, backward and static slices [50]. Other definitions have also been used.

The basic idea behind any algorithm for computing program slices is as follows. Starting from the statement p in the program, we trace any statements on which p is data or control dependent. The same is repeated from any statement which has already been included in the slice, until no more statements can be added to the slice.

The notion of program slicing was initially given by Mark Weiser [51]. Slicing has been very frequently used in software development environments, for debugging, program analysis, merging two different versions of code, and software maintenance and testing. A number of techniques have been presented for accurate program slicing across procedure boundaries [29, 46]. Since object-oriented languages have been frequently used for developing production level software, significant attention has been paid to developing slicing techniques in the presence of object-oriented features like object references, polymorphism, and more recently, Java features like threads and exceptions. Harrold et al. and Tonella et al. have particularly focused on slicing in the presence of polymorphism, object references, and exceptions [26, 35, 36, 41]. Slicing in the presence of aliases and reference types has also been addressed [3, 19].
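The dependence-tracing step described above is essentially a backward reachability computation over a program dependence graph. A minimal sketch in plain Java is given below; the graph representation and the statement numbering are invented for illustration, the criterion variable is assumed to be folded into the dependence edges of the criterion statement, and interprocedural, alias-aware slicing as in [29, 46] requires considerably more machinery than this.

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    class Slicer {
        // deps maps each statement id to the statements it is data or control
        // dependent on. The slice for criterion statement p is the set of
        // statements reachable by repeatedly following these dependences.
        static Set<Integer> backwardSlice(Map<Integer, List<Integer>> deps, int p) {
            Set<Integer> slice = new HashSet<>();
            Deque<Integer> worklist = new ArrayDeque<>();
            slice.add(p);
            worklist.push(p);
            while (!worklist.isEmpty()) {
                int stmt = worklist.pop();
                for (int d : deps.getOrDefault(stmt, List.of())) {
                    if (slice.add(d)) {       // newly added to the slice
                        worklist.push(d);
                    }
                }
            }
            return slice;
        }
    }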

5.2 Extracting Range Function

The first important piece of information that the compiler needs to provide is the range function R. We need to determine the rhs collection of objects and the domain over which the loop iterates. The rhs collection of objects can be computed easily by inspecting the assignment statements inside the loop and in any functions called inside the loop. We have assumed that at most two different collections of objects are touched in our canonical loop (otherwise we do some preprocessing to reduce the loop into a sequence of canonical loops). Of these two, the one which is never modified is considered the rhs collection. For computing the domain, we inspect the foreach loop and look at the domain over which the loop iterates. Then, we compute a slice of the program using the entry of the loop as the program point and the domain as the variable.
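For the code of Figure 1, for example, the rhs collection is VScope and the slicing criterion is the value of querybox at the entry of the second foreach loop; under the technique just described, the extracted slice would consist of the statements that compute querybox:

    Point[2] lowend = [args[0],args[1]];
    Point[2] hiend = [args[2],args[3]];
    querybox = [lowend : hiend];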

5.3 Extracting Subscripting Function

The subscripting function S is particularly important for the runtime system, as it determines the size of the lhs collection written in the loop and the rhs disk blocks which contribute to each strip of the lhs collection. This function can be extracted using slicing as follows. Consider the statement in the loop which modifies the lhs collection. We focus on the variable or expression used to access an object in the lhs collection. The slicing criterion we choose is the value of this variable or expression at the beginning of the statement where the lhs collection is modified. Typically, the loop index will be included in such a slice. Suppose the loop index is p. After first encountering p in the slice, we do not follow data dependences for p any further.
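In Figure 1, the lhs collection Output is accessed through the expression q in the statement Output[q].Accum(VScope[p], subsamp*subsamp), so the criterion is the value of q at that statement. Cutting the slice off at the loop index p, as described above, would yield roughly the following subscripting function:

    Point[2] lowend = [args[0],args[1]];
    int subsamp = args[4];
    Point[2] q = (p - lowend)/subsamp;      // S(p) = (p - lowend)/subsamp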

5.4 Extracting Aggregation Function

Extracting the aggregation function is straightforward: we simply extract the body of the loop being analyzed.
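For the second foreach loop of Figure 1, for instance, the body extracted as the aggregation function is:

    Point[2] q = (p - lowend)/subsamp;
    Output[q].Accum(VScope[p], subsamp*subsamp);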

6 Related Work

Several runtime support libraries and file systems have been developed to support efficient I/O in a parallel environment [17, 18, 32, 49]. These systems mainly focus on supporting regular strided access to uniformly distributed datasets, such as images, maps, and dense multi-dimensional arrays. They also usually provide a collective I/O interface, in which all processing nodes cooperate to make a single large I/O request. Our work is different in two important ways. First, we are supporting a much higher level of programming by involving a compiler. Second, our target runtime system, ADR, also differs from these systems in several ways. First, ADR is able to carry out range queries directed at irregular spatially indexed datasets. Second, the computation is an integral part of the ADR framework; with the collective I/O interfaces provided by many parallel I/O systems, data processing usually cannot begin until the entire collective I/O operation completes. Third, data placement algorithms optimized for range queries are also integrated as part of the ADR framework.

Our work on providing high-level support for data intensive computing can be considered as developing an out-of-core Java compiler. Compiler optimizations for improving I/O accesses have been considered by several projects. The PASSION project at Northwestern University has considered several different optimizations for improving locality in out-of-core applications [6, 30], including loop transformations. Some of these optimizations have also been implemented as part of the Fortran D compilation system's support for out-of-core applications [42]. Mowry et al. have shown how a compiler can generate prefetching hints for improving the performance of a virtual memory system [39]. These projects have concentrated on relatively simple stencil computations written in Fortran. Besides the use of Java, our work is significantly different in the complexity of the applications we focus on. Our applications can have data with complex distributions across processors and disks, only limited information about their access patterns may be known at compile-time, and their complex loop structures may require interprocedural transformations.

Several recent projects have explored the use of Java for numerical and high-performance computing [37, 38]. With careful hand-tuning and/or compiler optimizations, they have demonstrated performance up to 80% of that of C and Fortran. Many researchers have developed aggressive optimization techniques for Java, targeted at parallel and scientific computations. javar and javab are compilation systems targeting parallel computing using Java [4]. Data parallel extensions to Java have been considered by at least two other projects: Titanium [54] and HPJava [7]. Loop transformations and techniques for removing redundant array bounds checking have been developed [16, 37]. The Java Grande Forum and Sun Microsystems have proposed modifications to the Java language to allow better numerical precision and enable more aggressive compiler optimizations [22, 21]. Our compilation effort builds on the techniques developed by such projects for enabling good numerical performance from Java. However, our effort is also unique in considering persistent storage, complex distributions of data on processors and disks, interprocedural transformations, and the use of a sophisticated runtime system for optimizing resources.

Several research projects have focused on parallelizing irregular applications, such as computational fluid dynamics codes on irregular meshes and sparse matrix computations. This research has demonstrated that by developing runtime libraries, and compiler analyses that can place calls to these libraries, such irregular codes can be compiled for efficient execution [2, 31, 34, 53]. Our project is related to these efforts in the sense that our compiler will also be inserting calls to runtime routines. However, our project is also significantly different: the data distributions we need to handle are significantly more complex, the language we handle can have aliases and object references, the applications involve disk accesses and persistent storage, and the runtime system we need to interface to works very differently.

7 Conclusions

In this paper, we have addressed the problem of expressing data intensive computations in a high-level language and then compiling such codes to efficiently manage data storage, retrieval and processing on a parallel machine. We have developed data-parallel extensions to Java for expressing this important class of applications. Using our extensions, programmers can specify the computations assuming that there is a single processor and infinite flat memory. Our compiler uses an existing runtime support system, the Active Data Repository (ADR), for optimizing resource usage. We have described how the compiler can use the technique of interprocedural static program slicing to extract relevant information from data parallel Java loops for the runtime system. The implementation of our compiler is currently underway, using the Titanium infrastructure from Berkeley.


References

[1] Asmara Afework, Michael D. Beynon, Fabian Bustamante, Angelo Demarzo, Renato Ferreira, Robert Miller, Mark Silberman, Joel Saltz, Alan Sussman, and Hubert Tsang. Digital dynamic telepathology - the Virtual Microscope. In Proceedings of the 1998 AMIA Annual Fall Symposium. American Medical Informatics Association, November 1998.
[2] Gagan Agrawal and Joel Saltz. Interprocedural compilation of irregular applications for distributed memory machines. In Proceedings Supercomputing '95. IEEE Computer Society Press, December 1995.
[3] H. Agrawal, R. A. DeMillo, and E. H. Spafford. Dynamic slicing in the presence of unconstrained pointers. In Proceedings of the ACM Fourth Symposium on Testing, Analysis and Verification (TAV4), pages 60-73, 1991.
[4] A. Bik, J. Villacis, and D. Gannon. javar: A prototype Java restructuring compiler. Concurrency Practice and Experience, 9(11):1181-91, November 1997.
[5] Francois Bodin, Peter Beckman, Dennis Gannon, Srinivas Narayana, and Shelby X. Yang. Distributed pC++: Basic ideas for an object parallel language. Scientific Programming, 2(3), Fall 1993.
[6] R. Bordawekar, A. Choudhary, K. Kennedy, C. Koelbel, and M. Paleczny. A model and compilation strategy for out-of-core data parallel programs. In Proceedings of the Fifth ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming (PPOPP), pages 1-10. ACM Press, July 1995. ACM SIGPLAN Notices, Vol. 30, No. 8.
[7] Bryan Carpenter, Guansong Zhang, Geoffrey Fox, Yuhong Wen, and Xinying Li. HPJava: Data-parallel extensions to Java. Available from http://www.npac.syr.edu/projects/pcrc/July97/doc.html, December 1997.
[8] C. Chang, A. Acharya, A. Sussman, and J. Saltz. T2: A customizable parallel database for multi-dimensional data. ACM SIGMOD Record, 27(1):58-66, March 1998.
[9] Chialin Chang, Renato Ferreira, Alan Sussman, and Joel Saltz. Infrastructure for building parallel database systems for multi-dimensional data. In Proceedings of the Second Merged IPPS/SPDP (13th International Parallel Processing Symposium & 10th Symposium on Parallel and Distributed Processing). IEEE Computer Society Press, April 1999.
[10] Chialin Chang, Tahsin Kurc, Alan Sussman, and Joel Saltz. Query planning for range queries with user-defined aggregation on multi-dimensional scientific datasets. Technical Report CS-TR-3996 and UMIACS-TR-99-15, University of Maryland, Department of Computer Science and UMIACS, February 1999.
[11] Chialin Chang, Bongki Moon, Anurag Acharya, Carter Shock, Alan Sussman, and Joel Saltz. Titan: A high performance remote-sensing database. In Proceedings of the 1997 International Conference on Data Engineering, pages 375-384. IEEE Computer Society Press, April 1997.
[12] Chialin Chang, Alan Sussman, and Joel Saltz. Scheduling in a high performance remote-sensing data server. In Proceedings of the Eighth SIAM Conference on Parallel Processing for Scientific Computing. SIAM, March 1997.
[13] A.A. Chien and W.J. Dally. Concurrent aggregates (CA). In Proceedings of the Second ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming (PPOPP), pages 187-196. ACM Press, March 1990.
[14] Srinivas Chippada, Clint N. Dawson, Monica L. Martinez, and Mary F. Wheeler. A Godunov-type finite volume method for the system of shallow water equations. Computer Methods in Applied Mechanics and Engineering (to appear), 1997. Also a TICAM Report 96-57, University of Texas, Austin, TX 78712.
[15] B. Christiansen, P. Cappello, M. F. Ionescu, M. O. Neary, K. E. Schauser, and D. Wu. Javelin: Internet-based parallel computing using Java. In Proceedings of the 1997 ACM Workshop on Java for Science and Engineering Computation, June 1997.
[16] M. Cierniak and W. Li. Just-in-time optimizations for high-performance Java programs. Concurrency Practice and Experience, 9(11):1063-73, November 1997.
[17] Peter F. Corbett and Dror G. Feitelson. The Vesta parallel file system. ACM Transactions on Computer Systems, 14(3):225-264, August 1996.
[18] P. E. Crandall, R. A. Aydt, A. C. Chien, and D. A. Reed. Input/output characteristics of scalable parallel applications. In Proceedings Supercomputing '95, December 1995.
[19] M. Ernst. Practical fine-grained static slicing of optimized code. Technical Report MSR-TR-94-14, Microsoft Research, Redmond, WA, 1994.
[20] R. Ferreira, B. Moon, J. Humphries, A. Sussman, J. Saltz, R. Miller, and A. Demarzo. The Virtual Microscope. In Proceedings of the 1997 AMIA Annual Fall Symposium, pages 449-453. American Medical Informatics Association, Hanley and Belfus, Inc., October 1997. Also available as University of Maryland Technical Report CS-TR-3777 and UMIACS-TR-97-35.
[21] Java Grande Forum. Issues in numerical computing in Java. Document available at URL http://math.nist.gov/javanumerics/issues.html, March 1998.
[22] James Gosling. The evolution of numerical computing in Java. Document available at URL http://java.sun.com/people/jag/FP.html, March 1998.
[23] M. W. Hall, S. Amarsinghe, B. R. Murphy, S. Liao, and Monica Lam. Detecting Coarse-Grain Parallelism using an Interprocedural Parallelizing Compiler. In Proceedings Supercomputing '95, December 1995.
[24] Mary W. Hall, Brian R. Murphy, and Saman P. Amarasinghe. Interprocedural analysis for parallelization: Design and experience. In Proceedings of the Seventh SIAM Conference on Parallel Processing for Scientific Computing, pages 650-655. SIAM, February 1995.
[25] Hwansoo Han and Chau-Wen Tseng. Improving compiler and run-time support for irregular reductions. In Proceedings of the 11th Workshop on Languages and Compilers for Parallel Computing, August 1998.
[26] M. J. Harrold and Ning Ci. Reuse-driven interprocedural slicing. In Proceedings of the International Conference on Software Engineering, 1998.
[27] High Performance Fortran Forum. HPF language specification, version 2.0. Available from http://www.crpc.rice.edu/HPFF/versions/hpf2/files/hpf-v20.ps.gz, January 1997.
[28] Seema Hiranandani, Ken Kennedy, and Chau-Wen Tseng. Compiling Fortran D for MIMD distributed-memory machines. Communications of the ACM, 35(8):66-80, August 1992.
[29] S. Horwitz, T. Reps, and D. Binkley. Interprocedural slicing using dependence graphs. ACM Transactions on Programming Languages and Systems, 12(1):26-60, January 1990.
[30] M. Kandemir, J. Ramanujam, and A. Choudhary. Improving the performance of out-of-core computations. In Proceedings of the International Conference on Parallel Processing, August 1997.
[31] C. Koelbel and P. Mehrotra. Compiling global name-space parallel loops for distributed execution. IEEE Transactions on Parallel and Distributed Systems, 2(4):440-451, October 1991.
[32] David Kotz. Disk-directed I/O for MIMD multiprocessors. In Proceedings of the 1994 Symposium on Operating Systems Design and Implementation, pages 61-74. ACM Press, November 1994.
[33] Tahsin M. Kurc, Alan Sussman, and Joel Saltz. Coupling multiple simulations via a high performance customizable database system. In Proceedings of the Ninth SIAM Conference on Parallel Processing for Scientific Computing. SIAM, March 1999.
[34] Antonio Lain and Prithviraj Banerjee. Exploiting spatial regularity in irregular iterative applications. In Proceedings of the Ninth International Parallel Processing Symposium, pages 820-826. IEEE Computer Society Press, April 1995.
[35] Loren Larsen and Mary Jean Harrold. Slicing object-oriented software. In Proceedings of the International Conference on Software Engineering, 1996.
[36] Donglin Liang and Mary Jean Harrold. Slicing objects using system dependence graphs. In Proceedings of the International Conference on Software Maintenance, 1998.
[37] S. P. Midkiff, J. E. Moreira, and M. Snir. Optimizing bounds checking in Java programs. IBM Systems Journal, August 1998.
[38] Jose E. Moreira, Samuel P. Midkiff, Manish Gupta, and Richard D. Lawrence. Parallel data mining in Java. Technical Report RC 21326, IBM T. J. Watson Research Center, November 1998.
[39] Todd C. Mowry, Angela K. Demke, and Orran Krieger. Automatic compiler-inserted I/O prefetching for out-of-core applications. In Proceedings of the Second Symposium on Operating Systems Design and Implementation (OSDI '96), November 1996.
[40] NASA Goddard Distributed Active Archive Center (DAAC). Advanced Very High Resolution Radiometer Global Area Coverage (AVHRR GAC) data. http://daac.gsfc.nasa.gov/CAMPAIGN_DOCS/LAND_BIO/origins.html.
[41] P. Tonella, G. Antoniol, R. Fiutem, and E. Merlo. Flow-insensitive C++ pointers and polymorphism analysis and its application to slicing. In Proceedings of the International Conference on Software Engineering, 1997.
[42] M. Paleczny, K. Kennedy, and C. Koelbel. Compiler support for out-of-core arrays on parallel machines. In Proceedings of the Fifth Symposium on the Frontiers of Massively Parallel Computation, pages 110-118. IEEE Computer Society Press, February 1995.
[43] G. Patnaik, K. Kailasnath, and E.S. Oran. Effect of gravity on flame instabilities in premixed gases. AIAA Journal, 29(12):2141-8, December 1991.
[44] Ravi Ponnusamy, Yuan-Shin Hwang, Raja Das, Joel H. Saltz, Alok Choudhary, and Geoffrey Fox. Supporting irregular distributions using data-parallel languages. IEEE Parallel & Distributed Technology, 3(1):12-24, Spring 1995.
[45] L. Rauchwerger and D.A. Padua. The LRPD test: Speculative run-time parallelization of loops with privatization and reduction parallelization. IEEE Transactions on Parallel and Distributed Systems, 10(2):160-180, February 1999.
[46] T. Reps, S. Horwitz, M. Sagiv, and G. Rosay. Speeding up slicing. In Proceedings of the Conference on Foundations of Software Engineering, 1994.
[47] Carter T. Shock, Chialin Chang, Bongki Moon, Anurag Acharya, Larry Davis, Joel Saltz, and Alan Sussman. The design and evaluation of a high-performance earth science database. Parallel Computing, 24(1):65-90, January 1998.
[48] T. Tanaka. Configurations of the solar wind flow and magnetic field around the planets with no magnetic field: calculation by a new MHD. Journal of Geophysical Research, 98(A10):17251-62, October 1993.
[49] R. Thakur, A. Choudhary, R. Bordawekar, S. More, and S. Kutipudi. PASSION: Optimized I/O for parallel applications. IEEE Computer, 29(6):70-78, June 1996.
[50] Frank Tip, J.-D. Choi, John Field, and G. Ramalingam. Slicing class hierarchies in C++. In Proceedings of the Conference on Object-Oriented Programming Systems and Languages (OOPSLA), 1996.
[51] Mark Weiser. Program slicing. IEEE Transactions on Software Engineering, 10:352-357, 1984.
[52] Michael Wolfe. High Performance Compilers for Parallel Computing. Addison-Wesley, 1995.
[53] Janet Wu, Raja Das, Joel Saltz, Harry Berryman, and Seema Hiranandani. Distributed memory compiler design for sparse problems. IEEE Transactions on Computers, 44(6):737-753, June 1995.
[54] K. Yelick, L. Semenzato, G. Pike, C. Miyamoto, B. Liblit, A. Krishnamurthy, P. Hilfinger, S. Graham, D. Gay, P. Colella, and A. Aiken. Titanium: A high-performance Java dialect. Concurrency Practice and Experience, 9(11), November 1998.

14
