Parallel Programming

Concurrent Composition Using Loci

A new software framework called Loci provides an automatic system for generating the communication and synchronization required to correctly compose software modules. This results in a much cleaner software architecture and clear separation between numerical algorithms and distributed dataset management.

Designers of parallel high-performance software face unique software engineering challenges. Performance is crucial and can close some avenues available in designs that aren’t performance critical. For example, dynamic binding in inner loops can significantly degrade performance, so designers must use some object-oriented features with caution. However, the most significant challenges involve managing the parallel coordination of tasks and data. For data management, software architecture design usually revolves around determining which module is responsible for communicating key information: if a module accesses data, can we assume that the data it accesses on its processor is consistent with the state of other processors? If we assume another module has made the state consistent, then we’ve introduced an implicit coupling between modules. On the other hand, if the modules internally communicate needed data to guarantee consistency, frequent communication will degrade performance and could introduce the possibility of deadlock. Thus, performance dictates that distributed data synchronization should occur outside of a module, where we can aggregate the needs of many modules, amortize costs, and analyze implications. Unfortunately, concurrency management seems to come with a price: increased coupling and a loss of system cohesion.

Computing in Science & Engineering

These challenges represent the curse of concurrency, which appears in the design of any highly concurrent software system. In threaded applications, the curse arises in the design and placement of interlocking sections in the software architecture; in message-passing applications, it shifts to communication aggregation and management within the application. The design of these sections is critical not only to an application’s performance but also its correctness. In fact, concurrent systems are notoriously difficult to test precisely because failure can involve subtle timing relationships between software components. Thus, all concurrent applications, including those found in high-performance computing, have unique software engineering needs.

We designed a parallel software development framework, called Loci, to address such concerns by utilizing a programming paradigm that enables automated coordination of concurrent software components. Such automation improves parallel software architecture by providing a generic framework for the composition of numerical software components, effectively making parallel coordination transparent to software design goals.

1521-9615/09/$25.00 © 2009 IEEE. Copublished by the IEEE CS and the AIP.

Yang Zhang and Edward Luke, Mississippi State University

This article has been peer-reviewed.

Authorized licensed use limited to: UNIVERSIDADE FEDERAL DO RIO GRANDE DO SUL. Downloaded on May 09,2010 at 23:23:40 UTC from IEEE Xplore. Restrictions apply.

Available Approaches

In a typical threaded computation model, whether a program can deadlock or contains races isn’t computable in general; we can only detect at runtime when a failure might occur. Thus, any software architecture that avoids deadlock and races by design places judicious limits on the model. Most experienced designers either use a model that avoids these issues completely (such as Bulk Synchronous Parallel [1] or Software Transactional Memory [2]) or develop a toolkit with “rules of thumb” to avoid dangerous concurrent constructs.

The manual development approach, which uses a sequential language (often C or Fortran) with a library of parallel primitives such as the Message Passing Interface (MPI), is still common practice in parallel numerical application development. Developers must be familiar not only with parallel algorithm design but also with the quirks needed to tune performance for a particular architecture. OpenMP provides an alternative approach that uses compiler directives to alleviate some of the problems in concurrency management, but its shared-memory implementation precludes its use on the more common and cost-effective distributed-memory platforms.

Some specialized domains in computational science have toolkits that aim to hide complexity. Notable examples are the Overture framework [3] for overset grids and Cactus [4] for Cartesian adaptive mesh refinement solvers. The major drawback of such approaches is that they lack a clear and concise programming model. Instead, they make specific assumptions about solution approaches and often impose detailed component interaction requirements, which limits their area of applicability.

Researchers have long studied the functional programming model for parallel computation because of its clean semantics, and they’ve used languages such as SISAL [5] and NESL [6] in scientific computing with some success. However, these languages’ cost models are best suited to shared-memory architectures; implementations on distributed-memory platforms aren’t yet efficient or scalable enough. Modern efforts in parallel functional languages such as Haskell [7] have demonstrated some success on distributed-memory architectures, but unlike SISAL and NESL, they don’t appear to be specifically designed for numerical simulation, and their applicability to high-performance computing isn’t certain.

Our software framework, Loci, provides an automatic system for generating the communication and synchronization required to correctly compose software modules. Concurrency management within this system is deliberately restricted so that it maps onto a deadlock-free model; thus, all programs for which we automatically generate communication and synchronization constructs will be free of deadlocks and races. The basic approach (outlined in a recent paper [8]) employs two key technologies: a relational abstraction of the data used in computations, which helps automatically generate communication schedules, and a simple rule-based computation model that facilitates flexible composition of software components.

In this framework, users provide computational kernels in the form of rules that, through the use of relational operators, completely describe their input and output communication needs. Users design programs by providing an initial dataset and a query for sets of expected outcomes. The rule-based framework then assembles the software components, along with the required communication and synchronization operations, so that they correctly produce the requested outcome. Because the software components only document their own needs, such a software architecture removes the coupling between components implied by data consistency requirements: the framework automatically generates the required dependency information and component interaction. Although the programming model that Loci uses isn’t general, it’s concise, makes no assumption about particular numerical strategies, and is therefore well suited to a wide range of numerical applications.

Although the Loci framework is a novel approach, some of the technology it employs is similar to existing approaches described in the literature. Researchers have used relational algebra abstractions to describe optimizations for generic sparse matrix algorithms [9], and recent implementations of the Sierra framework [10] use a set-oriented approach to manage application entities such as processors, mechanics, mesh objects, and fields. Such approaches are similar to the relational abstractions that Loci uses to generate computation and communication schedules, but the Loci framework provides a greater level of automation and abstraction.
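The query-driven assembly described above can be pictured with a small C++ sketch. It is not the Loci scheduler: the `Rule` and `Planner` types and the rule names are hypothetical, each rule is reduced to the names of its inputs and its single output, and communication and synchronization steps are left out entirely. Starting from a queried variable, the planner chains backward through the rules until it reaches variables supplied as initial data, and records the rules in an order in which they could run.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical rule record: only dependency information, no kernel attached.
struct Rule {
    std::string name;
    std::vector<std::string> inputs;
    std::string output;
};

// Hypothetical planner that chains backward from a queried variable.
struct Planner {
    std::vector<Rule> rules;
    std::set<std::string> given;     // variables supplied as initial data
    std::set<std::string> resolved;  // variables produced by a scheduled rule
    std::set<std::string> active;    // variables on the current search path
    std::vector<std::string> order;  // scheduled rule names, in execution order

    // Returns true if `var` is given or can be produced by some chain of rules.
    bool resolve(const std::string& var) {
        if (given.count(var) || resolved.count(var)) return true;
        if (!active.insert(var).second) return false;  // cyclic dependency
        bool found = false;
        for (const Rule& r : rules) {
            if (r.output != var) continue;
            bool ok = true;
            for (const std::string& in : r.inputs) {
                if (!resolve(in)) { ok = false; break; }
            }
            if (ok) {
                // All inputs are already scheduled, so this rule can follow them.
                order.push_back(r.name);
                resolved.insert(var);
                found = true;
                break;
            }
        }
        active.erase(var);
        return found;
    }
};
```

With one rule producing `averageTemp` from `Neighbors` and `Temperature`, and a second producing `doubleAverageTemp` from `Neighbors` and `averageTemp`, a query for `doubleAverageTemp` schedules the first rule before the second regardless of the order in which the rules were registered.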

Rule-Based Programming

A relational data model provides a good abstraction for describing the distributed data access within a scientific program without explicit parallel notations. Imagine a problem of two-dimensional tiles as shown in Figure 1. Each tile has an associated temperature, and we want to calculate the average temperature of all the neighbors of each tile. A tile’s neighbors are its surrounding eight tiles, so tiles 0 to 3 and 5 to 8 are tile 4’s neighbors. In a conventional manual coding approach, we would select a physical data structure (an array, for example) and start coding and reasoning from that. The problem is that in a parallel setting, these tiles might be distributed among different processors, so the reasoning process inevitably involves parallel data movement and communication, which can become rather complicated.

    0  1  2
    3  4  5
    6  7  8

Figure 1. A simple tile problem. We want to calculate the average temperature of all the neighbors of each tile; in this example, tile 4’s neighbors are tiles 0 to 3 and 5 to 8.

Table 1 shows an alternative view. The tiles’ temperatures and neighbor data structures now become abstract tables with two columns. Consequently, the designer can use powerful relational data queries to succinctly describe intended data access patterns. This is a more abstract notation than programming with arrays (or other physical storage formats) because it lets developers reason naturally and separates their view from the actual physical implementation. If we distribute the initial tables properly, Loci can automatically bring in all data from a query operation without needing specific parallel instructions from developers.

Table 1. A relational model as a data access descriptor.

Temperature (T)

    id    Kelvin
    0     100
    1     105
    2     106
    3     99
    4     110
    ...   ...

Neighbors (N)

    id    idn
    ...   ...
    4     0
    4     1
    4     2
    4     3
    ...   ...

Select N.id, T.Kelvin From N, T Where N.idn = T.id

    id    Kelvin
    ...   ...
    4     100
    4     105
    4     106
    4     99
    ...   ...

A simple rule system is another cornerstone of the Loci framework. A good analogy is the familiar Makefile rule definition for the make compilation utility. A Makefile rule defines how to compile a source file by first specifying dependence information and then an action that performs the actual compilation. Similarly, a Loci rule consists of a rule head and tail that carry the dependence information, as well as a user-supplied C++ routine that performs the actual transformation from the rule antecedents. Also similar to the make utility, users can supply a target (similar to, say, make target) to the Loci system; Loci then assembles a complete execution according to the dependency information provided.

We implemented the Loci framework in C++ and provide a mechanism for storing any C++ class in the relational data structures (that is, Loci provides C++ types and APIs for relational table construction and manipulation, and for rule creation). Thus, Loci expands the programming paradigms available to the object-oriented design methodology. Because the C++ API for Loci can be verbose, we provide a preprocessor that converts a compact representation of Loci rules into C++ code. The input to the preprocessor is a C++ file augmented with special directives; for example, we would implement the average temperature problem described here as the code fragment in Figure 2.

    $rule pointwise(averageTemp<-Neighbors->Temperature) {
      int sz = $Neighbors.size() ;
      double sum = 0 ;
      for(int i=0;i<sz;++i)
        sum += $Neighbors[i]->$Temperature ;
      $averageTemp = sum/sz ;
    }

Figure 2. Loci average temperature code. This example computes an average value of tiles regardless of their distribution.

Here, the $ symbol activates the preprocessor, the $rule directive indicates that we’re about to define a Loci rule, and the pointwise keyword indicates that the computation will occur for all table entries point by point. In the parentheses, we describe the computation’s inputs and outputs, with all outputs preceding the “<-” symbol and all inputs following it. The expression Neighbors->Temperature describes that we’ll combine the tables Neighbors and Temperature such that the last column of Neighbors is equivalent to the first column of Temperature. The $ sign in the code before a variable name indicates access to a Loci-provided variable. Note that the system automatically builds loops for tables consisting of many rows. When many rows of data in a table correspond to the same item, Loci provides an internal representation that collapses those rows into a single row to facilitate rule coding. For example, the use of $Neighbors[i]->$Temperature in Figure 2 refers to the access of all rows associated with a particular entity.
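For readers more familiar with plain C++ than with relational queries, the following sequential sketch spells out what the query in Table 1 and the averaging rule compute together. It is an illustration only and does not use the Loci API: the type aliases and the function name `averageTemp` are assumptions, the tables are ordinary in-memory containers, and nothing here is distributed across processors.

```cpp
#include <map>
#include <utility>
#include <vector>

using Id = int;
using TemperatureTable = std::map<Id, double>;         // id -> Kelvin
using NeighborTable = std::vector<std::pair<Id, Id>>;  // rows of (id, idn)

// Join N and T on N.idn = T.id, then average the joined Kelvin values
// over all rows that share the same N.id.
std::map<Id, double> averageTemp(const NeighborTable& N,
                                 const TemperatureTable& T) {
    std::map<Id, std::pair<double, int>> acc;  // id -> (sum, row count)
    for (const auto& row : N) {
        auto it = T.find(row.second);
        if (it == T.end()) continue;  // neighbor without a temperature row
        acc[row.first].first += it->second;
        acc[row.first].second += 1;
    }
    std::map<Id, double> result;
    for (const auto& entry : acc) {
        result[entry.first] = entry.second.first / entry.second.second;
    }
    return result;
}
```

Using only the four neighbor rows listed in Table 1 for tile 4 (neighbors 0 to 3, with temperatures 100, 105, 106, and 99), the function returns 102.5 for tile 4. The full problem would also include neighbors 5 to 8, whose temperatures the table does not list.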

Automated Concurrency Management

An application is usually composed of many simpler computations. The tile averaging problem in Figure 1 is very similar to problems that occur in numerical computations, just as most numerical integration procedures become some form of weighted sum. However, composition is also key to developing flexible software because we typically build one operator from a composition of others; for example, we could view the Laplacian operator used to describe diffusion processes as the composition of a divergence operator and a gradient operator. This type of operation chaining is typical in numerical software, but it introduces complexities in the management of concurrent data and tasks.

Consider the implications of computing a double averaged temperature (an average of average temperatures). If we used the averaged temperatures computed as described in the previous section to compute the double averaged temperature, we would need extra communication to make sure the second average is computed correctly for all tiles after we distribute them among processors.

Note that the tiled temperature problem is an extremely simplified example. In practice, the tiles would be polytopes of mixed types (triangles, quadrilaterals, tetrahedra, hexahedra, pyramids, prisms, and arbitrary polytopes) connected by irregular neighbor relationships. Moreover, these tiles would be divided into regions in which different physical laws apply: in some regions the equations of solid mechanics would apply, in others the equations of fluid mechanics, and interface regions that act as boundary conditions would couple the different sets of equations. Typically, designers manage this by generating index subsets that describe where different equations will apply. In the concurrent setting, the situation is more complex because tiles are usually communicated to “clone” versions of themselves on remote processors; thus, for every index set, we must determine which entries the processor owns and which entries it will communicate to other processors.

Figure 3 shows the dependencies involved in computing the correct index sets and communication schedule for a composition of four rules. In the figure, rule 2 uses the results of rule 1 and rule 4 as its inputs, and rule 3 consumes the output from rule 2 to produce a final result. The figure shows all the components, and their ordering, that the designer must develop and arrange to achieve the intended computations defined in rules 1 through 4. First, the program must determine the index set for local computations (illustrated with a diamond-shaped box in Figure 3). Once the program computes a local index set, it must generate a communication schedule to synchronize the computed results (represented with a hexagon-shaped box in the figure). Only after it generates these schedules are we ready to perform the composed computation, which interleaves computations (ovals) with communication and synchronization (rectangular boxes).

[Figure 3 diagram: a dependency graph for rules 1 through 4 that links each rule’s index set calculator (index set), data communication pattern inferencer (data requests), kernel computation, and data synchronizer (synchronize results).]

Figure 3. Composition strategy dependencies. All dependencies and components involved in a simple parallel program must be composed first.

In a traditional parallel application, designers create the index set and communication schedules with knowledge of the expected communication and computational needs; for example, it might be possible to define a few key index sets that many computations can share. Communication steps can overcommunicate data to support many different compositions, but inherent in such designs are assumptions about the composed components, and any change in compositional structure can invalidate this analysis. The Loci framework automatically tunes both the index set and communication scheduling exactly to the rule composition so that software architects don’t have to worry about such issues. In this regard, Loci plays a role similar to garbage collection in sequential programming: automation of this process improves software architecture design by eliminating unessential coupling between the components that create data and those that consume it.

In a traditional parallel application, the architecture would provide the synchronization and index sets. To keep the reasoning involved in designing the parallel program tractable, the designer would need to make some assumptions about the composed operations, such as how many compositions would be supported. If later developments invalidated any of these assumptions, the infrastructure for computing index sets and communication steps would need to be redesigned. Thus, Loci makes it much simpler to compose computations in parallel high-performance software.
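To make the owned-versus-clone bookkeeping more concrete, here is a minimal C++ sketch that derives a communication schedule for a single rule from a tile-to-processor assignment and a neighbor table. It is not Loci code and makes simplifying assumptions: the names `buildSchedule`, `owner`, and `Schedule` are hypothetical, every tile has exactly one owner, and the multi-rule composition of Figure 3 is not modeled.

```cpp
#include <map>
#include <set>
#include <utility>
#include <vector>

using Id = int;
using Proc = int;

// schedule[(sender, receiver)] = ids that the sender owns and the receiver
// needs as clone copies before it can evaluate the rule locally.
using Schedule = std::map<std::pair<Proc, Proc>, std::set<Id>>;

// `owner` maps each tile id to the processor that owns it; `neighbors` holds
// (id, idn) rows. Assumes every id that appears in `neighbors` has an owner
// (otherwise std::map::at throws std::out_of_range).
Schedule buildSchedule(const std::map<Id, Proc>& owner,
                       const std::vector<std::pair<Id, Id>>& neighbors) {
    Schedule schedule;
    for (const auto& row : neighbors) {
        Proc consumer = owner.at(row.first);   // evaluates the rule for this tile
        Proc producer = owner.at(row.second);  // owns the input value
        if (consumer != producer) {
            schedule[{producer, consumer}].insert(row.second);
        }
    }
    return schedule;
}
```

If tiles 0 to 2 live on processor 0 and tiles 3 and 4 on processor 1, averaging the neighbors of tile 4 requires processor 0 to send tiles 0, 1, and 2 to processor 1, while tile 3 needs no communication. Changing the rule composition changes the neighbor rows that feed this computation, which is why a hand-written schedule is fragile.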

Modules and Application Development

A rule-based system augmented with relational data descriptors provides a basic mechanism for automatically managing concurrency within an application. However, rules in this system are simple primitives from which the application is developed. In practice, designers build functional modules that compose many rules to achieve a larger objective. In Loci, applications are typically built by composing modules, and each module consists of a library of rules stored in a dynamically loaded executable. When the Loci framework loads these modules, the rules become part of a bigger collection of rules used to satisfy user queries. These modules also encapsulate common functionality and are enabled by three additional features: namespaces, rule priorities, and parametric rules.

The namespace functionality lets a module decide which variables to share with other modules and which variables should be used internally. This lets designers build a clean, functional user interface while also minimizing problems with internal naming conflicts. The rule priority functionality lets the rules from one module override the operations defined by another set of rules. This facility lets modules not only add functionality but also replace existing functionality with more advanced or complete versions.

The parametric rule functionality allows for the development of a family of related computations and employs a simple functional programming model using parameter substitution. For example, the computation of averageTemp described earlier could only average the variable Temperature, but when parameterized as shown in Figure 4, the capability extends to any variable. In this parametric rule, a user-supplied variable replaces the substitution variable, X, such that given this rule, another rule could simply request average(Temperature) as an input to obtain the tile averaged temperature. Note that recursion applies naturally, so designers could request a computation of average(average(Temperature)) to compose the double average operation described in the previous section.

    $rule pointwise(average(X)<-Neighbors->X) {
      int sz = $Neighbors.size() ;
      double sum = 0 ;
      for(int i=0;i<sz;++i)
        sum += $Neighbors[i]->$X ;
      $average(X) = sum/sz ;
    }

Figure 4. Parametric tile averaging code example. This Loci code illustrates the use of parameter substitution to achieve the definition of generic computation.

In practice, we’ve found that the module facility has been extremely effective for managing simulation software. The module interface provides some obvious benefits, such as encapsulation and loose coupling, but we’ve also found that it provides a useful mechanism to deal with some of the more bureaucratic aspects that can arise in numerical simulation. For example, we can encapsulate simulation capabilities that have International Traffic in Arms Regulations (ITAR) sensitivities in independent modules that contractors outside the university environment develop and maintain. The module facilities, along with the rule-based programming model, let modules change almost any behavior of an existing application, giving modules the flexibility to completely change the simulation software’s character.
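The effect of parameter substitution can be imitated in ordinary C++ by passing the variable to be averaged as an argument. The sketch below is an analogy rather than a translation of the Loci rule: `Field`, `NeighborMap`, and this `average` function are hypothetical names, the data is held in sequential in-memory maps, and the sketch assumes every neighbor id has a value in the field being averaged.

```cpp
#include <map>
#include <vector>

using Id = int;
using Field = std::map<Id, double>;                 // any per-tile variable X
using NeighborMap = std::map<Id, std::vector<Id>>;  // tile -> neighbor ids

// average(X): for every tile that has neighbors, the mean of X over them.
// Throws std::out_of_range if a neighbor has no value in X.
Field average(const NeighborMap& neighbors, const Field& X) {
    Field out;
    for (const auto& entry : neighbors) {
        const std::vector<Id>& nbrs = entry.second;
        if (nbrs.empty()) continue;  // nothing to average for this tile
        double sum = 0;
        for (Id n : nbrs) sum += X.at(n);
        out[entry.first] = sum / nbrs.size();
    }
    return out;
}
```

Because the result has the same type as the input, the double average is just the nested call `average(neighbors, average(neighbors, Temperature))`, which mirrors the request average(average(Temperature)). In this sequential analogy nothing more is needed; in a distributed setting the inner result would first have to be synchronized across processors, and that synchronization is the step the framework generates.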

Notational Similarity

Another strategy that we’ve employed is the use of programming notation that resembles the mathematical notation found in numerical algorithm literature; for example, the use of the functional programming model for parametric rules mimics the way composition is frequently described in mathematical literature. A more obvious alignment with mathematical notation comes in the way we describe iterations. Researchers frequently use superscripts in numerical literature to describe operations on sequences of steps. For example, they usually describe the forward Euler time integration method as $u^{n+1} = u^n + \Delta t\,R^n$, where the superscript identifies that the next iteration of variable u is defined from the present iterations of variables u and R. In Loci, we use a similar notation for iteration; Figure 5 shows the partial implementation of a solver for the time-dependent heat equation. Above each rule in the figure is a comment that contains the mathematical notation for that part of the algorithm. The two notations are very similar: in Loci, the superscript is enclosed in braces following the variable name. Also note that the definition of the residual variable, R, uses the parametric rule to define the composition of the divergence and gradient operations. Perhaps more subtly, it also uses reasoning about iteration dependencies: because the forward Euler time step requests R at time step n, and u varies with the time step n, the last rule in Figure 5 will be evaluated at each iteration with u{n} as the input to the div(grad(u)) operator. This mimics how numerical algorithms are typically described, where we assume relationships between variables without superscripts are invariant to iteration. We employ the same notation in the Loci framework. The result of this notational similarity is a more straightforward mapping from the theoretical algorithm description to practical implementation.
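As a point of comparison with the iteration notation, the following C++ sketch writes out one forward Euler step for the heat equation by hand. It is a simplified stand-in, not the solver of Figure 5: it assumes a uniform one-dimensional grid, second-order central differences for div(grad(u)), and end values that are held fixed, and the function names and parameters (`residual`, `forwardEulerStep`, `nu`, `dx`, `dt`) are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// R{n} = nu * div(grad(u{n})) on a uniform 1D grid, using central differences.
// The two end points are treated as fixed boundary values, so their R is 0.
std::vector<double> residual(const std::vector<double>& u, double nu, double dx) {
    std::vector<double> R(u.size(), 0.0);
    for (std::size_t i = 1; i + 1 < u.size(); ++i) {
        R[i] = nu * (u[i - 1] - 2.0 * u[i] + u[i + 1]) / (dx * dx);
    }
    return R;
}

// u{n+1} = u{n} + dt * R{n}
std::vector<double> forwardEulerStep(const std::vector<double>& u,
                                     double nu, double dx, double dt) {
    const std::vector<double> R = residual(u, nu, dx);  // evaluated from u{n}
    std::vector<double> next(u);
    for (std::size_t i = 0; i < u.size(); ++i) {
        next[i] += dt * R[i];
    }
    return next;
}
```

Starting from u = {0, 1, 0} with nu = 1, dx = 1, and dt = 0.25, one step gives {0, 0.5, 0}. In the hand-written version the programmer must remember to recompute R from the current u at every step; in the rule-based notation that dependency is inferred from the iteration superscripts.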

Solving Practical Problems

We’ve already put the Loci framework into practice developing large, sophisticated simulation codes. The most mature of these is the Loci/CHEM solver, which models gases, liquids, droplets, radiation, heat conduction, combustion, and structural deformations. The NASA Marshall Space Flight Center (MSFC) uses the solver extensively in a production setting to simulate and design rocket systems; consequently, the solver (CHEM) has undergone extensive verification and validation. In production work, NASA engineers routinely execute the code on many hundreds of


    // Numerically solve the partial differential equation:
    //   ∂u/∂t = v ∇⋅∇u
    // Using Forward Euler time integration
    // Set Initial Value: u^{n=0} = u_initial
    $rule pointwise(u{n=0}