FINDING AND EXPLOITING PARALLELISM IN A PRODUCTION COMBUSTION SIMULATION PROGRAM

BY

GREGG MACLEAN SKINNER

B.A., Goshen College, 1990

THESIS

Submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science in the Graduate College of the University of Illinois at Urbana-Champaign, 1994

Urbana, Illinois

ABSTRACT

In pursuit of a systematic method for parallelizing large production FORTRAN codes, a parallel version of a combustion simulation was developed. The development was aided by an examination of the problem being solved, its mathematical formulation, and the computational methods employed. The ease with which the simulation was modified for parallel execution lends support to the hypothesis that the modular nature of production codes has important implications for parallelization. The success of the various analysis techniques and transformations lends insight into the design of automatic parallelization tools.


ACKNOWLEDGMENTS

I am indebted to a number of people for their assistance in producing this thesis. Rudi Eigenmann was instrumental in providing insight into the current state of restructuring compiler technology and the need for interprocedural analysis techniques. My analysis of Premix benefited from Paul Petersen's powerful tools for symbolic and runtime dependence analysis. Bill Blume sparked my interest in array section analysis with a prototype of his array section analysis tool. For an understanding of the chemical problem being solved by the Premixed Flame Code, I am indebted to Joe Grcar at Sandia National Laboratories and David Schneider. My sincere thanks go to David Schneider for hatching the ideas contained herein and pursuing them with patience and diligence.


TABLE OF CONTENTS

CHAPTER

1 INTRODUCTION
  1.1 Motivation and goals
  1.2 Outer loop parallelism
  1.3 Production versus research codes
  1.4 Production code modularity
      1.4.1 Implications for parallelization
  1.5 Premix: A production combustion chemistry code

2 EXPERIENCE AND RESULTS
  2.1 Overview of the original code
      2.1.1 Mathematical model
      2.1.2 Computational methods
      2.1.3 Parallelism inherent to the problem
      2.1.4 Specific implementation
      2.1.5 Parallelism inherent to the implementation
  2.2 Optimization
  2.3 Results
      2.3.1 Metric
      2.3.2 Recap

3 IMPLICATIONS AND SUMMARY
  3.1 Premix as a production code
      3.1.1 Generality
      3.1.2 Flexibility
      3.1.3 Robustness
      3.1.4 Extensibility
      3.1.5 Portability
      3.1.6 Implications
  3.2 Production code parallelizing essentials
  3.3 Automatability
  3.4 Conclusions

REFERENCES

APPENDIX
A GLOSSARY OF PROCEDURE NAMES
B NITROGEN CHEMISTRY IN COMBUSTION
  B.1 Chemical species
  B.2 Chemical reactions
C SAMPLE PROGRAM DRIVER
D LIBRARY INITIALIZATION ROUTINES

CHAPTER 1

INTRODUCTION

1.1 Motivation and goals

Developing a systematic method for parallelizing large production FORTRAN codes is an ambitious task. In this thesis we offer a preliminary effort. We have optimized a large production code for execution on a shared-memory multiprocessor, and we attempt to generalize our experience to suggest a number of issues that must be addressed in developing a systematic method for identifying and exploiting outer loop parallelism in large production codes. We explore the thesis that the constraints which must be applied in the design and implementation of large production codes written in standard FORTRAN 77 (ANSI X3.9-1978) imply that certain structures must exist in the code, and that by identifying and exploiting these structures we can greatly reduce and simplify the analysis required to produce a parallel version with large-scale granularity.

This chapter introduces several important distinctions between research and production codes. Chapter 2 contains a description of our experience optimizing the code. Application knowledge and program documentation played an important role in discovering and exploiting potential parallelism. Chapter 3 explores the implications our experience may have on directions for the evolution of compiler technology.


1.2 Outer loop parallelism

We distinguish between inner loop parallelism, which is parallelism found in the loops in the leaves of the calling tree, and outer loop parallelism, which typically appears deeper within the branches of the calling tree. Outer loops are defined as those containing significant internal structure, such as subroutine calls and serial inner loops, which tend to make parallelization difficult. An outer loop is not necessarily an outermost loop.

On machines that can exploit multiple levels of concurrency, inner and outer loop parallelism are complementary. Any parallelism exploited in outer loops simply multiplies the performance gains due to parallelism in the inner loops.

Inner loop parallelism (including vectorization) can usually be found automatically by today's compilers, using intraprocedural analysis with inline subroutine expansion. Many high performance libraries, such as Linpack [1], exploit inner loop parallelism. Some production codes can benefit directly from inner loop parallelism by replacing innermost serial loops with their vector or parallel counterparts. However, in many production codes inner loop parallelism is too fine-grained to yield appreciable speedup. The significant parallelism in the code we examine in this thesis is found in outer loops, as we will see in Chapter 2. For many production codes, optimization is successful only if parallelism can be discovered and exploited in outer loops. This typically corresponds to executing a sequence of library calls independently. Thus, the libraries must be designed or modified to allow concurrent invocation.
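To make the distinction concrete, consider the following sketch (ours, not from Premix; solve is a hypothetical library routine, and the cncall directive spelling is our assumption for an Alliant-style compiler). The first loop exhibits inner loop parallelism that intraprocedural analysis can find; the second exhibits outer loop parallelism that is visible only if concurrent calls to solve are known to be independent:

      c     inner loop parallelism: a leaf-level loop the compiler
      c     can vectorize or parallelize automatically
            do i = 1, n
               y(i) = y(i) + a*x(i)
            end do

      c     outer loop parallelism: each iteration invokes a library
      c     routine; parallel execution is safe only if the calls
      c     share no workspace or COMMON state
      cvd$  cncall
            do j = 1, m
               call solve(b(1,j), n)
            end do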

1.3 Production versus research codes

Schneider [2] notes several characteristic properties of production codes which favor better overall organization. He suggests that meaningful progress toward developing automated parallelization methodologies must be based on recognizing common structural motifs in these codes. He puts forth several provocative hypotheses with important implications for parallelization. A major purpose of this thesis is to discuss to what extent his hypotheses hold for a popular production combustion chemistry code.


We begin by clarifying the nature of a production code. Schneider suggests two major classes of scientific and engineering codes: tactical and strategic. The two classes differ strongly in the nature of the problem solved, leading to fundamental differences in the strategies for their implementation. Programs developed to solve strategic problems, often of considerable economic importance, we refer to as production codes. They are typically developed and maintained by a team of programmers over several years. As a result, much emphasis is placed on software engineering principles and good programming practice.

Production codes are distinguished from research codes, which are programs written to solve specific tactical problems not addressed by existing codes. Research codes typically solve a few specific problems and are subjected to few, if any, maintenance cycles. A research code is like a piece of laboratory equipment custom built to perform a specific task. It is created by a handful of contributors whose primary interest is in interpreting the results of the computation. Emphasis during development is on getting correct results fast, which is not necessarily the same as computing them efficiently. Code functionality which does not contribute directly to that goal is superfluous. Thus, little attention is devoted to issues of performance, elegance, maintenance, or generality. Because these tactical codes are targeted to a small number of specific problems, their functionality tends to be limited. They typically accept only a limited class of input data sets and are executed a relatively small number of times. Research codes tend to have short lifetimes, as their usefulness is tied to a small number of projects. Often development time is reduced by taking advantage of programming shortcuts. These tricks of the trade make research codes difficult to understand, maintain, and extend. The pejorative term dusty deck may be well suited to many members of this class of programs.

If a research code is a piece of customized laboratory equipment, then a production code is an entire apparatus. A single production code is used to solve thousands of different problems. Production codes accept a wide class of input data sets and are executed numerous times. With this increase in general functionality comes a significant increase in organization and size. Production codes are typically comprised of semi-autonomous libraries and modules, developed and tested independently.

They are quite large, consisting of tens or hundreds of thousands of lines. The goal of production code developers, who may have specific training in software engineering or extensive experience with large software projects, is to produce a stable, reliable, general-purpose program uniformly useful for all instances of a general class of problems. Typically a number of alternative algorithms are included for each phase of the computation. The input data may include instructions from a simple metalanguage which determines what portions of the code to execute [3,4]. The validity of the input data is usually verified, and mechanisms are included to handle errors and failures. Large production codes are designed with portability, maintenance, and extensibility in mind. Due to the increased difficulty and expense of developing these general-purpose codes, they are costlier and fewer in number. Many important production codes developed at universities and national laboratories have been commercialized by independent software vendors [4-6].

Schneider suggests that, for several reasons, research codes have often been mistaken by computer scientists as largely representative of real applications. Because research codes are produced mainly in university environments, they are more freely available to the academic community. A research code's size is appealing to computer science researchers, as it is typically larger than a single algorithm or kernel, but not so big as to appear unwieldy or require correspondence with the code's author. Production codes are often assumed to be simply larger versions of research codes, with the attendant multiplication of unwieldiness. This mistaken assumption may, in part, be responsible for a delay in the development of general parallelization strategies for large production codes.

The conventional wisdom is that large production FORTRAN codes are the dustiest of the dusty deck programs, having little structure and organization. Schneider believes there is good reason to reject this notion. The complexity of a large code suggests, even demands, a strong underlying organization. Complex codes could not be manageable, or even execute correctly, if they were not comprised of simpler components. To cope with the task of creating a large production code, developers must inevitably turn to simplifying methodologies, such as structured programming [7,8].


Present compiler research has been oriented toward discovering loop-level parallelism (including some outer loops) in tactical research codes. This is as it should be, for the number of research codes is large, and their shorter lengths and lifetimes do not diminish their importance. Clearly most FORTRAN compiler invocations are for this large class of codes. However, Schneider [2] suggests production codes, though considerably fewer in number, consume a majority of high performance CPU cycles. Thus, they should be of greater interest to machine designers and performance researchers. We emphasize the importance of focusing attention on general strategies to exploit parallelism in this class of codes.

A production code is defined as a program which adheres principally to the following objectives [2]:

Generality – A single code, or small set of codes implementing logically distinct phases of the computation, is capable of handling all aspects of the solution strategy, without modification.

Flexibility – The program provides alternative solution strategies and/or algorithms for individual computational phases. Alternatives are compatible in that they use the same input and output data structures. The program is able to handle problems of any reasonable size; no unnecessary limitations are placed on the shape or size of data structures.

Robustness – For any valid data set, the program executes correctly and efficiently. Input data is checked for validity, and error-checking routines are incorporated throughout the code to handle critical run-time errors such as running over array bounds. Errors in the input data are handled gracefully. Checkpointing mechanisms allow partial recovery from machine failures.

Extensibility – New functionality can be added without dramatic changes to the existing program, especially the data structures or the procedure parameter lists. Backward compatibility is an important concern.

Portability – Versions of a production code are generally produced for several different machines. To simplify this task, machine- and operating-system-specific code is localized. Nonstandard language extensions are minimized, or not used at all.

While these goals are restrictive, they represent the kind of rigor to which production code developers must subject themselves. We can expect to find strategic codes which adhere to these objectives consuming much of the CPU time on today's supercomputers.

1.4 Production code modularity

Because a large production code must be developed in pieces, a number of conceptual walls can be found within it, corresponding to the interfaces that connect distinct tasks. Schneider [2] notes that progress toward developing a parallelization methodology must be based on recognizing these common structural motifs. Modularity is perhaps the most important of these.

Consider as an example a finite element structural mechanics code (Figure 1.1) used to determine an object's natural vibrational modes and frequencies. There are several computationally distinct phases: (a) define the geometry of the object; (b) impose a mesh; (c) compute the mass and stiffness matrices; (d) compute the solution to a generalized eigenvalue problem; (e) process the results. Each phase has unique complexity and may employ differing algorithms and data structures. By recognizing the conceptually independent tasks, one can design, implement, and test each phase as a separate program or module.

Because it is often the case that no one method is best for every problem, a program which is flexible and general will include several different methods for advancing each computational phase. The user selects methods and algorithms by setting parameters in the input data, or, for more sophisticated codes, via a graphical user interface. A "driver" in each individual module will set up and initialize data structures, then invoke the routines corresponding to the particular methods selected for that phase of the computation.

The finite element structural mechanics example is illustrative. The proper choice of element type (e.g. brick, triangular prism, tetrahedron) and the location and number of grid points depend on the geometry of the object and the desired accuracy of the result. A general purpose code should handle objects of arbitrary geometry and include many different types of meshes and elements. Once the elements and node placements have been selected,

[Figure 1.1 diagram: Structural Mechanics Code (Finite Element Method), shown as five phases in sequence: define a geometry; impose a mesh; compute mass and stiffness matrices; solve the generalized eigenvalue problem; process (visualize) the results.]

Figure 1.1: An example of modularity in a production code from structural mechanics. Each computational phase on the right is conceptually independent. If the code is general and flexible, there are several different methods available for each phase and alternative algorithms for each method. However, each shares a common data structure.


numerical integrations are performed to compute the elements of the mass and stiffness matrices. Thus, at least one specific numerical integration routine must be included for each type of element. Often more than one numerical integration algorithm is included for each element type to allow the user to balance speed and accuracy. Regardless of the choices of element type, node placement, and integration algorithm, a generalized eigenvalue problem results whose solutions define the normal modes and natural frequencies of vibration. The choices made in earlier phases affect the sparsity and size of the system, and thus influence the characteristics of the preferred solution algorithm. A general program will provide a number of different solution algorithms for the generalized eigenvalue problem.

Breaking the development of a computer simulation into independent phases is consistent with the reductionist world view which permeates modern science. Modular construction is strongly associated with structured, top-down software engineering techniques which lend themselves naturally to functional procedure decomposition. Importantly, empirical observations indicate that assembling large codes by loosely coupling smaller, independent subprograms tends to decrease overall development time [9]. Thus, we can expect numerous programs to meet the general criteria of production codes set out in Section 1.3.

1.4.1 Implications for parallelization

For a shared memory system, parallelism is expressed by the parallel loop, whose iterations are executed concurrently by different processors [10]. A FORTRAN loop is inhibited from parallel execution by loop-carried dependences. Dependences have two flavors: data and control. A data dependence is a relationship between two statements that refer to the same memory location. A control dependence results when an ordering on a set of statements is imposed by an explicit control statement such as a conditional. Ignoring special cases such as input and output, any arrangement of statements which respects dependences will maintain a program's semantics. Thus, it suffices to preserve the dependence relations when applying transformations. A loop-carried dependence is a data or flow dependence which crosses an iteration boundary. Dependence analysis is the process of identifying data and flow dependences in the code. Automatic dependence analysis can be aided by using compiler directives, which provide the analyzer with additional dependence information known to the programmer.

The top-down modular construction found in a production code imposes procedure boundaries which must be crossed when analyzing program loops which contain procedure calls. Either the modules must be expanded inline or interprocedural analysis must be performed. A summary of interprocedural analysis is generally less precise than the analysis of inlined code. The drawback of inline expansion is the potentially exponential growth in code size [11], making it generally infeasible for all but the smallest modules. It does have value when used selectively, as when it is incorporated into an interactive compiler. Interprocedural analysis is effective, but difficult to perform. A manual interprocedural analysis of a large production code is a time-consuming and error-prone task. Efforts have been undertaken to automate interprocedural analysis [12-14]. Even a rudimentary success at automating interprocedural analysis could have a significant impact on the parallel optimization of production codes.

To motivate this point Schneider [2] introduces several hypotheses which have important implications for developing a general parallelization methodology. First,

Each computational phase is conceptually independent and implemented as an integrated subprogram which communicates with other subprograms through well-defined interfaces.

For example, all methods of implementing a particular computational phase expect to find their input data in the same locations and place their output data in the same locations. However, different methods may have different requirements for temporary workspace and so on. To achieve the desired level of flexibility, the interfaces between procedures will typically consist of array sections defined by parameters read from the input dataset. Thus, they will require a symbolic dependence analysis, in which the final decision as to whether a statically detected potential dependence actually exists is deferred to runtime. Such an analysis, in conjunction with programmer assertions as to the valid limits of input parameters, is more general in that it may be possible to determine whether dependences can exist for any valid dataset. Second,

         do j = 1, m
            do i = 1, n
   S1:         a(i,j) = c(i,j-1)
            end do
   S2:      b(j) = a(1,j)
         end do

Figure 1.2: A potentially parallel outer loop.

Within a single invocation of a subprogram, the various methods are typically selected in a mutually exclusive manner.

The particular path through the conditional statements enforcing this exclusivity is usually dependent on parameters read from the input dataset. The implementation of this exclusivity may occur at different levels. The coarsest-grain implementation is at the level of an if-then-else construct governing the execution of a large segment of the calling tree. In other cases, where different methods require a substantial amount of common code, or if a data parallel programming model is used, the exclusivity may be implemented within loops at the leaves of the tree. The first two hypotheses lead to a third:

Because different methods within a procedure are invoked in a mutually exclusive manner, control dependences are important in eliminating large classes of potential data dependences between routines called from that procedure.
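A minimal sketch of this motif (all names hypothetical, not taken from Premix): a phase driver whose branches, selected by a parameter read from the input dataset, invoke mutually exclusive methods sharing one interface.

            subroutine phase(method, u, n)
      c     method comes from the input dataset; exactly one branch
      c     executes, so the two solvers can never both touch u in
      c     a single invocation of phase
            integer method, n
            double precision u(n)
            if (method .eq. 1) then
               call direct(u, n)
            else
               call iterate(u, n)
            end if
            end

An analyzer that tracks the control dependence on method can discard any potential data dependence between direct and iterate.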

A general code is really many codes, with a particular execution path chosen by input parameters. Input dataset dependent control statements are often used to select which modules are invoked, and in which order. This class of control dependences is extremely important because they determine whether arrays or array sections are initialized. If a location is not initialized by a module, the analyzer cannot locally determine whether the data that resides there is input. Consider the loop in Figure 1.2. If input variable n is zero, then the outer loop is parallel; otherwise, statement S1 carries a data dependence across the loop boundary. Input dataset dependent control dependences are expected to be difficult to analyze statically

because the branching behavior is defined by the problem input. However, they appear to be crucial in determining which potential data dependences actually exist at execution time, and they are significant in simplifying interprocedural analysis. It will probably be necessary to develop identification and analysis techniques for these control-dependence-dominated data dependences in order to make meaningful progress toward useful automatic interprocedural analysis of production codes.

Sets of procedures which one expects to be concurrently callable are not rare. Application knowledge can assist us in discovering independence among these sets of routines. For instance, library procedures generally come with sufficient documentation to determine whether the physical problem being solved by them is independent in principle. Consider as a simple example a library which performs numerical quadrature. Because each approximation to an integral does not depend on any previous approximation, we expect no dependences between successive calls to integration procedures.
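A sketch of the quadrature case (quad and its arguments are hypothetical; the directive spelling again assumes an Alliant-style compiler): because approximation i never reads approximation j, the calls may run concurrently once each is given a private slice of workspace.

      cvd$  cncall
            do i = 1, nint
      c        each call writes only result(i) and column i of the
      c        workspace, so successive calls are independent
               call quad(f, lo(i), hi(i), result(i), work(1,i))
            end do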

1.5 Premix: A production combustion chemistry code

Chemical kinetics computations have been of interest since long before the advent of the electronic computer. It was in the context of chemical kinetics that Curtiss and Hirschfelder [15] first identified the problem of stiffness in ordinary differential equations in 1952. The combustion field is rich with examples of problems requiring an understanding of elementary chemical kinetic processes, including control of combustion-generated pollutants, knocking in internal combustion engines, environmental impact of compounds emitted from combustion, and disposal of toxic waste [16]. A common element in all the work is a need to understand the chemical kinetics behavior of large chemical reaction systems and the associated convective and diffusive transport of mass, momentum, and energy. These reactive flow modeling problems are governed by conservation equations coupled with a hydrodynamic system driven by the energy released or absorbed by the chemical reactions.

Combustion modeling problems are chemically interesting because the large energy release associated with burning gives rise to high temperatures and many exotic chemical species. The high temperatures

resulting from the transfer of chemical energy to heat lead to rapid expansion of the gases, which in turn affects the convective flow. Stiffness arises as a result of the differing time scales of the chemical kinetics and the hydrodynamics [17]. Chemical reactions occur on the order of picoseconds, while the convective flow occurs on the order of seconds. Stiffness also results from combustion's large temperature gradients. To overcome these numerical difficulties one must use time-implicit algorithms and adaptive gridding [18]. Combustion problems provide a unique opportunity to study the properties of chemical species which do not occur elsewhere. Other reactive flow modeling problems are similarly interesting and challenging even though they do not involve combustion per se, for example stratospheric ozone depletion by chlorofluorocarbons driven by high solar energy absorption.

A group at Sandia has developed a number of software tools which facilitate simulation of reactive flow. Three basic tools lie at the heart of their effort. The Chemkin library [19] and the Chemkin Thermodynamic Database [20] are used to analyze gas-phase chemical kinetics. The Transport library [21] is used for evaluating gas-phase multicomponent transport properties. Surfkin [22] is a package for analyzing heterogeneous chemical kinetics at a solid-surface/gas-phase interface. These three combustion libraries undergo continual revision as part of an ongoing effort to provide the numerical combustion community with standardized software. This approach is successful because the governing equations for each reactive flow application share a number of features. A general discussion of this structured approach to simulating reactive flow is found in [16].

Several codes have been built by Sandia to exploit Chemkin, Transport, and Surfkin. All are general-purpose codes for use in solving a class of combustion problems. A common computational description of the chemical reaction rates is used in all. Simulation of shock heating of a reactive gas mixture is accomplished using Shock [23], a general purpose code for predicting chemical kinetic behavior behind incident and reflected shocks. Spin [24] is used to model one-dimensional rotating-disk or stagnation-flow chemical vapor deposition reactors. Psr [25] predicts the steady-state temperature and species composition in a perfectly stirred reactor. Senkin [26] computes the time evolution of a

homogeneous reacting gas mixture in a closed system. Premix [3] is used to predict the steady-state temperature and species concentrations in one-dimensional burner-stabilized and freely propagating premixed laminar flames. These programs can be used in concert to investigate the properties of a specific class of combustion reactions. For example, Premix, Psr, and Senkin have been used to model nitrogen chemistry in combustion to study, among other things, the environmental impact of the nitrogen compounds emitted from combustion [27].

The group at Sandia has developed interests in applications where time to solution is critical, such as CAD environments for designing reactors or pseudo real-time applications such as monitoring reactors. Sequential execution is not fast enough for these applications. To assist in this effort, we undertook the task of modifying Premix, which uses the Chemkin and Transport libraries, to execute efficiently on a shared memory multiprocessor with a modest number of CPUs, such as the Alliant FX series, the Convex C2 series, and the high end of the Sun SPARCstation, HP Apollo, IBM RS/6000, and Silicon Graphics Iris 4D series. Such machines are becoming less expensive and more widely available.

The original (sequential) version of the flame code was developed subject to constraints similar to those outlined in Section 1.3. Only one version of the FORTRAN 77 source is distributed. This code must execute without significant modification on all machines from a personal computer to a Cray. To ensure the software can still be used by the large established user base, modifications to the code are strictly backward compatible; that is, the subroutine interfaces are fixed. Our main concern, then, will be extracting parallelism from the chemical and thermodynamic computations performed by the Chemkin and Transport libraries. In Chapter 2 we will see that this goal is further justified by the execution profile, which indicates the chemical kinetics and hydrodynamics computations consume the vast majority of the serial execution time.


CHAPTER 2

EXPERIENCE AND RESULTS

Parallel optimization has a simple goal: reduce the actual time a program requires to produce a solution to a given problem through efficient use of multiprocessing hardware. To accomplish this, a certain degree of independence must be present in the code so that different subproblems can be executed by independent processors. Often the desired independence, if it exists, is apparent from the mathematical description of the physical problem. However, this mathematical or conceptual independence may not be reflected in the code. A sequential FORTRAN code defines a total ordering of memory operations. In many cases, only a partial ordering is needed at the conceptual level. The role of automatic parallelization technology is to detect the instances where a partial ordering is sufficient. A discussion of this technology can be found in a number of references, including [10,28,29].

We must therefore make a reasonably sharp distinction between the mathematical model of a problem, the computational method for its solution, and the particular implementation of the method. We examine each of these separately in Section 2.1, and observe how well the original version of Premix expresses parallelism inherent to the mathematical model and computational method. That is, we determine to what extent orderings not required by the computational method occur in the specific implementation. In Section 2.2 we consider the task of removing extraneous true dependences from the existing code. Section 2.3 details the resulting performance improvement.

[Figure 2.1 diagram: the program driver together with the libraries Chemkin (with initialization routine ckinit), Transport (with initialization routine mcinit), Premix, Twopnt, and Linpack.]

Figure 2.1: Premix consists of a driver and several libraries, some with built-in initialization routines.


Machine                   Alliant FX/80
Organization              shared-memory MIMD
Operating system          Xylem 2(3), based on Concentrix 3.0
Compiler                  FX/FORTRAN version 3.1.33
Memory                    64 MB
Cache                     128 KB
I/O processors            6
Computation processors    8

Table 2.1: Configuration of the Alliant FX/80 [31]. The processors are register-based with chained functional units and memory port. The computation processors are connected by a concurrency bus, which keeps the overhead for concurrency small.

Serial        -Og          sequential optimization
Concurrent    -Ogc         sequential and concurrent optimization
Recursive     -recursive   local variables allocated on the stack
Profile       -pg          produce an execution profile with gprof

Table 2.2: FX/FORTRAN compiler flags and their meanings [31].

2.1 Overview of the original code

Premix is a typical example of a library-oriented production FORTRAN code. It is a flexible code developed to analyze general problems involving combustion of premixed gases in a flame. Figure 2.1 shows the decomposition of Premix into a driver and four libraries: Chemkin [19], used to analyze gas-phase chemical kinetics; Transport [21], used to evaluate gas-phase multicomponent transport properties; Twopnt [30], a two point boundary value problem solver; and Linpack [1], a popular numerical linear algebra package. Each is a standardized, extensible library intended for use on a wide variety of platforms. The code, approximately thirty thousand lines of standard FORTRAN 77, is highly modular, robust, and portable.

A sequential profile for an execution of a nitrogen combustion model [27] appears in Figures 2.2 and 2.3. Our testing environment is an Alliant FX/80 with eight processors. Table 2.1 gives the system specifications, and Table 2.2 lists some relevant compiler options. The program spends most of its execution time in routines from the Chemkin and Transport libraries. Approximately 65% of the sequential execution time is consumed performing chemical kinetics


[Figure 2.2 diagram: the sequential call tree, rooted at the driver (9229 s) and premix (9228 s), descending through fldriv, twopnt, fun, jacob, newton, and timstp to the Linpack routines (dcopy, dasum, idamax, dgbco, dgbfa, dgbsl, dscal, daxpy, ddot) and the initialization routines ckinit, ckindx, mcinit, and point, each node annotated with elapsed seconds and call counts.]

Figure 2.2: Execution profile of the sequential program for the nitrogen combustion problem on the Alliant FX/80 with serial optimization (compile command: fortran -Og -pg). Elapsed times (in seconds) are superscripted and the numbers of events are subscripted. Procedure times include time spent in called subprocedures. Total elapsed time is 9305 seconds. Appendix A contains a glossary of procedure names and functions. The fun subtree is reproduced in its entirety in Figure 2.3.


[Figure 2.3 diagram: the call tree below fun (7903 s, 3680 calls), including the transport routines mtrnpr, mdifv, mcadif, mcedif, mcacon, and mceval, the Chemkin routines ckytx, ckmmwy, ckrhoy, ckwyp (5225 s), ckrat (4267 s), ckhml, ckcpbs, ckcpms, cksmh, and ckytcp, and the helper routines area and temp, each node annotated with elapsed seconds and call counts.]

Figure 2.3: Execution profile for procedure fun on the Alliant FX/80 with serial optimization (compile command: fortran -Og -pg). Elapsed times (in seconds) are superscripted and the numbers of events are subscripted. Procedure times include time spent in called subprocedures. Total elapsed time is 9305 seconds. Appendix A contains a glossary of procedure names and functions.


computations in routines ckytx, ckmmwy, ckwyp, ckrat, ckhml, ckcpbs, ckrhoy, and ckcpms. Another 20% of the execution time is consumed by transport computations in routines mtrnpr, ckytx, mcadif, mcedif, mceval, and mcacon. (A glossary of procedure names and functions appears in Appendix A.) Solving systems of linear equations consumes most of the remaining time. The Twopnt library simply controls the flow of the computations and thus contributes little to the execution time.

We compiled the code "as is" using the FX/FORTRAN parallelizing compiler and executed it with varying numbers of processors on the Alliant FX/80. A performance curve for the resulting executable appears in Figure 2.4. With eight processors the code executed only 2.5 times faster. This is a poor improvement in performance on the Alliant FX/80. (Throughout this paper we do not report results for vector optimizations, as they invariably resulted in worse execution times.)

A description of the mathematical model and the computational method assists in discovering which level of outer loop parallelism is best to obtain a granularity sufficient to saturate available processors with reasonably sized parcels of independent work [32]. A mathematical description of the general problem appears in several references [3,33]. Section 2.1.1 contains a brief review of that description. In Section 2.1.2 we consider the numerical methods incorporated in Premix. In Section 2.1.3 we discuss the granularity of the parallelism inherent to the mathematical model and computational method. Section 2.1.4 details the specific implementation of Premix and assesses how well it expresses the parallelism inherent to the computational methods used to frame the physical problem.

2.1.1 Mathematical model

Premix computes the steady-state temperature and species concentrations in one-dimensional burner-stabilized and freely propagating premixed laminar flames. The steady state is defined by the following conservation equations [3]:

\dot{M} = \rho u A = \text{constant} \qquad \text{(mass)} \qquad (2.1)

[Figure 2.4 plot: measured speedup versus number of processors (1 through 8) for the original version of Premix (compiled with fortran -Ogc) against the ideal linear speedup.]

Figure 2.4: Ideal and measured parallel speedup for the original version of Premix using the Alliant FX/FORTRAN parallelizing compiler. Speedups are computed from elapsed times (T1/Tp).

\dot{M}\,\frac{dT}{dx} - \frac{1}{c_p}\frac{d}{dx}\!\left(\lambda A \frac{dT}{dx}\right) + \frac{1}{c_p}\sum_{k=1}^{K} (\rho A Z_k)\, c_{pk}\,\frac{dT}{dx} + \frac{A}{c_p}\sum_{k=1}^{K} \dot{\omega}_k h_k W_k = 0 \qquad \text{(energy)} \qquad (2.2)

\dot{M}\,\frac{dY_k}{dx} + \frac{d}{dx}(\rho A Z_k) - A\,\dot{\omega}_k W_k = 0 \qquad (k = 1,\dots,K) \qquad \text{(momentum)} \qquad (2.3)

where K is the number of chemical species. Thus there are K + 2 conservation equations governing the steady state. The symbols appearing in these equations are defined in Table 2.3.

x                   spatial coordinate along flow direction
A                   cross-sectional area of the stream tube encompassing the flame
Ṁ                   mass flow rate (independent of x)
T                   temperature
p                   pressure (constant)
u                   velocity of the fluid mixture
ρ = pW̄/(RT)         mass density
Z_k = Y_k V_k       mass fraction times diffusion velocity for the kth species
V_k                 diffusion velocity of the kth species
Y_k                 mass fraction of the kth species
W_k                 molecular weight of the kth species
W̄                   mean molecular weight of the mixture
R                   universal gas constant
λ                   thermal conductivity of the mixture
c_p                 constant pressure heat capacity of the mixture
c_pk                constant pressure heat capacity of the kth species
ω̇_k                 molar rate of production of the kth species per unit volume
h_k                 specific enthalpy of the kth species

Table 2.3: Symbols appearing in the premixed flame equations (2.1)-(2.4).

The high temperatures of a burning flame give rise to many exotic chemical species and complicated chemical reactions. Appendix B lists the species and reactions for the nitrogen chemistry model we used in our study [27]. This model has been used throughout this thesis for performance analysis. The chemical kinetics computations occur in evaluating the molar rates of species production ω̇_k, the specific form of which is determined by the input dataset [3],

\dot{\omega}_k = \sum_{i=1}^{K} \nu_{k,i}\, q_i \qquad (2.4)

where the ν_{k,i} are user-specified integer stoichiometric coefficients and the q_i are the computed reaction rates. Determining the value of q_i is computationally intensive. The equations for some of the more common reaction types appear in Figures 2.5, 2.6, and 2.7, with their symbols defined in Table 2.4. The heat generated or absorbed by these reactions strongly affects the gas flow. Once the chemical kinetics have been computed from the input data, the hydrodynamic system governed by the conservation equations (2.1)-(2.3) can be solved in the presence of the chemical reactions.
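In code, equation (2.4) is a pair of nested loops (a sketch of the formula only; the expensive part, evaluating each q_i from the expressions in Figures 2.5-2.7, is performed by Chemkin routines such as ckrat):

      c     molar production rates from the stoichiometric
      c     coefficients nu(k,i) and reaction rates q(i), eq. (2.4)
            do k = 1, kk
               wdot(k) = 0.0d0
               do i = 1, nrxn
                  wdot(k) = wdot(k) + nu(k,i)*q(i)
               end do
            end do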

2.1.2 Computational methods

Equations (2.2) and (2.3) are discretized using finite difference approximations. A grid is numbered from 1 at the cold (input) boundary to J at the hot (output) boundary. The convective terms, Ṁ(dT/dx) from the energy equation and Ṁ(dY_k/dx) from the momentum equation, are modeled by either first order windward or central differences as necessary. The other derivatives are approximated by first and second order central differences. The diffusive term of the species conservation equation (2.3), d/dx(ρAZ_k), is approximated in the same manner. Appropriate boundary conditions are implemented for both the cold and hot boundaries, yielding a two point boundary value problem. (See equations (10)-(21) in [3] and the discussion therein for a detailed description.) The nitrogen combustion problem is solved first using windward differences for the convective terms. This initial solution is used as a starting condition for a run using central differences.

The finite difference approximations reduce the stiff two point boundary value problem to a system of nonlinear algebraic equations. The boundary value problem is modeled first on a coarse mesh. When necessary, new grid points are added (nonuniformly) in regions where the solution or its gradients change rapidly. Assuming a unique solution exists, this process ends when the solution has been resolved to a specified degree.

The nonlinear system is solved using the modified Newton-Raphson algorithm. We seek a vector φ which satisfies

F(\phi) = 0. \qquad (2.26)
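As an illustration (a minimal sketch, not the Premix residual routine fun, which handles the K + 2 coupled equations), one interior component of F for a scalar convection-diffusion model shows the windward and central differencing described above:

      c     residual at interior grid point j for m*du/dx - d2u/dx2,
      c     windward (upwind) convective term, central diffusive term
            subroutine resid(j, u, x, m, f)
            integer j
            double precision u(*), x(*), m, f
            double precision conv, diff, hl, hr
            hl = x(j) - x(j-1)
            hr = x(j+1) - x(j)
            if (m .ge. 0.0d0) then
               conv = m*(u(j) - u(j-1))/hl
            else
               conv = m*(u(j+1) - u(j))/hr
            end if
            diff = 2.0d0*((u(j+1)-u(j))/hr - (u(j)-u(j-1))/hl)/(hl+hr)
            f = conv - diff
            end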

q_i = q_{fi} - q_{ri} \qquad (2.5)

q_{fi} = k \prod_{k=1}^{K} [X_k]^{\nu'_{k,i}} \qquad (2.6)

q_{ri} = k \prod_{k=1}^{K} [X_k]^{\nu''_{k,i}} \qquad (2.7)

k_0 = A_0 T^{\beta_0} \exp(-E_0/R_c T) \qquad (2.8)

k_\infty = A_\infty T^{\beta_\infty} \exp(-E_\infty/R_c T) \qquad (2.9)

k = k_\infty \left(\frac{P_r}{1 + P_r}\right) F \qquad (2.10)

P_r = \frac{k_0 [M]}{k_\infty} \qquad (2.11)

Lindemann form:

F = 1.0 \qquad (2.12)

Troe form:

\log F = \left[1 + \left(\frac{\log P_r + c}{n - d(\log P_r + c)}\right)^2\right]^{-1} \log F_{\mathrm{cent}} \qquad (2.13)

c = -0.4 - 0.67 \log F_{\mathrm{cent}} \qquad (2.14)

n = 0.75 - 1.27 \log F_{\mathrm{cent}} \qquad (2.15)

d = 0.14 \qquad (2.16)

F_{\mathrm{cent}} = (1 - a)\exp(-T/T^{***}) + a\exp(-T/T^{*}) + \exp(-T^{**}/T) \qquad (2.17)

SRI form:

F = \left[a\exp(-b/T) + \exp(-T/c)\right]^{X} d\, T^{e} \qquad (2.18)

X = \frac{1}{1 + \log^2 P_r} \qquad (2.19)

Figure 2.5: Computing reaction rates q_i for pressure-dependent fall-off reactions [3]. See Table 2.4 for definitions of the symbols.

q_i = q_{fi} - q_{ri} \qquad (2.20)

q_{fi} = k_{fi} \prod_{k=1}^{K} [X_k]^{\nu'_{k,i}} \qquad (2.21)

q_{ri} = k_{ri} \prod_{k=1}^{K} [X_k]^{\nu''_{k,i}} \qquad (2.22)

k_{fi} = A_i \exp\!\left(\frac{-E_i}{R_c T} + \frac{B_i}{T^{1/3}} + \frac{C_i}{T^{2/3}}\right) \qquad (2.23)

k_{ri} = k_{fi} \bigg/ \left[\exp\!\left(\sum_{k=1}^{K}\nu_{k,i}\frac{S_k^0}{R} - \sum_{k=1}^{K}\nu_{k,i}\frac{H_k^0}{RT}\right)\left(\frac{P_{\mathrm{atm}}}{RT}\right)^{\sum_{k=1}^{K}\nu_{k,i}}\right] \qquad (2.24)

Figure 2.6: Computing reaction rates q_i for Landau-Teller reactions [3].

q_i = \left(\sum_{k=1}^{K}(\alpha_{k,i})[X_k]\right)(q_{fi} - q_{ri}) \qquad (2.25)

Figure 2.7: Computing reaction rates q_i for third body reactions [3].

K                                number of species
T                                temperature
[X_k]                            molar concentration of the kth species
[M]                              concentration of the mixture
k_fi                             forward rate constant of the ith reaction
k_ri                             reverse rate constant of the ith reaction
ν'_{k,i}                         forward stoichiometric coefficient
ν''_{k,i}                        reverse stoichiometric coefficient
ν_{k,i} = ν''_{k,i} - ν'_{k,i}   stoichiometric coefficient
A_i, B_i, C_i                    user-specified factors
α_{k,i}                          user-specified efficiency coefficient
β_i                              user-specified temperature exponent
E_i                              user-specified activation energy
R_c                              universal gas constant (in units compatible with E_i)
K_ci, K_pi                       equilibrium constants
S_k^0                            entropy (standard state)
H_k^0                            enthalpy (standard state)

Table 2.4: Symbols appearing in the chemical kinetics equations (Figures 2.5-2.7).

We begin with only a (usually poor) approximation φ̂ to φ. It is clear that F(φ̂) is not zero. The quantity

y = F(\hat{\phi}) \qquad (2.27)

is called the residual.

In order to obtain a block-tridiagonal structure in the Jacobian, the mass flow rate Ṁ is treated as an independent variable Ṁ_j at each grid point, and the additional equation stating that they are all equal,

\frac{d\dot{M}_j}{dx_j} = 0 \qquad (j = 1,\dots,J), \qquad (2.28)

is added with a suitable boundary condition. This mass conservation equation, coupled with the energy conservation equation (2.2) and the K equations of momentum conservation (2.3), yields a total of K + 2 equations. The approximate solution vector φ̂ has the form

\hat{\phi} = (\hat{\phi}_1, \hat{\phi}_2, \dots, \hat{\phi}_J), \qquad (2.29)

where

\hat{\phi}_j = (T_j, Y_{j,1}, Y_{j,2}, \dots, Y_{j,K}, \dot{M}_j). \qquad (2.30)

This corresponds to the independent variables for temperature, species concentration, and mass flow rate at each grid point.

The modified Newton-Raphson algorithm produces a sequence {φ^(n)} which converges to the solution of the nonlinear equations F(φ) given a sufficiently good starting estimate φ^(0). The sequence {φ^(n)} is rejected if it does not converge. The algorithm can be described as follows. Given an initial estimate φ^(0) = φ̂, the modified Newton iteration is

\phi^{(n+1)} = \phi^{(n)} - \left(\frac{\partial F}{\partial \phi}\right)^{-1}_{\phi^{(n)}} F(\phi^{(n)}). \qquad (2.31)

This is too expensive and delicate to be used in practice. Evaluation of the Jacobian matrices (∂F/∂φ) is time consuming, and convergence requires a very good initial estimate φ^(0). As a result the Jacobian matrix is replaced by one, J^(n), inherited from a previous step, and the full step from φ^(n) to φ^(n+1) is cut short by a damping parameter λ^(n). Equation 2.31 becomes

\phi^{(n+1)} = \phi^{(n)} - \lambda^{(n)} \left(J^{(n)}\right)^{-1} F(\phi^{(n)}), \qquad (2.32)

where 0 < λ^(n) ≤ 1, and

J^{(n)} = J^{(n-1)} \quad \text{or} \quad J^{(n)} = \left.\frac{\partial F}{\partial \phi}\right|_{\phi^{(n)}}. \qquad (2.33)

The Jacobian is approximated by the finite difference perturbations suggested in [34],

J_{i,j} \approx \frac{F_i(\phi_j + \delta_j) - F_i(\phi_j)}{\delta_j}, \qquad (2.34)

where

\delta_j = r\,\phi_j + a, \qquad (2.35)

with the relative and absolute perturbations, r and a respectively, chosen to be the square root of the unit roundoff. Because J is block tridiagonal, several of its columns can be computed simultaneously [35]. This is done by perturbing φ^(n) at every third grid point, evaluating the residual, and forming the corresponding diagonal columns of J, as in Figure 2.8. The criterion [36] for accepting φ^(n+1) is a decrease in magnitude of the undamped steps,

\left\| \left(J^{(n)}\right)^{-1} F(\phi^{(n+1)}) \right\| < \left\| \left(J^{(n)}\right)^{-1} F(\phi^{(n)}) \right\|. \qquad (2.36)

If φ^(n+1) is rejected, the step is retried with a halved damping parameter or a new Jacobian matrix. The iteration continues until the maximum norm of the undamped correction vector falls within a user-specified tolerance.

Should the Newton algorithm fail to converge, a user-specified number of artificial time integrations are performed to improve the conditioning of the nonlinear system. Time derivatives are added to equations (2.2) and (2.3) to produce a system of parabolic partial differential equations [3],

\rho A \frac{\partial T}{\partial t} = -\dot{M}\frac{\partial T}{\partial x} + \frac{1}{c_p}\frac{\partial}{\partial x}\!\left(\lambda A \frac{\partial T}{\partial x}\right) - \frac{1}{c_p}\sum_{k=1}^{K}(\rho A Y_k V_k)\, c_{pk}\,\frac{\partial T}{\partial x} - \frac{A}{c_p}\sum_{k=1}^{K}\dot{\omega}_k h_k W_k \qquad (2.37)

[Figure 2.8 diagram: the block-tridiagonal sparsity pattern of the J(K+2) by J(K+2) numerical Jacobian, with square blocks of order K+2 along the diagonal and zeros elsewhere.]

Figure 2.8: The numerical Jacobian is block tridiagonal. In forming the Jacobian, for each species the solution vector is perturbed at every third grid point, the residual is computed, and the corresponding block diagonal columns are updated. There are J grid points and K + 2 equations, where K is the number of chemical species.

and

\rho A \frac{\partial Y_k}{\partial t} = -\dot{M}\frac{\partial Y_k}{\partial x} - \frac{\partial}{\partial x}(\rho A Y_k V_k) + A\,\dot{\omega}_k W_k \qquad (k = 1,\dots,K). \qquad (2.38)

The backward Euler method is used to obtain the solution. Time derivatives are approximated by finite differences. All other terms are approximated with windward or central differences as before, but at time step t+1. The discretized problem is again a system of nonlinear equations. The modified Newton-Raphson method is employed to solve the nonlinear system, but in this case it is much more likely to converge because the Jacobian has a factor 1/h on the diagonal, where h is the size of the time step. Thus, a smaller time step corresponds to a reduced condition number for the numerical approximation of the Jacobian.
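The damping and Jacobian reuse logic of equations (2.32)-(2.36) can be summarized as follows (a sketch under our reading of the algorithm, not the Twopnt source; fun, solvej, and vnorm are hypothetical helpers that evaluate the residual, apply the stored factorization of J, and take the maximum norm):

      c     one damped step of the modified Newton iteration
            call fun(phi, f)
            call solvej(f, step)
            rlam = 1.0d0
         10 do i = 1, neq
               trial(i) = phi(i) - rlam*step(i)
            end do
            call fun(trial, ftrial)
            call solvej(ftrial, stnew)
      c     accept if the undamped correction shrinks (eq. 2.36);
      c     otherwise halve the damping parameter and retry
            if (vnorm(stnew, neq) .lt. vnorm(step, neq)) then
               do i = 1, neq
                  phi(i) = trial(i)
               end do
            else if (rlam .gt. rlmin) then
               rlam = 0.5d0*rlam
               go to 10
      c     else: give up and form a fresh Jacobian
            end if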

2.1.3 Parallelism inherent to the problem

Based on the analysis of the physical problem in the previous section, where can we expect to find independence in the code? It is evident we will not find the independence necessary for parallelism across the Newton or the time stepping iterations. Each iteration forces a global synchronization. However, the Jacobian evaluation contains considerable parallelism, in that all residual differences can be computed simultaneously (see Equation 2.34).

What of the computation of the residual itself? The Newton iteration is typical. Let φ^(n) represent the vector of independent variables after Newton iteration n. In order to investigate what independence lies in these operations, we must know what relationship exists between y and the old values φ^(n) in the computation of y = F(φ^(n)). Table 2.5 lists the quantities computed during residual evaluation. A careful examination of the table reveals that y depends only on φ_{j-1}^(n), φ_j^(n), φ_{j+1}^(n), φ_{j-1}^(n-n'), φ_j^(n-n'), and φ_{j+1}^(n-n'). The dependence on some previous evaluation n - n' arises from the fact that the transport coefficients are not recomputed each time. It follows that y depends only on φ^(n) and φ^(n-n'), both of which are known values at the beginning of Newton iteration n + 1. That is, y = F(φ^(n)) is a completely explicit computation. That suggests the computations for each grid point section of y can be performed simultaneously. Finally, many of the properties evaluated

2.1.4 Speci c implementation The control ow of the code can be viewed as in Figure 2.9. The Chemkin Interpreter [19] and Transport Property Fitting Code [21] are each external modules which access databases to create so-called linking les to be read during execution. The Chemkin and Transport libraries require access to many problem-speci c constants, such as the molecular weights of the species. In addition, each library requires some scratch space, or memory locations used to store values needed only temporarily. These scratch array are signi cant when analyzing for parallelism. Because the libraries are general-purpose and used in a wide variety of applications, these work arrays must be of arbitrary size. Thus, a \dynamic" memory allocation scheme is required. Both Chemkin and Transport implement dynamic memory allocation in a way common to scienti c programs written in FORTRAN. For each data type employed by one of the program libraries (character, integer, double precision oating point), a single, large array is carved into sections by a sequence of integer o sets computed at runtime. The indices are computed during initialization and stored in COMMON blocks for future use. They are never modi ed after initialization. The work arrays for each of the libraries are passed as arguments down the calling tree. Thus, a COMMON block for each of the libraries encapsulates the pointers into their respective

29

Quantity T ( +1) temperature n

j

Depends on n

(n)

n p

n

n+1)

mass fractions

n

1 2

j

k;j

M_ (

n+1)

j

( ?

n0 )

n

j

mass ow rate

j

n

n

1 2

n

j

n

K;j n

n

k;j n

n

k;j

j

n

n

j

j

n

n

j

n

1 2

1 2

K;j

k;j

k;j

n

1 2

n

1 2

;j n

k;j

n

n p

p1

j

;j

n

1 2

j

n

k;j

(n) (n) j +1 j n0 ) (n)

j

n0

1 2

j

Y(

Reference Equation (2.2) Subroutine fun [3] Eqns. (10){(13)

p; T ? ; T ; T ; M_ ; ) ; ( +? ) ; ( ?? ; c ;j ; : : : ; c( K;j ( ) ( ) ( ) ( ) ( ) c j ;A + ;A ? ; ? ; + ; !_ 1( ) ; : : : ; !_ ( ) ; h(1 ) ; : : : ; h( ) ) p; Y ( ?) 1 ; Y ( ) ; Y ( +1 ; Z ( ) ; Z ( )?1 ; !_ ( ) ; ( ?) ; ( +) ; A( ?) ; A( +) ; M_ ( ) M_ (?)1 ; M_ ( ) ; M_ (+1) (n) j 1

n

1 2

1 2

j

j

n

j

j

thermal conductivity T ( ? 0 ) ; X1( ? 0 ) ; : : : ; X ( ? n

n

n

j

n

n0 )

n

;j

K;j

D( ?+ ) di usion coecients p; T ( ? ) ; X1( ? ) ; : : : ; X ( ? n

n0

n

1 2

k;j

n0

n

j

n0

n0 )

n

;j

K;j

( ?+ ) di usion ratios

T ( ? ) ; X ( ? ) ; ( +? ) ; W ( +?

Z(

p; D1( ?+ ) ; : : : ; D( ?+ ) ; X1( ); : : : ; X ( ) ; ( +) ; T (+) 1( ?+ ) ; : : : ; ( ?+ ) p; W ( +) ; R; T (+)

n

n0

1 2

k;j

n)

k;j +

1 2

di usion velocities

n

n0

n

j

n

n

1 2

n

n



mass density

n

n

n

K;j

n

1 2

n

j

j

n0

K;j n

1 2

j

n

n0

1 2

j

n0

K;j

n0

;j

j

n0

;j

;j

(n) 1 j+ 2

n0

k;j

1 2

1 2

n

1 2

j

1 2

1 2

h( ) enthalpies

T(

c( k;j) speci c heats

T(

c( j ) mean speci c heat

) c( ;j) ; : : : ; c( K;j ; Y1( ) ; : : : ; Y (

!_ ( ) molar production rates

p; T ( ) ; Y (

n

k;j

n p

n p

n

k;j

n)

j

n)

j

n p1

n p

n

n)

;j

K;j

n)

n

j

k;j

W ( ) mean molecular weight W1 ; : : : ; W ; Y1( ) ; : : : ; Y ( n

K

j

X ( ) mole fractions

Y ( ); W ; W

A( +) area p pressure W molecular weight R universal gas constant

user supplied constant constant constant

n

k

n

j

k

1 2

n

k;j

n

n)

;j

K;j

k

n0 )

1 2

Equation (2.3) Subroutine fun [3] Eqns. (14){(17) Equation (2.28) Subroutine fun [3] Eqns. (31){(35) Subroutines mcmcdt, mcacon [21] Eqns. (50), (60){(62) Subroutine mcadif [21] Eqns. (48), (49); [3] Eqn. (8) Subroutines mcatdr, mtrnpr [21] Eqns. (51){(56) Subroutine mdifv [21] equations (41), (42), (71){(73); [3] Eqns. (6), (7), (9), (14){(16) Subroutine ckrhoy [19] Eqns. (2) Subroutine ckhml Subroutine ckcpms [19] Eqns. (26) Subroutine ckcpbs [19] Eqns. (34) Subroutines ckwyp, ckrat [19] Eqns. (49){(72) Subroutine ckmmwy [19] Eqn. (3) Subroutine ckytx [19] Eqn. (6) Subroutine area

Table 2.5: Quantities computed during residual evaluation. Appendix A contains a glossary of proce-

dure names and functions. Nonsuperscripted values are formed from values available at the end of the previous iteration.

[Flow diagram. The Chemkin Interpreter and the Transport Property Fitting Code supply linking files to the initialization step, which reads the initial iterate from the dataset. The modified Newton-Raphson method (Figure 2.14) is applied on the current grid; on convergence, grid intervals are selected and refined and the Newton solve repeats, until no further refinement is needed and the solution is found. If the Newton iteration fails to converge, artificial time steps x(t) -> x(t+1) are taken (each itself a modified Newton-Raphson solve), iterated while ||x(t+1) - x(t)|| > tol and t < T; if time stepping succeeds, the Newton solve is retried, and otherwise the run finishes without a solution.]

Figure 2.9: Flow diagram for Premix. The nonlinear discretized system is solved using the modified Newton-Raphson algorithm. Should the Newton algorithm fail to converge, a user-specified number of artificial time integrations are performed to improve the conditioning of the nonlinear system. The time stepping method also uses the Newton method. A flow diagram for computing F(x) appears in Figure 2.16.

Grid     Jacobian      Residual      Total residual    Residual evaluations
points   evaluations   evaluations   evaluations       due to Jacobian        %
  19         13           1,214          2,631              1,417           53.86
  19         13           1,269          2,686              1,417           52.76
  28          2              34            252                218           86.51
  39          1              19            128                109           85.16
  53          1              16            125                109           87.20
  55          1              14            123                109           88.62
  55          1              12            121                109           90.08
  55          5              66            741                675           91.09
  60          3              34            847                813           95.99
  61          1              13            122                109           89.34
Total:       41           2,691          7,776              5,085           65.37

Table 2.6: Summary of program events for sample data.

It is important to note that the COMMON blocks for a particular library are declared only in procedures within the library. The flame code documentation [3] instructs the programmer to write a small main program that opens all files, allocates the working storage space, and calls the flame program, premix, with the working storage areas and their sizes as parameters. An example of a program driver appears in Appendix C. Procedure premix immediately calls procedure point, which carves an array section from the declared work space for each of the libraries by initializing integer offsets. Figure 2.10 shows a portion of this initialization. In the figure, the size parameters used to determine the offsets (lenrck and lenrmc) are read as input data. If the largest offset (ntot) is larger than the size of the work space, the program halts with an error message indicating how large the working storage parameter should be for proper execution. It is a simple matter to adjust the declarations, recompile one routine, and restart the code. If the working storage is sufficient, locations rwork(nckw) and rwork(nmcw) are aliased, via parameter passing, to rckwrk and rmcwrk, respectively. These arrays are further subdivided by pointers initialized in the Chemkin and Transport library initialization routines, ckinit and mcinit, respectively. We will examine ckinit with the understanding that mcinit is similar (full listings of both routines appear in Appendix D). Within ckinit the pointers into rckwrk are initialized as in Figure 2.11. The quantities nmm and nkk are input parameters for the number of atomic weights and the number of molecular weights, respectively. Hereafter, the array section rckwrk(ncaw)...rckwrk(ncaw+nmm-1) is reserved for the atomic weights, and array section rckwrk(ncwt)...rckwrk(ncwt+nkk-1) is reserved for molecular weights. Pointers such as ncwt appear in a COMMON block along with problem-specific integer constants such as nmm and nkk. This COMMON block is declared only in Chemkin routines, and the values are never modified after initialization. Figure 2.12 contains a typical Chemkin procedure which exhibits the use of these pointers. The variables in the COMMON block are initialized high in the calling tree, then extensively used, but never modified, in the lowest levels of the tree. This has implications on the complexity of interprocedural analysis, as we will see in Chapter 3.

The code in module Twopnt forms an independent library; it can be used to solve other problems not related to flames. Twopnt contains a procedure, aptly named twopnt, which is intended to function as a coroutine [38,39] with a user-written routine in the main program. For Premix the coroutine is fldriv, which contains all flame-specific code. After initialization, fldriv is invoked. The working storage is loaded either from the so-called linking files created by the Chemkin Interpreter and the Transport Property Fitting Code, or from restart files created by a previous execution of the code (used for checkpointing). Procedure fldriv issues a call to twopnt with one of its arguments acting as a buffer for the data relevant to the two point boundary value problem. Procedure twopnt does not run to completion; it returns control after partial execution. When the twopnt solver requires a residual evaluation, a Jacobian evaluation, or a banded linear solve, it stops, returning an appropriate request to fldriv. The communication mechanism for this call-back from twopnt and its coroutines to the required Premix routines is a series of boolean variables set by twopnt to indicate what actions should be undertaken. Procedure fldriv invokes the appropriate procedure to perform the requested action on the data in the buffer. The result of this action is stored in the same buffer. Procedure fldriv then reinvokes twopnt, without changing the communication parameters. Through a number of assigned GOTO statements twopnt determines how to process the data returned by fldriv and continues the solution process where it had stopped. Figure 2.13 shows how fldriv and twopnt interact during a typical portion of the execution.

C     real chemkin work space
      nckw = 1
C     real transport work space
      nmcw = nckw + lenrck
      ntot = nmcw + lenrmc

Figure 2.10: A portion of procedure point.
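The driver that feeds point is only a few lines long. The sketch below conveys its shape; the actual example appears in Appendix C, and the sizes and the argument list of premix shown here are invented for illustration.

      program driver
c     sketch of the user-written main program; the parameter sizes
c     and the premix argument list here are hypothetical (see
c     Appendix C for the actual example).
      parameter (lenrwk = 60000, leniwk = 8000)
      double precision rwork(lenrwk)
      integer iwork(leniwk)
c     ... open the input, output, linking, and restart files ...
c     pass the work arrays and their declared sizes to the flame
c     program; if they are too small, premix halts with a message
c     giving the sizes required
      call premix (lenrwk, rwork, leniwk, iwork)
      end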

C     atomic weights
      ncaw = 1
C     molecular weights
      ncwt = ncaw + nmm
C     temperature fit array for species
      nctt = ncwt + nkk
      ...
C     pressure of one atmosphere
      ncpa = ncrc + 1
C     internal work space of lengths kk
      nck1 = ncpa + 1
      nck2 = nck1 + kk
      nck3 = nck2 + kk
      nck4 = nck3 + kk
      nci1 = nck4 + kk
      nci2 = nci1 + ii
      nci3 = nci2 + ii
      nci4 = nci3 + ii
      ntot = nci4 + ii - 1
      ...

Figure 2.11: A portion of procedure ckinit.

      SUBROUTINE ckmmwy (y, ickwrk, rckwrk, wtm)
      DIMENSION y(*), ickwrk(*), rckwrk(*)
      COMMON /ckstrt/ nmm , nkk , nii , mxsp, mxtb, mxtp, ncp , ncp1,
     1                ncp2, ncp2t,npar, nlar, nfar, nlan, nfal, nrev,
     2                nthb, nrlt, nwl,  icmm, ickk, icnc, icph, icch,
     3                icnt, icnu, icnk, icns, icnr, iclt, icrl, icrv,
     4                icwl, icfl, icfo, ickf, ictb, ickn, ickt, ncaw,
     5                ncwt, nctt, ncaa, ncco, ncrv, nclt, ncrl, ncfl,
     6                nckt, ncwl, ncru, ncrc, ncpa, nck1, nck2, nck3,
     7                nck4, nci1, nci2, nci3, nci4
c
      sumyow = 0.0
      DO 150 k = 1, nkk
         sumyow = sumyow + y(k)/rckwrk(ncwt + k - 1)
  150 CONTINUE
      wtm = 1.0/sumyow
      RETURN
      END

Figure 2.12: Procedure ckmmwy, with COMMON block ckstrt. An integer offset (ncwt) is used to access a section of array rckwrk, the Chemkin floating point work space, which is passed as an argument to all Chemkin procedures. Within the COMMON block, nmm to nwl are integer constants, "ic" values are integer pointers into the integer workspace, and "nc" values are pointers into the floating point work space.
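Calling a library routine from any level of the tree then requires only that the work arrays be passed along. For instance, once ckinit has filled ickwrk and rckwrk, a caller holding a mass-fraction vector y (declarations assumed) obtains the mean molecular weight with the call below; ckmmwy finds its data through the offsets in the ckstrt COMMON block of Figure 2.12.

c     y(1..nkk) holds mass fractions; ickwrk and rckwrk are the
c     Chemkin work arrays already loaded by ckinit
      call ckmmwy (y, ickwrk, rckwrk, wtm)
c     wtm now contains the mean molecular weight of the mixture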

[Diagram: control enters from premix and alternates between fldriv and twopnt; the procedures invoked along the way include rdkey, newton, jacob, fun, dgbco, dgbsl, and refine, and control finally returns to premix.]

Figure 2.13: Coroutines fldriv and twopnt. The names within the boxes represent procedures invoked by fldriv and twopnt, respectively. The arrows indicate the flow of control. Procedure fldriv acts as a messenger between the problem-independent two point boundary value problem solver and the problem-specific routines the solver must invoke.
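The interplay can be summarized in a few lines of code. The sketch below is not the actual fldriv, whose communication variables are more numerous; the flag names and argument lists are hypothetical, and resid, jacform, and solve stand for the residual evaluation, Jacobian formation, and banded solve that fldriv actually dispatches to. The control structure, however, is the one just described.

      subroutine flskt (buffer, n)
c     hypothetical sketch of the fldriv side of the coroutine; all
c     names and argument lists are invented for illustration
      integer n
      double precision buffer(n)
      logical retfun, retjac, retslv, done
   10 continue
c     the solver advances as far as it can, then returns a request
      call tpskt (buffer, n, retfun, retjac, retslv, done)
      if (done) return
      if (retfun) call resid   (buffer, n)
      if (retjac) call jacform (buffer, n)
      if (retslv) call solve   (buffer, n)
c     the result is left in buffer; reinvoke the solver without
c     touching the communication parameters
      go to 10
      end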

2.1.5 Parallelism inherent to the implementation

Returning to Figure 2.9, we see that each time the outer control loop iterates, either the Newton solver or time stepping is invoked. The Newton solver is always invoked first; time stepping is performed only when the Newton solution phase fails to converge. Figure 2.14 exhibits the flow of the Newton solver. Note that a single Newton iteration consists of the following work:

- calculate the residual (fun),
- evaluate the Jacobian (jacob) and factor it, if necessary (dgbco), and
- backsolve (dgbsl).

Because the transport terms involve only a grid block and its immediate neighbors, and the chemistry is local, several columns of J can be formed at once, as we saw in Section 2.1.2. This is reflected in the flow of Figure 2.15, which shows the Jacobian computation for Newton iteration n. The residual is first evaluated at the current approximation. Then, for each equation k, the approximate solution vector is perturbed at every third grid point. The residual is again evaluated, and the scaled differences are computed and stored in the appropriate columns of J. This process must be repeated three times to obtain a complete approximation to the Jacobian. We noted in Section 2.1.3 that the residual evaluations are independent of one another. Therefore no conceptual reason exists that these residuals cannot be computed all at once.

As can be seen in Figure 2.16, computing the residual (fun) requires numerous chemical and thermodynamic property evaluations at each grid point. The computation has three distinct steps. First, the transport coefficients are evaluated, if necessary (Figure 2.17). Then the diffusion velocities are computed (Figure 2.18). Finally, the chemical kinetics terms are evaluated and the residuals of the governing equations (2.2), (2.3) and (2.28) are determined.

We also noted in Section 2.1.3 that we cannot find the independence necessary for parallelism across either the Newton or the time stepping iterations, because each iteration forces a global synchronization. Inside these serial loops are three levels of parallelism. The outermost is the nest of function evaluations in jacob.

[Flow diagram. Entered from Figure 2.9. Compute the Jacobian J(n) (Figure 2.15). Take the step x(n+1) = x(n) - (J(n))^-1 F(x(n)); if the step fails to reduce ||(J(n))^-1 F||, recompute the current Jacobian or damp the step to x(n+1) = x(n) - mu(n) (J(n))^-1 F(x(n)). Iterate until the step norm falls below tol (convergence) or no progress can be made (no convergence); then return to Figure 2.9.]

Figure 2.14: Flow diagram for the modified Newton-Raphson method (procedure newton). A flow diagram for computing F(x) appears in Figure 2.16.

[Flow diagram. Entered from Figure 2.14 with iterate x0 and residual F(x0). For i = 1 to 3 and for each unknown k = 1 to K+2: perturb z(k,j) to z(k,j) + r|z(k,j)| + a at grid points j = i, i+3, i+6, ...; evaluate x = F(z); store the scaled differences (x(kk,j) - x0(kk,j)) divided by the perturbation into the appropriate columns J(kk,j), kk = 1 to K+2; restore z. After the three passes, factor J and return to Figure 2.14.]

Figure 2.15: Flow diagram for numerical Jacobian approximation (procedure jacob). A flow diagram for computing F(x) appears in Figure 2.16.
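Expressed as code, the strategy of Figure 2.15 amounts to a loop nest like the one below. This is a sketch rather than the actual jacob: all names are hypothetical, resid stands for the residual evaluation, the perturbation constants are typical choices, and a dense Jacobian is used for clarity even though the real matrix is stored in banded form.

      subroutine jacskt (x, xsav, f0, f, a, k2, jj, n)
c     sketch of the strided numerical Jacobian of Figure 2.15.
c     there are k2 = K+2 unknowns at each of jj grid points, and
c     n = k2*jj unknowns in all.  the residual at a grid point
c     depends only on its immediate neighbors, so unknown (k,j) can
c     be perturbed at grid points j, j+3, j+6, ... simultaneously;
c     one residual evaluation then fills a whole group of columns,
c     and three passes cover every column.
      integer k2, jj, n, i, k, j, kk, jn
      double precision x(k2,jj), xsav(k2,jj), f0(k2,jj), f(k2,jj)
      double precision a(n,n), d
      do 50 i = 1, 3
         do 40 k = 1, k2
            do 10 j = i, jj, 3
               xsav(k,j) = x(k,j)
               x(k,j) = x(k,j) + 1.0d-7*abs(x(k,j)) + 1.0d-10
   10       continue
            call resid (x, f, k2, jj)
            do 30 j = i, jj, 3
               d = x(k,j) - xsav(k,j)
c              only grid points j-1, j, j+1 feel this perturbation
               do 20 jn = max(j-1,1), min(j+1,jj)
                  do 15 kk = 1, k2
                     a(kk+(jn-1)*k2, k+(j-1)*k2) =
     *                  (f(kk,jn) - f0(kk,jn)) / d
   15             continue
   20          continue
               x(k,j) = xsav(k,j)
   30       continue
   40    continue
   50 continue
      return
      end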

[Flow diagram. Entered from Figures 2.9, 2.14, and 2.15. Evaluate the transport coefficients, if necessary (Figure 2.17), and the diffusion velocities (Figure 2.18). Then, for each grid point j = 1 to J: evaluate the mass density ρ_{j+1/2}, the cross-sectional tube area A_{j+1/2}, the enthalpies h_{1,j+1/2}, ..., h_{K,j+1/2}, the specific heats c_{p1,j}, ..., c_{pK,j}, the mean specific heat c̄_{pj}, the molar production rates ω̇_{1,j}, ..., ω̇_{K,j}, and the residual components for the mass fractions, the temperature T_j, and the mass flow rate Ṁ_j, assembling y_j from (T_j; Y_{1,j}, ..., Y_{K,j}; Ṁ_j). Return to Figures 2.9, 2.14, and 2.15.]

Figure 2.16: Flow diagram for computing the residual (procedure fun).

[Flow diagram. Entered from Figure 2.16. For each grid point j = 1 to J: evaluate the thermal conductivity λ_{j+1/2}, the diffusion coefficients D_{1,j+1/2}, ..., D_{K,j+1/2}, and the thermal diffusion ratios Θ_{1,j+1/2}, ..., Θ_{K,j+1/2}. Return to Figure 2.16.]

Figure 2.17: Flow diagram for evaluating transport coefficients (procedure mtrnpr).

[Flow diagram. Entered from Figure 2.16. For each grid point j = 1 to J: evaluate V_{1,j+1/2}, ..., V_{K,j+1/2}, then evaluate Z_{1,j+1/2}, ..., Z_{K,j+1/2}. Return to Figure 2.16.]

Figure 2.18: Flow diagram for evaluating diffusion velocities (procedure mdifv).

Each of these loops is independent, in principle, because the residual evaluations do not in any way depend on each other. At the next inner level are the loops over the grid points in fun. As we noted in Section 2.1.3, computing the residual is an explicit operation, so it too is independent in principle, and the iterations over the grid points should, in principle, be parallelizable. At the innermost level are calculations for the reactions and species at each grid point. These loops are not shown explicitly in the figures. The species and reactions are interdependent to a degree; many of these loops contain legitimate true dependences.

In a rough way we can calculate the granularity for each loop level using Table 2.6, beginning with the innermost loops, those over the K species and N reactions. For our sample data, K is 34 and N is 151. Considerable work is performed within some of these loops. The granularity would be moderately large-scale, except that most of these loops are reductions, and thus must be carved into smaller pieces to achieve parallelism. The next loop level out, residual evaluation, yields loops of length 19 to 61 (as the grid is refined the loops grow longer), with each iteration initiating the loops over the K species and N reactions. The granularity is large-scale. At the outermost loop level, a Jacobian evaluation is comprised of these individual residual evaluations, each independent in principle. The Jacobian is block tridiagonal, with diagonal blocks each the size of the number of conservation equations, K + 2 (see Section 2.1.2). Thus, the number of independent calls to fun per Jacobian is 3(K + 2) = 108. The work within each of these calls is considerable, yielding a very large-scale granularity.

How well does the existing code express the parallelism inherent to the implementation? At the innermost level, the loops over the species and reactions are expressed mostly as serial reductions. At the next level out, the loops in fun are inhibited from executing in parallel by several memory dependences and a few true dependences, which we describe in more detail below. A manual interprocedural analysis of the library procedures called within the loops over the grid points in fun, and consultation of the available documentation [3,19,21], reveals that the last argument to each of the called procedures is its sole output argument. These output arguments, along with any variables local to the called procedures in the loop, create memory dependences which inhibit parallelism. Additionally, interprocedural analysis reveals

that some of the Chemkin and Transport procedures use work array sections as temporary scratch space. That is to say, they write to and read from (in that order) global memory locations which are not read by the rest of the program without first being overwritten. (Recall that the work space is divided into array sections identified by integer offsets stored in a COMMON block.) This reuse of globally accessible scratch space creates memory dependences between loop iterations. Three true dependences also appear in fun: the recurrences used to communicate reaction area and density (Figure 2.16) and mole fractions (Figure 2.18) to neighboring grid points are true dependences.

At the outermost level of parallelism, the loops over residual evaluations in procedure jacob, all the dependences in fun reappear. In addition, the locations used to store the residuals are reused by the loop iterations, as are all locations local to procedure fun (assuming static allocation at runtime). These memory dependences inhibit parallel execution. There are no true dependences in jacob. Overall, we see that a number of memory dependences and a few true dependences inhibit the expression of the parallelism inherent to the method and its implementation.

2.2 Optimization

We saw in Section 2.1 that the program spends considerable time computing molar production rates and gas transport coefficients. Our analysis there demonstrated these computations can, in principle, be performed efficiently in parallel. For the test case used here, the Alliant FX/80's model of concurrency can, for all practical purposes, exploit only one level of parallelism. Because the Jacobian is formed only when necessary, a significant number of residual evaluations (fun) occur outside jacob. These consume about 33% of the sequential execution time. Had we excluded that 33% of the serial execution time from our parallelization strategy, we could have hoped to speed up the code by a factor of at most 2.4 on eight processors. The performance of this approach is similar to that obtained by the FX/FORTRAN compiler.
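Both this bound and the sixfold figure used below are instances of Amdahl's law. If f is the fraction of the serial execution time that runs in parallel on P processors, the attainable speedup is

    S(f, P) = 1 / ((1 - f) + f/P),

so that S(0.67, 8) is approximately 2.4, and, for the 95% coverage targeted in the next paragraph, S(0.95, 8) is approximately 5.9.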

To circumvent this performance limitation we chose to parallelize the property evaluation loops in fun. These loops over the grid points are responsible for about 85% of the serial execution time and have sufficiently large-scale granularity to execute efficiently on a modest number of processors. The daxpy linear algebra operation is also inherently parallel and is responsible for another 10% of the execution time. We thus hoped to parallelize code responsible for about 95% of the serial execution time. The corresponding speedup factor of six using eight ideal processors would be a good performance improvement on the Alliant FX/80. The remaining 5% of the code mainly performs triangular solves and input/output operations. These operations are essentially sequential.

In the existing code a number of arrays are passed as arguments to subprocedures with parameters whose shape does not match the declaration in the calling procedure. This happens each time one of the array sections indexed by pointers stored in COMMON blocks is used as a multidimensional array. For example, array section rckwrk(ncfl)...rckwrk(ncfl+nfar*nfal-1) is passed by ckwyp to ckrat, where it is declared a two-dimensional array with nfar as its first dimension. This structure hides the inherent parallelism from parallelizing compilers which rely on inline expansion to cross procedure boundaries. When inline expansion fails, as it does when array shapes do not match, an optimizing compiler must make worst-case assumptions. Almost without exception this will cause it to assume a loop cannot be executed concurrently. Unfortunately, the failure of inlining is in this instance a direct result of the "dynamic" allocation scheme which provides the code with the flexibility to handle problems of any size. Because inlining fails and the FX/FORTRAN compiler can perform no interprocedural analysis, we must manually modify the loops for parallel execution, then ask the compiler not to make worst-case assumptions for the modified loops (the directive CVD$ CNCALL does this for FX/FORTRAN).

Independent copies of local variables (scalars and fixed-size arrays) are created by placing the body of the loop in a separate procedure to be compiled "reentrant" (using the -recursive flag), allowing it to be invoked in parallel. That is, each instantiation of the loop body will be allocated its own local copy of local variables on the program stack. All procedures invoked by the parallel loop will also be compiled "reentrant". Thus, local variables will not give rise to memory oriented dependences. Additionally, we employ scalar and array privatization to provide each iteration with a unique storage location for each of the procedure output arguments identified earlier, and in doing so remove the associated memory dependences.

We saw in the previous section that the loops involving reaction area and density (Figure 2.16), and mole fractions (Figure 2.18), contain true dependences. We can overcome these true dependences using loop distribution. We split the loop in Figure 2.16 into two loops; Figure 2.19 is the result. The first new loop evaluates (for simplicity) all the needed values, including density and area, and places them in expanded arrays or scalars. The second consumes those values to produce the residuals of the conservation equations (2.2), (2.3), and (2.28). The residuals are the output of fun. We similarly split the loop in Figure 2.18 into two loops, the first computing the mole fractions and the second consuming them, resulting in a structure that matches Figure 2.20.

We also saw that some of the Chemkin and Transport procedures use array sections as temporary scratch space. Three of these routines which require scratch space are called within the outer loops we selected as targets for parallelization (Figures 2.16 through 2.18). One solution to this problem is to expand the array as we did earlier. However, the size of the floating-point work space is prohibitive (over 640,000 double precision words for the nitrogen combustion model). An alternative is to provide copies of only the modified array sections. This is not a clean solution, because we must modify the parameter lists for the library routines, violating the requirement for backward compatibility. Another possibility is to dynamically allocate scratch arrays where they are needed. This violates portability, because dynamic allocation is not included in standard FORTRAN 77. Either solution is easily implemented. For our purposes, we chose the solution compatible with the FORTRAN standard, sacrificing backward compatibility.

Some values must be communicated to neighboring grid points for the computation of transport properties. In a shared-memory system this communication is performed by placing the values to be communicated in the shared memory, making them available to all processors. The quantities to be communicated include thermal conductivity (λ_{j±1/2}), reaction area (A_{j±1/2}), density (ρ_{j±1/2}), temperature (T_j), mole fractions (X_{1,j}, ..., X_{K,j}), and mass flow rate (Ṁ_{j±1}).
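Schematically, a transformed grid-point loop looks as follows. The sketch uses invented names (ptbody for the hoisted loop body; the array bound 200 merely assumes kk never exceeds it here); only the CVD$ CNCALL directive and the -recursive compilation are as described above, and the Chemkin calls follow the signatures shown in Figures 2.12 and 3.1.

c     the directive tells FX/FORTRAN not to assume the worst about
c     the call; ptbody and its callees are compiled -recursive
cvd$  cncall
      do 200 j = 1, jj
         call ptbody (j, jj, kk, p, t, s, ickwrk, rckwrk, wdot)
  200 continue

      subroutine ptbody (j, jj, kk, p, t, s, ickwrk, rckwrk, wdot)
c     hoisted loop body (hypothetical).  compiled reentrant, each
c     parallel instance gets its own stack copy of yav; the output
c     wdot carries an extra grid dimension (array expansion), so
c     iterations write disjoint locations.
      integer j, jj, kk, ickwrk(*), k
      double precision p, t(jj), s(kk,jj), rckwrk(*), wdot(kk,jj)
      double precision yav(200), rho
      do 100 k = 1, kk
         yav(k) = s(k,j)
  100 continue
      call ckrhoy (p, t(j), yav, ickwrk, rckwrk, rho)
      call ckwyp  (p, t(j), yav, ickwrk, rckwrk, wdot(1,j))
c     rho and wdot(*,j) would feed the residual assembly that
c     follows in the real code
      return
      end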

[Flow diagram. Entered from Figures 2.9, 2.14, and 2.15. Evaluate the transport coefficients, if necessary (Figure 2.17), and the diffusion velocities (Figure 2.18). A first loop over grid points j = 1 to J evaluates the mass density ρ_{j+1/2}, the cross-sectional tube area A_{j+1/2}, the enthalpies h_{1,j+1/2}, ..., h_{K,j+1/2}, the specific heats c_{p1,j}, ..., c_{pK,j}, the mean specific heat c̄_{pj}, and the molar production rates ω̇_{1,j}, ..., ω̇_{K,j}. A second loop over j then forms the residual components for the temperature T_j and the mass flow rate Ṁ_j, assembling y_j from (T_j; Y_{1,j}, ..., Y_{K,j}; Ṁ_j). Return to Figures 2.9, 2.14, and 2.15.]

Figure 2.19: Flow diagram for modified residual evaluation (fun). The original loop (Figure 2.16) has been split into two loops to make the values computed in the first available to all processors during execution of the second loop.

[Flow diagram. Entered from Figure 2.16. A first loop over grid points j = 1 to J evaluates V_{1,j+1/2}, ..., V_{K,j+1/2}; a second loop over j then evaluates Z_{1,j+1/2}, ..., Z_{K,j+1/2}. Return to Figure 2.16.]

Figure 2.20: Flow diagram for modified mdifv. The original loop (Figure 2.18) has been split into two loops to make the values computed in the first available to all processors during execution of the second loop.
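In source form, the transformation of Figures 2.19 and 2.20 is ordinary loop distribution. The following schematic uses hypothetical names; dens and gov stand for a property evaluation and a governing-equation residual.

      subroutine dstskt (rho, res, jj)
c     sketch of loop distribution (all names hypothetical).
c     originally one loop both produced rho(j) and consumed
c     rho(j-1), a true dependence serializing the iterations:
c        do 100 j = 2, jj
c           rho(j) = dens(j)
c           res(j) = gov(rho(j), rho(j-1))
c 100    continue
      integer jj, j
      double precision rho(jj), res(jj), dens, gov
      external dens, gov
c     first loop: every iteration independent; on completion all
c     rho values sit in shared memory, visible to all processors
      do 110 j = 1, jj
         rho(j) = dens (j)
  110 continue
c     second loop: also independent, reading only completed values
      do 120 j = 2, jj
         res(j) = gov (rho(j), rho(j-1))
  120 continue
      return
      end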

We must guarantee these values are written before the beginning of the parallel loop in which they are read. Thus if, for example, λ_{j+1/2} is produced during iteration j and consumed during iteration j+1 (where it is λ_{(j+1)-1/2}), a true dependence results. We have seen that temperature and mass flow rate are values computed during the previous Newton iteration; thus, they are guaranteed to be written to the shared memory system before they are needed. The thermal conductivities are also written before they are needed, as they are produced in a loop which completes before the parallel loop begins.

2.3 Results

2.3.1 Metric

Our goal is to reduce the elapsed execution time of the flame code on a shared-memory multiprocessor. Our metric is speedup, the ratio of elapsed times before and after parallelization:

    speedup = (execution time of original sequential program) / (execution time of parallelized program)

Table 2.7 gives the speedup for the original code using the FX/FORTRAN compiler. Included are data for several important procedures and the three loops in fun we targeted for optimization in Section 2.2. The profiling option (-pg) was disabled for this experiment. Table 2.8 exhibits the same values for the optimized version of Premix, along with the overall speedup. With the exception of the linear algebra routines, which could not be optimized, the performance improvement is near the anticipated mark of sixfold speedup. Procedures dgbco and dgbsl could not be optimized because of the inherently serial nature of matrix factorization and backward solves.

The importance of parallelizing all significant routines is apparent. Linear algebra originally commanded only about 13% of the elapsed time in sequential mode; now it is responsible for 27% of the time consumed by the optimized version in parallel mode. Additionally, as the number of processors grows, factorizations and backsolves will increasingly dominate execution time.1

                Serial (-Og)        Parallel (-Ogc)     Parallel
Name            Seconds      %      Seconds      %      Speedup
                  (a)                 (b)               (a)/(b)
fldriv            8658     100.0      4172     100.0      2.1
fun               7350      84.9      3525      84.5      2.1
jacob             4004      46.2      1690      40.5      2.4
dgbsl              666       7.7       287       6.9      2.3
dgbco              487       5.6       253       6.1      1.9
fun loop          5472      63.2      2283      54.7      2.4
mtrnpr loop       1651      19.1      1118      26.8      1.5
mdifv loop         170       2.0        82       2.0      2.1

Table 2.7: Parallel speedup, original version. Times are elapsed times on an Alliant FX/80 with eight processors and include time spent in called subprocedures. These data were collected with the profiling disabled, and thus differ from those in Figure 2.2.

                Serial (-Og)        Parallel (-Ogc)     Parallel    Actual
Name            Seconds      %      Seconds      %      Speedup    speedup
                  (c)                 (d)               (c)/(d)    (a)/(d)
fldriv            8773     100.0      1951      99.9      4.5        4.4
fun               7409      84.4      1284      65.7      5.8        5.7
jacob             3979      45.3       681      34.8      5.8        5.9
dgbsl              689       7.6       271      13.9      2.5        2.5
dgbco              491       5.6       247      12.6      2.0        2.0
fun loop          5430      61.9       920      47.1      5.9        5.9
mtrnpr loop       1735      20.0       296      15.1      5.9        5.6
mdifv loop         167       1.9        28       1.4      6.0        6.1

Table 2.8: Parallel speedup, optimized version, with actual speedup. Times are elapsed times on an Alliant FX/80 with eight processors. Procedure timings include time spent in called subprocedures.

This is evident in Figure 2.21, which exhibits the parallel speedup for the optimized code. Formerly the chemistry was so expensive that the time spent in linear algebra could be ignored. Now that the chemistry can be made cheap, the serial linear algebra is comparatively expensive. The algorithmic tradeoffs have changed, suggesting alternatives to the overall solution strategy should be reviewed. Such an analysis is beyond the scope of this thesis.2 In Figure 2.21 we also see that the performance improvement of the optimized code does not match the predicted values. While we did not analyze the source of this inefficiency, we suspect it is due mainly to less than optimal load balancing.

2.3.2 Recap

A detailed analysis of the mathematical model used in Premix, coupled with a study of the computational methods, gave us a picture of a hierarchy of parallelism inherent to the problem being solved. A manual interprocedural analysis followed, allowing us to determine to what extent the parallelism inherent to the implementation was expressed in the original version of the code. We then chose an outer loop level appropriate to our target machine and applied a few manual transformations to the code in order to make the parallelism visible to an automatic parallelizing compiler. In all, we modified less than 100 lines of code. The result was an improvement in parallel performance, more than doubling the speedup obtained by automatic compilation alone.

1 The Lapack effort [40-47] offers parallel versions of these solvers, exploiting parallelism in multiple right hand sides and blocking algorithms. However, the present version of the factor and backsolve routines did not produce any performance improvement in the optimized version of Premix.
2 Some discussion of parallel methods for solving two point boundary value problems can be found in [48]. Parallel algorithms for banded linear systems such as the block tridiagonal system in Premix are explored in [49].

[Plot: speedup versus number of processors, 1 through 8. Four curves are shown: ideal; predicted (95% parallel); optimized (fortran -Ogc [-recursive]); original version (fortran -Ogc).]

Figure 2.21: Ideal, predicted, original and optimized parallel speedup versus number of processors on an Alliant FX/80 with the FX/FORTRAN parallelizing compiler. Speedups are computed from elapsed times (T1/Tp).


CHAPTER 3

IMPLICATIONS AND SUMMARY

We were able to easily parallelize Premix largely because the code was developed with each of the constraints listed in Section 1.3 in mind. In this chapter we explore this result more fully. We consider each constraint separately, then discuss to what extent the hypotheses introduced in Chapter 1 are borne out by the code. From this we present our ideas as to what analysis techniques are most essential in parallelizing a production code. We end with a note on the automatability of these techniques.

3.1 Premix as a production code

3.1.1 Generality

Premix was clearly developed with generality in mind. The code can handle both burner-stabilized and adiabatic freely propagating flames. The convective terms of equations (2.2) and (2.3) can be discretized by either windward or central differences. The effect of thermal diffusion can be included or neglected. Thermodynamic properties are computed using either multicomponent or mixture-averaged formulas. The "conservation diffusion velocity" recommended by [50] can be included to enforce species conservation. Temperatures can be determined from an a priori profile or computed from the coupled energy-species equations (2.2). Tolerances can be specified, as can the number of time integrations. Keywords included in the input file select which kind of problem to solve. The initial grid spacing and parameters for grid refinement are also controlled by the input dataset.

3.1.2 Flexibility

Flexibility was also given much attention in Premix. Regardless of the execution parameters, the same data structures are used for all possible executions of the code. As described in Chapter 2, a "dynamic" allocation scheme is used to support problems of arbitrary size. This is the most portable solution to the problem of flexible storage allocation. Input data can be presented in terms of either mass or mole fractions. An individual reaction can be one of several different supported reaction types, including Landau-Teller, third body, and pressure-dependent fall-off. The latter can be of Lindemann, Troe, or SRI form.

3.1.3 Robustness

The code executes cleanly. The input data is checked for validity, and the results are periodically checkpointed. Importantly, the code can be restarted from a checkpoint with different execution parameters. Information saved during the checkpointing operation includes the most recent accepted update to the solution of the two point boundary value problem, the temperature profile, the contents of the Chemkin work space, and the contents of the Transport work space.

3.1.4 Extensibility The Chemkin libraries undergo constant revision. Each revision is carefully designed to enforce backward compatibility.


3.1.5 Portability

A single version of standard FORTRAN 77 code is distributed for all platforms, from a desktop personal computer to a Cray. Operating-system-specific code is localized to the program driver. Machine-specific constants are localized to two Linpack routines in the math library.

3.1.6 Implications

The first of Schneider's hypotheses, that each computational phase is conceptually independent and implemented as an integrated subprogram which communicates with other subprograms through well-defined interfaces, holds for Premix. Well-defined interfaces can be seen between each of the code's four modules: Premix, Twopnt, Chemkin, and Transport. The main procedure in Premix is a coroutine of the main procedure in Twopnt. They pass messages only through reverse communication parameters. Premix and the other libraries interact only via library subroutine calls. Though it has access to them, Premix never directly modifies the Chemkin and Transport working storage locations. For conceptual independence, note that Twopnt is an independent library which can be used to solve other problems not related to flames. The Chemkin and Transport libraries are also used without modification by numerous other codes.

The second hypothesis, that methods are typically selected in a mutually exclusive manner within a single invocation of a procedure, holds for Premix at the program level. Each of the choices listed in Section 3.1.1 results in a particular execution path. Much of the flow controlled by the input dataset appears in high-level if-then-else constructs. The different reaction types are supported by conditional statements in the outermost branches of the calling tree.

The third hypothesis, that control dependences are important in eliminating large classes of potential data dependences, is not as important for Premix as parallelized for the Alliant FX/80. While control dependences do eliminate potential data dependences in the code, none of these appear at the outer loop level we targeted for parallelization. However, if we had chosen to parallelize across residual evaluations as part of Jacobian computation, control dependences would have eliminated a significant number of data dependences.

Nevertheless, we found that for Premix on the Alliant FX/80, discovering the precise use of library work array sections was essential, even though we could not use control dependences to determine whether arrays were "killed". The array sections in the Chemkin and Transport floating point working storage are a necessary result of the "dynamic" storage strategy. A full analysis would have been complex, but once we discovered that array section boundaries are fully respected by all parts of the code, the interprocedural analysis became almost trivial. The only difficulty lay in tracking the use of array storage locations.

To motivate this difficulty, consider the code in Figure 3.1. We sought to parallelize the outer loop. Some portion of array yav is written and read each iteration. In order to determine whether yav could be privatized, we had to discover whether the first loop writes to all locations later read. By demonstrating that kk equals nkk, we proved that these array sections are the same, and thus that yav could be privatized. To analyze yav, symbols kk and nkk were traced to nearly the top of the calling tree and down a different branch, where we discovered they are set equal by the Chemkin initialization procedure ckindx. See Figure 3.2. This example demonstrates that the needed information can be far (on the calling tree) from the procedure being compiled.

Similar examples occur frequently in Premix. Values such as nkk, the number of species, are needed in the main program for establishing loop bounds. They are also encapsulated in the library COMMON blocks because other simulations may not need direct access to those values, but the library still does. The initializing call to ckindx is essentially a direct request for that value from the Chemkin library. For Premix the need for widespread dissemination of symbolic information is a direct result of the modularity of the program's libraries. It is encouraging to note that all symbolic information needed to analyze the code is available at compile-time.

      do 1000 j = 2, jj-1
         ...
         do 100 k = 1, kk
            yav(k) = 0.5 * (s(nys+k,j) + s(nys+k,j+1))
  100    continue
         xmdot = s(nm,j)
         rhom = rhop
         call ckrhoy (p, tav, yav, ickwrk, rckwrk, rhop)
         ...
 1000 continue

      subroutine ckrhoy (p, t, y, ickwrk, rckwrk, rho)
      dimension ickwrk(*), rckwrk(*), y(*)
      common /ckstrt/ nmm , nkk , nii , mxsp, mxtb, mxtp, ncp , ncp1,
     1                ncp2, ncp2t,npar, nlar, nfar, nlan, nfal, nrev,
     2                nthb, nrlt, nwl,  icmm, ickk, icnc, icph, icch,
     3                icnt, icnu, icnk, icns, icnr, iclt, icrl, icrv,
     4                icwl, icfl, icfo, ickf, ictb, ickn, ickt, ncaw,
     5                ncwt, nctt, ncaa, ncco, ncrv, nclt, ncrl, ncfl,
     6                nckt, ncwl, ncru, ncrc, ncpa, nck1, nck2, nck3,
     7                nck4, nci1, nci2, nci3, nci4
      sumyow = 0.0
      do 150 k = 1, nkk
         sumyow = sumyow + y(k)/rckwrk(ncwt + k - 1)
  150 continue
      rho = p/(sumyow*t*rckwrk(ncru))
      return
      end

Figure 3.1: Array section analysis for an outer loop in fun. If kk equals nkk then the first loop "kills" the array section; it can be privatized.
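Had the equality kk = nkk resisted automatic discovery, a programmer-supplied assertion could stand in for it. No FX/FORTRAN directive with this exact meaning is claimed here; the spelling below is hypothetical.

c     hypothetical assertion directive: assert what the compiler
c     could not prove, that the loop bound kk equals the species
c     count nkk, so the loop kills yav(1:kk) before ckrhoy reads it
cvd$  assert (kk .eq. nkk)
      do 100 k = 1, kk
         yav(k) = 0.5 * (s(nys+k,j) + s(nys+k,j+1))
  100 continue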

[Calling tree. The variable appearing in each procedure is shown in quotation marks: premix calls point "kk" and fldriv "kk"; point calls ckinit "nkk", ckindx "kk = nkk", and mcinit; fldriv calls twopnt, fun "kk", temp, jacob "kk", dgbco, dgbsl, and others; fun calls mtrnpr, mdifv, area, ckrhoy, ckwyp "nkk", ckytcp "nkk", ckrat "nkk", cksmh "nkk", ckhml "nkk", ckcpbs "nkk", and ckcpms "nkk".]

Figure 3.2: Tracing symbols kk and nkk. In quotation marks is the variable which appears in each procedure. We see that nkk is only used by Chemkin library procedures. We see also that the important information "kk = nkk" is far removed from the procedure where it is needed for parallel compilation (fun).

3.2 Production code parallelizing essentials

We suggest the following analysis techniques are important in parallelizing large production FORTRAN codes. We do not intend this as an exhaustive listing; rather, we include those techniques required to analyze and transform Premix.

Symbol propagation - We saw in the previous section that modularity can force a piece of information important to eliminating a potential dependence to be located far from the procedure being analyzed. Thus, effective and efficient constant and symbol propagation are essential to parallelizing a production code. Runtime tests might serve as an effective substitute in some instances. Programmer-supplied assertions remain a powerful tool. These should be tailored to reflect the programmer's view of the code.

Interprocedural analysis - Any flexible FORTRAN library code must use a scheme similar to that used in Chemkin and Transport to "dynamically" allocate array sections. Interprocedural analysis methods which can handle array sections are necessary for privatizing these common structures. Symbol propagation must be capable of propagating information about array ranges gained from array section analysis.

Common structure recognition - One cannot insist a compiler recognize the full semantics of a program in order to parallelize it. However, as a result of the constraints of flexibility and portability, certain structures appear often enough to merit special recognition. An example is the "dynamic" working storage allocation scheme used by Chemkin and Transport. This is really the only way to produce a general FORTRAN program which is both flexible and portable. The same scheme is used by many quantum chemistry codes such as Hondo [51,52] and several codes in the Perfect Club Benchmarks suite [53] such as Adm and Trfd. Once this storage allocation scheme is recognized, we can be reasonably sure that array section boundaries will be respected.

3.3 Automatability

For each candidate loop we must analyze any called procedures to determine how their arguments are used. If the program or library was written in a modular fashion, respecting production code constraints, we expect argument passing to be the predominant form of interprocedural communication. We want to classify all arguments to the important procedures as input, output, update, or scratch. Table 3.1 indicates what information is needed to distinguish between these uses.

                          Relation to procedure call
Argument Type   before        during                   after
IN              written to    read                     don't care
INOUT           written to    read, then modified      read
OUT             don't care    modified                 read
SCRATCH         don't care    modified                 not read before written
INSCRATCH       written to    read, then modified      not read before written

Table 3.1: Argument classification.

From the table we can see that we will need to answer the following questions:

- Within the procedure, is the argument read, modified, or both?
- Within the procedure, is the argument first read or modified?
- How is the argument used after the procedure has finished?

Some terminology will facilitate our discussion [10]. A definition of a program variable v is a statement which assigns a value to v. A use of v is a statement which reads v. To avoid having to make pessimistic assumptions about the definition and use of v, we need to perform an interprocedural definition-use analysis. Discovering definition-use and use-definition chains is a mechanical process. Analyzing array intervals with symbolic offsets is rather cumbersome, however, especially when those offsets are not known until run-time and must therefore be represented as symbolic expressions. The constraints of generality and portability make the use of symbolic array indexes likely. As we have seen, it may be necessary to traverse the entire calling tree, propagating constants and symbolic analysis throughout. The complexity of such propagation is exponential in the size of the calling tree.

A definition of an argument is said to be outwardly exposed if it is the last definition of the argument in the procedure. An argument use is outwardly exposed if the calling procedure does not contain a definition before this use. To distinguish between arguments used for subroutine output and those that are just used for temporary storage within the subroutine, we must also discover whether the last modification of the argument is outwardly exposed; that is, we must determine whether the argument is live out.

Thus, the flow of symbolic and interprocedural definition-use information between the program and its libraries is bidirectional. With an interprocedural definition-use analysis for procedures called by our candidate loops in place, we can analyze the loop statements for dependences using all available techniques. The intraloop statement analysis is best done by an automatic compiler. In general, user interaction with the compiler is necessary to communicate the interprocedural analysis. An encapsulation of the interprocedural definition-use analysis must be communicated to the compiler; see Figure 3.3. One way this is done is through compiler assertions (directives). The programmer explicitly labels arguments as input, output, update, or scratch. FORTRAN 90 includes IN, OUT, and INOUT directives. Our classification (Table 3.1) suggests a SCRATCH (or PRIVATE) directive would be a useful addition, because scratch arrays are clearly distinct from the previous three and are automatically privatizable (neither live in nor live out). If no mechanism is available for passing the analysis along to the compiler, the transformations will have to be done by hand. In this case, it is best to limit hand transformations to the more useful techniques, identified by Eigenmann [54]. While automatic compilation may go a long way towards generating efficient derivative codes, high-level user insight can always be employed profitably.
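In Fortran 90 the first three classes can be written down directly with the INTENT attribute, and the interface block of Figure 3.3 is a natural place to carry them. Below is a sketch for ckrhoy in standard Fortran 90 syntax; the classification follows Figure 3.1, and SCRATCH, which has no standard spelling, appears only as a comment.

      interface
         subroutine ckrhoy (p, t, y, ickwrk, rckwrk, rho)
            double precision, intent(in)  :: p, t        ! IN
            double precision, intent(in)  :: y(*)        ! IN
            integer,          intent(in)  :: ickwrk(*)   ! IN
            double precision, intent(in)  :: rckwrk(*)   ! IN
            double precision, intent(out) :: rho         ! OUT
            ! a work argument classified SCRATCH (neither live in
            ! nor live out) could be privatized automatically
         end subroutine ckrhoy
      end interface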

[Diagram: a driver containing a loop, connected to its libraries through an interface block.]

Figure 3.3: An encapsulation of the interprocedural analysis of a library must be communicated to the calling procedures during compilation, perhaps by an interface block. Additionally, symbolic information must be passed from the calling procedures to the libraries in order to completely analyze library procedures.

Ideally, the compiler and programmer communicate interactively. An interactive compiler might simply ask questions such as:

- Does the call to ckwyp have any side effects on the loop?
- Can ckrhoy be taken out of the loop?
- What is the value (range) of jj? Is it always greater than three?
- Is yav live at the end of the loop?
- Is kk equal to nkk?
- Is rckwrk(nck1)...rckwrk(nck1+nkk) scratch space?
- Is nktb a permutation vector?

3.4 Conclusions

Premix exhibits characteristics common to many general production FORTRAN codes. By considering the mathematical model and computational methods found in Premix we discovered a hierarchy of parallelism inherent to the combustion problem being solved. Choosing a level in this hierarchy appropriate to the test machine, we more than doubled the speedup obtained by automatic compilation alone, employing only a few standard transformations. Our experience in manually analyzing and transforming the code suggests that the properties of generality, flexibility, robustness, extensibility, and portability greatly simplify the analysis required to generate a large-grain parallel version of a code. We discovered that symbol propagation, interprocedural array section analysis, and common structure recognition are essential tools for analyzing production codes like Premix. These techniques appear suitable for inclusion in interactive automatic compiling systems.


REFERENCES

[1] J. Dongarra, C. Moler, J. Bunch, and G. Stewart. LINPACK Users' Guide. Society of Industrial and Applied Mathematics, Philadelphia, 1979.
[2] D. Schneider. A manifesto on the structure and parallelization of large production FORTRAN codes. Unpublished.
[3] R. Kee, J. Grcar, M. Smooke, and J. Miller. A FORTRAN program for modeling steady laminar one-dimensional premixed flames. Technical Report SAND85-8240, Sandia National Laboratories, 1985.
[4] M. Frisch. Gaussian 86 User's Guide. Corporate Research Laboratories, Eastman Kodak Co., Rochester, NY 14650, 1987.
[5] T. Butler and D. Michel. NASTRAN. GPO, 1971.
[6] R. DeMeis. A code with dynamic impact (DYNA 3D). Aerospace America, 30:42-43, May 1992.
[7] O.-J. Dahl, E. Dijkstra, and C. Hoare. Structured Programming. Academic Press, New York, 1992.
[8] J. Wagener. Principles of FORTRAN 77 Programming. Wiley, New York, 1980.
[9] J. Kral. Empirical laws of software development and their implications. Computational Physics Communications, 41:385-391, 1986.
[10] H. Zima and B. Chapman. Supercompilers for Parallel and Vector Computers. ACM Press Frontier Series. ACM Press, New York, 1991.
[11] K. Cooper, M. Hall, and L. Torczon. An experiment with inline substitution. Software - Practice and Experience, 21(6):581-601, June 1991.
[12] M. Ganapathi and K. Kennedy. Interprocedural analysis and optimization. Technical Report COMP TR89-96, Department of Computer Science, Rice University, P.O. Box 1892, Houston, TX 77251-1892, July 1989.
[13] P. Havlak and K. Kennedy. Experience with interprocedural analysis of array side effects. Supercomputing '90 Proceedings, 1990.
[14] D. Callahan and K. Kennedy. Analysis of interprocedural side effects in a parallel programming environment. Proceedings of the First International Conference on Supercomputing, June 1987.
[15] C. Curtiss and J. Hirschfelder. Integration of stiff equations. Proceedings of the National Academy of Sciences of the United States of America, 38:235-243, 1952.
[16] R. Kee and J. Miller. A structured approach to the computational modeling of chemical kinetics and molecular transport in flowing systems. Technical Report SAND86-8841, Sandia National Laboratories, 1986.
[17] E. Oran and J. Boris. Numerical Simulation of Reactive Flow. Elsevier, 1987.
[18] V. Giovangigli and N. Darabiha. Vector computers and complex chemistry combustion. Mathematical Modeling in Combustion and Related Topics, pages 491-503, 1988.
[19] R. Kee, F. Rupley, and J. Miller. CHEMKIN-II: A FORTRAN chemical kinetics package for the analysis of gas-phase chemical kinetics. Technical Report SAND89-8009, Sandia National Laboratories, 1989.
[20] R. Kee, F. Rupley, and J. Miller. The CHEMKIN thermodynamic data base. Technical Report SAND87-8215B, Sandia National Laboratories, 1987.
[21] R. Kee, G. Dixon-Lewis, J. Warnatz, M. Coltrin, and J. Miller. A FORTRAN computer code package for the evaluation of gas-phase, multicomponent transport properties. Technical Report SAND86-8426, Sandia National Laboratories, 1986.
[22] M. Coltrin, R. Kee, and F. Rupley. Surface CHEMKIN: A FORTRAN package for analyzing heterogeneous chemical kinetics at a solid-surface-gas-phase interface. Technical Report SAND90-8003, Sandia National Laboratories, 1990.
[23] R. Mitchell and R. Kee. A general-purpose computer code for predicting chemical behavior behind incident and reflected shocks. Technical Report SAND82-8205, Sandia National Laboratories, 1982.
[24] M. Coltrin, R. Kee, G. Evans, E. Meeks, F. Rupley, and J. Grcar. SPIN: A FORTRAN program for modeling one-dimensional rotating-disk/stagnation-flow chemical vapor deposition reactors. Technical Report SAND91-8003, Sandia National Labs, 1991.
[25] P. Glarborg, R. Kee, J. Grcar, and J. Miller. PSR: A FORTRAN program for modeling well-stirred reactors. Technical Report SAND86-8209, Sandia National Labs, 1986.
[26] A. Lutz, R. Kee, and J. Miller. SENKIN: A FORTRAN program for predicting homogeneous gas phase chemical kinetics with sensitivity analysis. Technical Report SAND87-8248, Sandia National Laboratories, 1987.
[27] J. Miller and C. Bowman. Mechanism and modeling of nitrogen chemistry in combustion. Progress in Energy Combustion Science, 15:287-338, 1989.
[28] U. Banerjee, R. Eigenmann, A. Nicolau, and D. A. Padua. Automatic program parallelization. Proceedings of the IEEE, 81(2):211-243, February 1993.
[29] D. Padua and M. Wolfe. Advanced compiler optimization for supercomputers. CACM, 29(12):1184-1201, December 1986.
[30] J. Grcar. The Twopnt program for boundary value problems. Technical Report SAND91-8230, Sandia National Laboratories, April 1992.
[31] Alliant Computer Systems Corporation, Acton, MA. FX/FORTRAN Programmer's Handbook, 1985.
[32] J. Tyler, A. Bourgoyne, D. Logan, J. Baron, T. Li, and D. Schneider. A vector-parallel version of BOAST II for the IBM 3090. Internal Report, IBM Kingston, 1990.
[33] V. Giovangigli. Convergent iterative methods for multicomponent diffusion. Impact of Computing in Science and Engineering, 3:244-276, 1991.
[34] A. Curtis, M. Powell, and J. Reid. On the estimation of sparse Jacobian matrices. Journal of the Institute of Mathematics and its Applications, 13:117-119, 1974.
[35] J. Olsson, O. Lindgren, and O. Andersson. Efficient formation of numerical Jacobians used in flame codes. Combustion Science and Technology, 1991.
[36] P. Deuflhard. A modified Newton method for the solution of ill-conditioned systems of nonlinear equations with application to multiple shooting. Numerical Mathematics, 22:289, 1974.
[37] D. Kuck, E. Davidson, D. Lawrie, A. Sameh, C.-Q. Zhu, A. Veidenbaum, J. Konicek, P. Yew, K. Gallivan, W. Jalby, H. Wijshoff, R. Bramley, U. M. Yang, P. Emrath, D. Padua, R. Eigenmann, J. Hoeflinger, G. Jaxon, Z. Li, T. Murphy, J. Andrews, and S. Turner. The Cedar system and an initial performance study. Proceedings of the 20th International Symposium on Computer Architecture, San Diego, CA, May 16-19, 1993.
[38] E. Organick, A. Forsythe, and R. Plummer. Programming Language Structures. Academic Press, New York, 1978.
[39] M. Marcotty and H. Ledgard. Programming Language Landscape. Macmillan, New York, second edition, 1986.
[40] E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen. LAPACK: A portable linear algebra library for high-performance computers. Technical Report CS-90-105, The University of Tennessee Computer Science Department, May 1990.
[41] E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, S. Ostrouchov, and D. Sorensen. LAPACK Users' Guide. Society for Industrial and Applied Mathematics, Philadelphia, 1992.
[42] E. Anderson and J. Dongarra. Evaluating block algorithm variants in LAPACK. Technical Report CS-90-103, The University of Tennessee Computer Science Department, April 1990.
[43] J. Demmel, N. Higham, and R. Schreiber. Block LU factorization. Technical Report CS-90-110, The University of Tennessee Computer Science Department, February 1992.
[44] E. Anderson. Robust triangular solves for use in condition estimation. Technical report, The University of Tennessee, August 1991.
[45] J. Demmel and N. Higham. Improved error bounds for underdetermined system solvers. Technical Report CS-90-113, The University of Tennessee Computer Science Department, August 1990.
[46] J. Demmel and N. Higham. Stability of block algorithms with fast level 3 BLAS. Technical Report CS-90-110, The University of Tennessee Computer Science Department, July 1990.
[47] J. Demmel, J. Dongarra, and W. Kahan. On designing portable high performance numerical libraries. Technical Report CS-90-110, The University of Tennessee Computer Science Department, March 1992.
[48] S. Wright. Stable parallel algorithms for two-point boundary value problems. SIAM Journal on Scientific and Statistical Computing, 1992.
[49] S. Wright. Parallel algorithms for banded linear systems. SIAM Journal on Scientific and Statistical Computing, 12(4):824-842, July 1991.
[50] T. Coffee and J. Heimerl. Transport algorithms for premixed laminar, steady-state flames. Combustion and Flame, 43(273), 1981.
[51] E. Clementi, editor. Modern Techniques in Computational Chemistry: MOTECC-89. ESCOM, 1989.
[52] S. Chin, E. Clementi, G. Corongiu, M. Dupuis, D. Frye, D. Logan, A. Mohanty, and V. Sonnad. MOTECC-89: Input/Output Documentation. IBM Corp., Kingston, NY 12401, 1990.
[53] M. Berry, D. Chen, P. Koss, D. Kuck, L. Pointer, S. Lo, Y. Pang, R. Roloff, A. Sameh, E. Clementi, S. Chin, D. Schneider, G. Fox, P. Messina, D. Walker, C. Hsiung, J. Schwarzmeier, K. Lue, S. Orszag, F. Seidl, O. Johnson, G. Swanson, R. Goodrum, and J. Martin. The Perfect Club Benchmarks: Effective performance evaluation of supercomputers. International Journal of Supercomputer Applications, 3(3):5-40, Fall 1989.
[54] R. Eigenmann. Toward a methodology of optimizing programs for high-performance computers. In Proceedings of the 1993 International Conference on Supercomputing, pages 27-36, Tokyo, Japan, July 19-23, 1993. ACM Press.

APPENDIX A

GLOSSARY OF PROCEDURE NAMES


Below is a glossary of selected procedure names in Premix. Procedures beginning with ck appear in the Chemkin library [19]. Procedures beginning with mc appear in the Transport library [21].

area     Evaluate cross-sectional tube area.
ckcpbs   Evaluate mean specific heat.
ckcpms   Evaluate specific heat.
ckhml    Evaluate enthalpies.
ckinit   Create Chemkin array sections by initializing integer offsets.
ckmmwy   Evaluate mean molecular weight.
ckrat    Evaluate rate constants for chemical reactions.
ckrhoy   Evaluate mass density.
cksmh    Evaluate entropies minus enthalpies for the species.
ckwyp    Evaluate molar production rates.
ckytcp   Evaluate molar concentrations.
ckytx    Convert mass fractions to mole fractions.
daxpy    Perform vector triad.
dcopy    Perform vector copy.
ddot     Perform vector dot product.
dgbco    Factor banded matrix; compute condition number.
dgbfa    Factor banded matrix.
dgbsl    Solve using factored banded matrix.
driver   Declare array storage; invoke premix.
dscal    Scale a vector.
fldriv   Execute subroutines as instructed by twopnt.
fun      Evaluate residual.
idamax   Determine maximum value in an array.
jacob    Compute numerical approximation of the Jacobian matrix.
mcacon   Evaluate mixture thermal conductivity.
mcadif   Evaluate mixture-averaged diffusion coefficients.
mcatdr   Evaluate thermal diffusion ratios.
mcedif   Evaluate mixture diffusion coefficients.
mceval   Evaluate a polynomial fit (using Horner's rule).
mcinit   Create Transport array sections by initializing integer offsets.
mcmcdt   Evaluate thermal diffusion coefficients and mixture thermal conductivities.
mcmdif   Evaluate ordinary multicomponent diffusion coefficients.
mdifv    Evaluate diffusion velocities.
mtrnpr   Evaluate transport coefficients.
newton   Perform modified Newton-Raphson method.
premix   Set up working storage and call fldriv.
reasen   Perform sensitivity analysis.
temp     Interpolate temperature.
timstp   Perform user-specified number of time steps.
twopnt   Solve two point boundary value problem.

APPENDIX B

NITROGEN CHEMISTRY IN COMBUSTION


The following is a description of the nitrogen combustion model [27] used in this thesis for performance testing.

B.1 Chemical species C P H H A A R SPECIES S G MOLECULAR TEMPERATURE ELEMENT COUNT CONSIDERED E E WEIGHT LOW HIGH H O C N ------------------------------------------------------------------------------1. CH4 G 0 16.04303 300.0 5000.0 4 0 1 0 2. CH3 G 0 15.03506 300.0 5000.0 3 0 1 0 3. CH2 G 0 14.02709 250.0 4000.0 2 0 1 0 4. CH G 0 13.01912 300.0 5000.0 1 0 1 0 5. CH2O G 0 30.02649 300.0 5000.0 2 1 1 0 6. HCO G 0 29.01852 300.0 5000.0 1 1 1 0 7. CO2 G 0 44.00995 300.0 5000.0 0 2 1 0 8. CO G 0 28.01055 300.0 5000.0 0 1 1 0 9. H2 G 0 2.01594 300.0 5000.0 2 0 0 0 10. H G 0 1.00797 300.0 5000.0 1 0 0 0 11. O2 G 0 31.99880 300.0 5000.0 0 2 0 0 12. O G 0 15.99940 300.0 5000.0 0 1 0 0 13. OH G 0 17.00737 300.0 5000.0 1 1 0 0 14. HO2 G 0 33.00677 300.0 5000.0 1 2 0 0 15. H2O2 G 0 34.01474 300.0 5000.0 2 2 0 0 16. H2O G 0 18.01534 300.0 5000.0 2 1 0 0 17. C2H G 0 25.03027 300.0 5000.0 1 0 2 0 18. C2H2 G 0 26.03824 300.0 5000.0 2 0 2 0 19. HCCO G 0 41.02967 300.0 4000.0 1 1 2 0 20. C2H3 G 0 27.04621 300.0 5000.0 3 0 2 0 21. C2H4 G 0 28.05418 300.0 5000.0 4 0 2 0 22. C2H5 G 0 29.06215 300.0 5000.0 5 0 2 0 23. C2H6 G 0 30.07012 300.0 4000.0 6 0 2 0 24. CH2OH G 0 31.03446 250.0 4000.0 3 1 1 0 25. CH3O G 0 31.03446 300.0 3000.0 3 1 1 0 26. H2CCCH G 0 39.05736 300.0 4000.0 3 0 3 0 27. C3H2 G 0 38.04939 300.0 5000.0 2 0 3 0 28. CH2(S) G 0 14.02709 300.0 4000.0 2 0 1 0 29. CH2CO G 0 42.03764 300.0 5000.0 2 1 2 0 30. C G 0 12.01115 300.0 5000.0 0 0 1 0 31. C4H2 G 0 50.06054 300.0 5000.0 2 0 4 0 32. H2CCCCH G 0 51.06851 300.0 4000.0 3 0 4 0 33. HCCOH G 0 42.03764 300.0 4000.0 2 1 2 0 34. N2 G 0 28.01340 300.0 5000.0 0 0 0 2


B.2 Chemical reactions  (k = A T**b exp(-E/RT))

      REACTIONS CONSIDERED                        A         b        E

   1. CH3+CH3(+M)=C2H6(+M)                    9.03E+16   -1.2     654.0
         Low pressure limit:  0.31800E+42  -0.70300E+01  0.27620E+04
         TROE centering:      0.60410E+00   0.69270E+04  0.13200E+03
         H2               Enhanced by  2.000E+00
         CO               Enhanced by  2.000E+00
         CO2              Enhanced by  3.000E+00
         H2O              Enhanced by  5.000E+00
   2. CH3+H(+M)=CH4(+M)                       6.00E+16   -1.0       0.0
         Low pressure limit:  0.80000E+27  -0.30000E+01  0.00000E+00
         SRI centering:       0.45000E+00   0.79700E+03  0.97900E+03
         H2               Enhanced by  2.000E+00
         CO               Enhanced by  2.000E+00
         CO2              Enhanced by  3.000E+00
         H2O              Enhanced by  5.000E+00
   3. CH4+O2=CH3+HO2                          7.90E+13    0.0   56000.0
   4. CH4+H=CH3+H2                            2.20E+04    3.0    8750.0
   5. CH4+OH=CH3+H2O                          1.60E+06    2.1    2460.0
   6. CH4+O=CH3+OH                            1.02E+09    1.5    8604.0
   7. CH4+HO2=CH3+H2O2                        1.80E+11    0.0   18700.0
   8. CH3+HO2=CH3O+OH                         2.00E+13    0.0       0.0
   9. CH3+O2=CH3O+O                           2.05E+19   -1.6   29229.0
  10. CH3+O=CH2O+H                            8.00E+13    0.0       0.0
  11. CH2OH+H=CH3+OH                          1.00E+14    0.0       0.0
  12. CH3O+H=CH3+OH                           1.00E+14    0.0       0.0
  13. CH3+OH=CH2+H2O                          7.50E+06    2.0    5000.0
  14. CH3+H=CH2+H2                            9.00E+13    0.0   15100.0
  15. CH3O+M=CH2O+H+M                         1.00E+14    0.0   25000.0
  16. CH2OH+M=CH2O+H+M                        1.00E+14    0.0   25000.0
  17. CH3O+H=CH2O+H2                          2.00E+13    0.0       0.0
  18. CH2OH+H=CH2O+H2                         2.00E+13    0.0       0.0
  19. CH3O+OH=CH2O+H2O                        1.00E+13    0.0       0.0
  20. CH2OH+OH=CH2O+H2O                       1.00E+13    0.0       0.0
  21. CH3O+O=CH2O+OH                          1.00E+13    0.0       0.0
  22. CH2OH+O=CH2O+OH                         1.00E+13    0.0       0.0
  23. CH3O+O2=CH2O+HO2                        6.30E+10    0.0    2600.0
  24. CH2OH+O2=CH2O+HO2                       1.48E+13    0.0    1500.0
  25. CH2+H=CH+H2                             1.00E+18   -1.6       0.0
  26. CH2+OH=CH+H2O                           1.13E+07    2.0    3000.0
  27. CH2+OH=CH2O+H                           2.50E+13    0.0       0.0
  28. CH+O2=HCO+O                             3.30E+13    0.0       0.0
  29. CH+O=CO+H                               5.70E+13    0.0       0.0
  30. CH+OH=HCO+H                             3.00E+13    0.0       0.0
  31. CH+CO2=HCO+CO                           3.40E+12    0.0     690.0
  32. CH+H=C+H2                               1.50E+14    0.0       0.0
  33. CH+H2O=CH2O+H                           1.17E+15   -0.8       0.0
  34. CH+CH2O=CH2CO+H                         9.46E+13    0.0    -515.0
  35. CH+C2H2=C3H2+H                          1.00E+14    0.0       0.0
  36. CH+CH2=C2H2+H                           4.00E+13    0.0       0.0
  37. CH+CH3=C2H3+H                           3.00E+13    0.0       0.0
  38. CH+CH4=C2H4+H                           6.00E+13    0.0       0.0
  39. C+O2=CO+O                               2.00E+13    0.0       0.0
  40. C+OH=CO+H                               5.00E+13    0.0       0.0
  41. C+CH3=C2H2+H                            5.00E+13    0.0       0.0
  42. C+CH2=C2H+H                             5.00E+13    0.0       0.0
  43. CH2+CO2=CH2O+CO                         1.10E+11    0.0    1000.0
  44. CH2+O=CO+H+H                            5.00E+13    0.0       0.0
  45. CH2+O=CO+H2                             3.00E+13    0.0       0.0
  46. CH2+O2=CO2+H+H                          1.60E+12    0.0    1000.0
  47. CH2+O2=CH2O+O                           5.00E+13    0.0    9000.0
  48. CH2+O2=CO2+H2                           6.90E+11    0.0     500.0
  49. CH2+O2=CO+H2O                           1.90E+10    0.0   -1000.0
  50. CH2+O2=CO+OH+H                          8.60E+10    0.0    -500.0
  51. CH2+O2=HCO+OH                           4.30E+10    0.0    -500.0
  52. CH2O+OH=HCO+H2O                         3.43E+09    1.2    -447.0
  53. CH2O+H=HCO+H2                           2.19E+08    1.8    3000.0
  54. CH2O+M=HCO+H+M                          3.31E+16    0.0   81000.0
  55. CH2O+O=HCO+OH                           1.80E+13    0.0    3080.0
  56. HCO+OH=H2O+CO                           1.00E+14    0.0       0.0
  57. HCO+M=H+CO+M                            2.50E+14    0.0   16802.0
         CO               Enhanced by  1.870E+00
         H2               Enhanced by  1.870E+00
         CH4              Enhanced by  2.810E+00
         CO2              Enhanced by  3.000E+00
         H2O              Enhanced by  5.000E+00
  58. HCO+H=CO+H2                             1.19E+13    0.2       0.0
  59. HCO+O=CO+OH                             3.00E+13    0.0       0.0
  60. HCO+O=CO2+H                             3.00E+13    0.0       0.0
  61. HCO+O2=HO2+CO                           3.30E+13   -0.4       0.0
  62. CO+O+M=CO2+M                            6.17E+14    0.0    3000.0
  63. CO+OH=CO2+H                             1.51E+07    1.3    -758.0
  64. CO+O2=CO2+O                             2.53E+12    0.0   47688.0
  65. HO2+CO=CO2+OH                           5.80E+13    0.0   22934.0
  66. C2H6+CH3=C2H5+CH4                       5.50E-01    4.0    8300.0
  67. C2H6+H=C2H5+H2                          5.40E+02    3.5    5210.0
  68. C2H6+O=C2H5+OH                          3.00E+07    2.0    5115.0
  69. C2H6+OH=C2H5+H2O                        8.70E+09    1.1    1810.0
  70. C2H4+H=C2H3+H2                          1.10E+14    0.0    8500.0
  71. C2H4+O=CH3+HCO                          1.60E+09    1.2     746.0
  72. C2H4+OH=C2H3+H2O                        2.02E+13    0.0    5955.0
  73. CH2+CH3=C2H4+H                          3.00E+13    0.0       0.0
  74. H+C2H4(+M)=C2H5(+M)                     2.21E+13    0.0    2066.0
         Low pressure limit:  0.63690E+28  -0.27600E+01 -0.54000E+02
         H2               Enhanced by  2.000E+00
         CO               Enhanced by  2.000E+00
         CO2              Enhanced by  3.000E+00
         H2O              Enhanced by  5.000E+00
  75. C2H5+H=CH3+CH3                          1.00E+14    0.0       0.0
  76. C2H5+O2=C2H4+HO2                        8.43E+11    0.0    3875.0
  77. C2H2+O=CH2+CO                           1.02E+07    2.0    1900.0
  78. C2H2+O=HCCO+H                           1.02E+07    2.0    1900.0
  79. H2+C2H=C2H2+H                           4.09E+05    2.4     864.3
  80. H+C2H2(+M)=C2H3(+M)                     5.54E+12    0.0    2410.0
         Low pressure limit:  0.26700E+28  -0.35000E+01  0.24100E+04
         H2               Enhanced by  2.000E+00
         CO               Enhanced by  2.000E+00
         CO2              Enhanced by  3.000E+00
         H2O              Enhanced by  5.000E+00
  81. C2H3+H=C2H2+H2                          4.00E+13    0.0       0.0
  82. C2H3+O=CH2CO+H                          3.00E+13    0.0       0.0
  83. C2H3+O2=CH2O+HCO                        4.00E+12    0.0    -250.0
  84. C2H3+OH=C2H2+H2O                        5.00E+12    0.0       0.0
  85. C2H3+CH2=C2H2+CH3                       3.00E+13    0.0       0.0
  86. C2H3+C2H=C2H2+C2H2                      3.00E+13    0.0       0.0
  87. C2H3+CH=CH2+C2H2                        5.00E+13    0.0       0.0
  88. OH+C2H2=C2H+H2O                         3.37E+07    2.0   14000.0
  89. OH+C2H2=HCCOH+H                         5.04E+05    2.3   13500.0
  90. OH+C2H2=CH2CO+H                         2.18E-04    4.5   -1000.0
  91. OH+C2H2=CH3+CO                          4.83E-04    4.0   -2000.0
  92. HCCOH+H=CH2CO+H                         1.00E+13    0.0       0.0
  93. C2H2+O=C2H+OH                           3.16E+15   -0.6   15000.0
  94. CH2CO+O=CO2+CH2                         1.75E+12    0.0    1350.0
  95. CH2CO+H=CH3+CO                          1.13E+13    0.0    3428.0
  96. CH2CO+H=HCCO+H2                         5.00E+13    0.0    8000.0
  97. CH2CO+O=HCCO+OH                         1.00E+13    0.0    8000.0
  98. CH2CO+OH=HCCO+H2O                       7.50E+12    0.0    2000.0
  99. CH2CO(+M)=CH2+CO(+M)                    3.00E+14    0.0   70980.0
         Low pressure limit:  0.36000E+16   0.00000E+00  0.59270E+05
 100. C2H+O2=CO+CO+H                          5.00E+13    0.0    1500.0
 101. C2H+C2H2=C4H2+H                         3.00E+13    0.0       0.0
 102. H+HCCO=CH2(S)+CO                        1.00E+14    0.0       0.0
 103. O+HCCO=H+CO+CO                          1.00E+14    0.0       0.0
 104. HCCO+O2=CO+CO+OH                        1.60E+12    0.0     854.0
 105. CH+HCCO=C2H2+CO                         5.00E+13    0.0       0.0
 106. HCCO+HCCO=C2H2+CO+CO                    1.00E+13    0.0       0.0
 107. CH2(S)+M=CH2+M                          1.00E+13    0.0       0.0
         H                Enhanced by  0.000E+00
 108. CH2(S)+CH4=CH3+CH3                      4.00E+13    0.0       0.0
 109. CH2(S)+C2H6=CH3+C2H5                    1.20E+14    0.0       0.0
 110. CH2(S)+O2=CO+OH+H                       3.00E+13    0.0       0.0
 111. CH2(S)+H2=CH3+H                         7.00E+13    0.0       0.0
 112. CH2(S)+H=CH2+H                          2.00E+14    0.0       0.0
 113. C2H+O=CH+CO                             5.00E+13    0.0       0.0
 114. C2H+OH=HCCO+H                           2.00E+13    0.0       0.0
 115. CH2+CH2=C2H2+H2                         4.00E+13    0.0       0.0
 116. CH2+HCCO=C2H3+CO                        3.00E+13    0.0       0.0
 117. CH2+C2H2=H2CCCH+H                       1.20E+13    0.0    6600.0
 118. C4H2+OH=C3H2+HCO                        6.66E+12    0.0    -410.0
 119. C3H2+O2=HCO+HCCO                        1.00E+13    0.0       0.0
 120. H2CCCH+O2=CH2CO+HCO                     3.00E+10    0.0    2868.0
 121. H2CCCH+O=CH2O+C2H                       2.00E+13    0.0       0.0
 122. H2CCCH+OH=C3H2+H2O                      2.00E+13    0.0       0.0
 123. C2H2+C2H2=H2CCCCH+H                     2.00E+12    0.0   45900.0
 124. H2CCCCH+M=C4H2+H+M                      1.00E+16    0.0   59700.0
 125. CH2(S)+C2H2=H2CCCH+H                    3.00E+13    0.0       0.0
 126. C4H2+O=C3H2+CO                          1.20E+12    0.0       0.0
 127. C2H2+O2=HCCO+OH                         2.00E+08    1.5   30100.0
 128. C2H2+M=C2H+H+M                          4.20E+16    0.0  107000.0
 129. C2H4+M=C2H2+H2+M                        1.50E+15    0.0   55800.0
 130. C2H4+M=C2H3+H+M                         1.40E+16    0.0   82360.0
 131. H2+O2=2OH                               1.70E+13    0.0   47780.0
 132. OH+H2=H2O+H                             1.17E+09    1.3    3626.0
 133. O+OH=O2+H                               4.00E+14   -0.5       0.0
 134. O+H2=OH+H                               5.06E+04    2.7    6290.0
 135. H+O2+M=HO2+M                            3.61E+17   -0.7       0.0
         H2O              Enhanced by  1.860E+01
         CO2              Enhanced by  4.200E+00
         H2               Enhanced by  2.860E+00
         CO               Enhanced by  2.110E+00
         N2               Enhanced by  1.260E+00
 136. OH+HO2=H2O+O2                           7.50E+12    0.0       0.0
 137. H+HO2=2OH                               1.40E+14    0.0    1073.0
 138. O+HO2=O2+OH                             1.40E+13    0.0    1073.0
 139. 2OH=O+H2O                               6.00E+08    1.3       0.0
 140. H+H+M=H2+M                              1.00E+18   -1.0       0.0
         H2               Enhanced by  0.000E+00
         H2O              Enhanced by  0.000E+00
         CO2              Enhanced by  0.000E+00
 141. H+H+H2=H2+H2                            9.20E+16   -0.6       0.0
 142. H+H+H2O=H2+H2O                          6.00E+19   -1.2       0.0
 143. H+H+CO2=H2+CO2                          5.49E+20   -2.0       0.0
 144. H+OH+M=H2O+M                            1.60E+22   -2.0       0.0
         H2O              Enhanced by  5.000E+00
 145. H+O+M=OH+M                              6.20E+16   -0.6       0.0
         H2O              Enhanced by  5.000E+00
 146. O+O+M=O2+M                              1.89E+13    0.0   -1788.0
 147. H+HO2=H2+O2                             1.25E+13    0.0       0.0
 148. HO2+HO2=H2O2+O2                         2.00E+12    0.0       0.0
 149. H2O2+M=OH+OH+M                          1.30E+17    0.0   45500.0
 150. H2O2+H=HO2+H2                           1.60E+12    0.0    3800.0
 151. H2O2+OH=H2O+HO2                         1.00E+13    0.0    1800.0

      NOTE:  A units mole-cm-sec-K, E units cal/mole
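
To illustrate how a row of the table is used, the following small program (written for this appendix; the names RATEEX and RK are not from Premix) evaluates k = A T**b exp(-E/RT) for reaction 132, OH+H2=H2O+H, at 1500 K, taking R = 1.987 cal/(mole K) to match the units of E:

      PROGRAM RATEEX
      IMPLICIT DOUBLE PRECISION (A-H,O-Z)
C     parameters of reaction 132 from the table above
      PARAMETER (A=1.17D+09, B=1.3D0, E=3626.0D0)
C     gas constant in cal/(mole K), matching the units of E
      PARAMETER (RUC=1.987D0)
      T = 1500.0D0
C     modified Arrhenius form:  k = A * T**b * exp(-E/(R*T))
      RK = A * T**B * EXP(-E/(RUC*T))
      WRITE (*,*) 'k(1500 K) = ', RK
      END

At 1500 K this gives k of roughly 4.7E+12 in the mole-cm-sec-K units of the note above.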


APPENDIX C

SAMPLE PROGRAM DRIVER


      PROGRAM DRIVER
      IMPLICIT DOUBLE PRECISION (A-H,O-Z), INTEGER (I-N)
C
C     LENLWK allocates the logical working space
C     LENIWK allocates the integer working space
C     LENRWK allocates the real working space
C     LENCWK allocates the character working space
C     LENSYM is the length of a character string
C
      PARAMETER (LENLWK=200, LENIWK=6000, LENRWK=800000, LENCWK=100,
     1           LENSYM=16)
C
C     NMAX is the total number of grid points allowed
      PARAMETER (NMAX=65)
C
C     All storage needed by the flame code is allocated in the
C     following arrays:  LWORK is for logical variables, IWORK is
C     for integer variables, RWORK is for real variables, and
C     CWORK is for character variables.
C
      DIMENSION LWORK(LENLWK), IWORK(LENIWK), RWORK(LENRWK)
      CHARACTER CWORK(LENCWK)*(LENSYM)
C
C     LIN    is the unit from which user input is read
C     LOUT   is the unit to which printed output is written
C     LRIN   is the unit from which the restart file is read
C     LROUT  is the unit to which the save file is written
C     LRCRVR is the unit to which the scratch save file is written
C     LINKCK is the unit from which the Chemkin linking file is read
C     LINKTP is the unit from which the Transport linking file is read
C
      DATA LIN, LOUT, LRIN, LROUT, LRCRVR /5, 6, 14, 15, 16/,
     1     LINKCK, LINKTP /25, 35/
C
C     the user input file is read from the preconnected unit LIN;
C     open the restart file
      OPEN (LRIN, FORM='UNFORMATTED', FILE='restart',
     1      STATUS='UNKNOWN')
C     open the save output file
      OPEN (LROUT, STATUS='UNKNOWN', FORM='UNFORMATTED', FILE='save')
C     open the recover output file
      OPEN (LRCRVR, STATUS='UNKNOWN', FORM='UNFORMATTED',
     1      FILE='recover')
C     open the Chemkin link file
      OPEN (LINKCK, STATUS='OLD', FORM='UNFORMATTED', FILE='cklink')
C     open the Transport link file
      OPEN (LINKTP, STATUS='OLD', FORM='UNFORMATTED', FILE='tplink')
C
      CALL PREMIX (NMAX, LIN, LOUT, LINKCK, LINKTP, LRIN, LROUT,
     1             LRCRVR, LENLWK, LWORK, LENIWK, IWORK, LENRWK,
     2             RWORK, LENCWK, CWORK)
      END


APPENDIX D

LIBRARY INITIALIZATION ROUTINES


      SUBROUTINE CKINIT (LENIWK, LENRWK, LENCWK, LINC, LOUT, ICKWRK,
     1                   RCKWRK, CCKWRK)
C
C     Reads the linking file and creates the internal work arrays
C     ICKWRK, RCKWRK, and CCKWRK.  CKINIT must be called before any
C     other CHEMKIN subroutine is called.  The work arrays must then
C     be made available as input to the other CHEMKIN subroutines.
C
      DIMENSION ICKWRK(*), RCKWRK(*)
      CHARACTER CCKWRK(*)*(*), VERS*16, PREC*16
      LOGICAL IOK, ROK, COK, KERR
      COMMON /CKSTRT/ NMM , NKK , NII , MXSP, MXTB, MXTP, NCP , NCP1,
     1                NCP2, NCP2T,NPAR, NLAR, NFAR, NLAN, NFAL, NREV,
     2                NTHB, NRLT, NWL , IcMM, IcKK, IcNC, IcPH, IcCH,
     3                IcNT, IcNU, IcNK, IcNS, IcNR, IcLT, IcRL, IcRV,
     4                IcWL, IcFL, IcFO, IcKF, IcTB, IcKN, IcKT, NcAW,
     5                NcWT, NcTT, NcAA, NcCO, NcRV, NcLT, NcRL, NcFL,
     6                NcKT, NcWL, NcRU, NcRC, NcPA, NcK1, NcK2, NcK3,
     7                NcK4, NcI1, NcI2, NcI3, NcI4
C
C     Data about the machine dependent constants is carried in
      COMMON /MACH/ SMALL,BIG,EXPARG
C
      DATA RU,RUC,PA /8.314E7, 1.987, 1.01325E6/
      SMALL = 10.0D0**(-300)
      BIG   = 10.0D0**(+300)
      EXPARG = LOG(BIG)
C
      WRITE (LOUT,15)
   15 FORMAT (/1X,' CKLIB:  Chemical Kinetics Library',
     1        /1X,'         CHEMKIN-II Version 2.8, January 1991',
     2        /1X,'         DOUBLE PRECISION')
C
      CALL CKLEN (LINC, LOUT, LI, LR, LC)
      IOK = (LENIWK .GE. LI)
      ROK = (LENRWK .GE. LR)
      COK = (LENCWK .GE. LC)
      IF (.NOT. IOK) WRITE (LOUT, 300) LI
      IF (.NOT. ROK) WRITE (LOUT, 350) LR
      IF (.NOT. COK) WRITE (LOUT, 375) LC
      IF (.NOT.IOK .OR. .NOT.ROK .OR. .NOT.COK) STOP
C
      REWIND LINC
      READ (LINC, ERR=110) VERS, PREC, KERR
      READ (LINC, ERR=110) LENI, LENR, LENC, MM, KK, II,
     1                     MAXSP, MAXTB, MAXTP, NTHCF, NIPAR, NITAR,
     2                     NIFAR, NRV, NFL, NTB, NLT, NRL, NW, NCHRG
C
      IF (LEN(CCKWRK(1)) .LT. 16) THEN
         WRITE (LOUT,475)
         STOP
      ENDIF
C
      NMM  = MM
      NKK  = KK
      NII  = II
      MXSP = MAXSP
      MXTB = MAXTB
      MXTP = MAXTP
      NCP  = NTHCF
      NCP1 = NTHCF+1
      NCP2 = NTHCF+2
      NCP2T = NCP2*(MAXTP-1)
      NPAR = NIPAR
      NLAR = NITAR
      NFAR = NIFAR
      NTHB = NTB
      NLAN = NLT
      NFAL = NFL
      NREV = NRV
      NRLT = NRL
      NWL  = NW
C
C     APPORTION work arrays
C
C     SET ICKWRK(*)=1 TO FLAG THAT CKINIT HAS BEEN CALLED
C
      ICKWRK(1) = 1
C
C     STARTING LOCATIONS OF INTEGER SPACE
C
C!    elemental composition of species
      IcNC = 2
C!    species phase array
      IcPH = IcNC + KK*MM
C!    species charge array
      IcCH = IcPH + KK
C!    # of temperatures for fit
      IcNT = IcCH + KK
C!    stoichiometric coefficients
      IcNU = IcNT + KK
C!    species numbers for the coefficients
      IcNK = IcNU + MAXSP*II
C!    # of non-zero coefficients  (0=irreversible)
      IcNS = IcNK + MAXSP*II
C!    # of reactants
      IcNR = IcNS + II
C!    Landau-Teller reaction numbers
      IcLT = IcNR + II
C!    Reverse Landau-Teller reactions
      IcRL = IcLT + NLAN
C!    Fall-off reaction numbers
      IcFL = IcRL + NRLT
C!    Fall-off option numbers
      IcFO = IcFL + NFAL
C!    Fall-off enhanced species
      IcKF = IcFO + NFAL
C!    Third-body reaction numbers
      IcTB = IcKF + NFAL
C!    number of 3rd bodies for above
      IcKN = IcTB + NTHB
C!    array of species #'s for above
      IcKT = IcKN + NTHB
C!    Reverse parameter reaction numbers
      IcRV = IcKT + MAXTB*NTHB
C!    Radiation wavelength reactions
      IcWL = IcRV + NREV
      ITOT = IcWL + NWL - 1
C
C     STARTING LOCATIONS OF CHARACTER SPACE
C
C!    start of element names
      IcMM = 1
C!    start of species names
      IcKK = IcMM + MM
      ITOC = IcKK + KK - 1
C
C     STARTING LOCATIONS OF REAL SPACE
C
C!    atomic weights
      NcAW = 1
C!    molecular weights
      NcWT = NcAW + MM
C!    temperature fit array for species
      NcTT = NcWT + KK
C!    thermodynamic coefficients
      NcAA = NcTT + MAXTP*KK
C!    Arrhenius coefficients (3)
      NcCO = NcAA + (MAXTP-1)*NCP2*KK
C!    Reverse coefficients
      NcRV = NcCO + NPAR*II
C!    Landau-Teller #'s for NLT reactions
      NcLT = NcRV + NPAR*NREV
C!    Reverse Landau-Teller #'s
      NcRL = NcLT + NLAR*NLAN
C!    Fall-off parameters for NFL reactions
      NcFL = NcRL + NLAR*NRLT
C!    3rd body coef'nts for NTHB reactions
      NcKT = NcFL + NFAR*NFAL
C!    wavelength
      NcWL = NcKT + MAXTB*NTHB
C!    universal gas constant
      NcRU = NcWL + NWL
C!    universal gas constant in units
      NcRC = NcRU + 1
C!    pressure of one atmosphere
      NcPA = NcRC + 1
C!    internal work space of length kk
      NcK1 = NcPA + 1
C!    'ditto'
      NcK2 = NcK1 + KK
C!    'ditto'
      NcK3 = NcK2 + KK
C!    'ditto'
      NcK4 = NcK3 + KK
      NcI1 = NcK4 + KK
      NcI2 = NcI1 + II
      NcI3 = NcI2 + II
      NcI4 = NcI3 + II
      NTOT = NcI4 + II - 1
C
C     SET UNIVERSAL CONSTANTS IN CGS UNITS
C
      RCKWRK(NcRU) = RU
      RCKWRK(NcRC) = RUC
      RCKWRK(NcPA) = PA
C
C!    element names, !atomic weights
      READ (LINC,err=111) (CCKWRK(IcMM+M-1), RCKWRK(NcAW+M-1), M=1,MM)
C
C!    species names, !composition, !phase, !charge, !molec weight,
C!    # of fit temps, !array of temps, !fit coeff'nts
      READ (LINC,err=222) (CCKWRK(IcKK+K-1),
     1     (ICKWRK(IcNC+(K-1)*MM + M-1),M=1,MM),
     2     ICKWRK(IcPH+K-1),
     3     ICKWRK(IcCH+K-1),
     4     RCKWRK(NcWT+K-1),
     5     ICKWRK(IcNT+K-1),
     6     (RCKWRK(NcTT+(K-1)*MAXTP + L-1),L=1,MAXTP),
     7     ((RCKWRK(NcAA+(L-1)*NCP2+(K-1)*NCP2T+N-1),
     8     N=1,NCP2), L=1,(MAXTP-1)), K = 1,KK)
C
      IF (II .EQ. 0) RETURN
C
C!    # spec,reactants, !Arr. coefficients, !stoic coef, !species numbers
      READ (LINC,end=100,err=333)
     1     (ICKWRK(IcNS+I-1), ICKWRK(IcNR+I-1),
     2     (RCKWRK(NcCO+(I-1)*NPAR+N-1), N=1,NPAR),
     3     (ICKWRK(IcNU+(I-1)*MAXSP+N-1),
     4     ICKWRK(IcNK+(I-1)*MAXSP+N-1), N=1,MAXSP),
     5     I = 1,II)
C
      IF (NREV .GT. 0) READ (LINC,err=444)
     1   (ICKWRK(IcRV+N-1), (RCKWRK(NcRV+(N-1)*NPAR+L-1),L=1,NPAR),
     2    N = 1,NREV)
C
      IF (NFAL .GT. 0) READ (LINC,err=555)
     1   (ICKWRK(IcFL+N-1), ICKWRK(IcFO+N-1), ICKWRK(IcKF+N-1),
     2   (RCKWRK(NcFL+(N-1)*NFAR+L-1),L=1,NFAR),N=1,NFAL)
C
      IF (NTHB .GT. 0) READ (LINC,err=666)
     1   (ICKWRK(IcTB+N-1), ICKWRK(IcKN+N-1),
     2   (ICKWRK(IcKT+(N-1)*MAXTB+L-1),
     3    RCKWRK(NcKT+(N-1)*MAXTB+L-1),L=1,MAXTB),N=1,NTHB)
C
      IF (NLAN .GT. 0) READ (LINC,err=777)
     1   (ICKWRK(IcLT+N-1), (RCKWRK(NcLT+(N-1)*NLAR+L-1),L=1,NLAR),
     2    N=1,NLAN)
C
      IF (NRLT .GT. 0) READ (LINC,err=888)
     1   (ICKWRK(IcRL+N-1), (RCKWRK(NcRL+(N-1)*NLAR+L-1),L=1,NLAR),
     2    N=1,NRLT)
C
      IF (NWL .GT. 0) READ (LINC,err=999)
     1   (ICKWRK(IcWL+N-1), RCKWRK(NcWL+N-1), N=1,NWL)
C
  100 CONTINUE
      RETURN
C
  110 WRITE (LOUT,*) ' Error reading linking file...'
      STOP
  111 WRITE (LOUT,*) ' Error reading element data...'
      STOP
  222 WRITE (LOUT,*) ' Error reading species data...'
      STOP
  333 WRITE (LOUT,*) ' Error reading reaction data...'
      STOP
  444 WRITE (LOUT,*) ' Error reading reverse Arrhenius parameters...'
      STOP
  555 WRITE (LOUT,*) ' Error reading Fall-off data...'
      STOP
  666 WRITE (LOUT,*) ' Error reading third-body data...'
      STOP
  777 WRITE (LOUT,*) ' Error reading Landau-Teller data...'
      STOP
  888 WRITE (LOUT,*) ' Error reading reverse Landau-Teller data...'
      STOP
  999 WRITE (LOUT,*) ' Error reading Wavelength data...'
      STOP
C
  300 FORMAT (10X,'ICKWRK MUST BE DIMENSIONED AT LEAST ',I5)
  350 FORMAT (10X,'RCKWRK MUST BE DIMENSIONED AT LEAST ',I5)
  375 FORMAT (10X,'CCKWRK MUST BE DIMENSIONED AT LEAST ',I5)
  475 FORMAT (10X,'CHARACTER LENGTH OF CCKWRK MUST BE AT LEAST 16 ')
      END
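
The pointer apportionment above implements a simple convention: every datum lives at a fixed offset from a base pointer into a flat work array. The following fragment (hypothetical; WTDEMO is not a CHEMKIN routine) shows how a client would retrieve the molecular weights that CKINIT read into RCKWRK(NcWT), ..., RCKWRK(NcWT+KK-1):

C     Hypothetical illustration of the CKSTRT pointer convention:
C     print each species' molecular weight out of the flat real
C     work array, given the base pointer NCWT computed by CKINIT.
      SUBROUTINE WTDEMO (KK, NCWT, RCKWRK, LOUT)
      IMPLICIT DOUBLE PRECISION (A-H,O-Z), INTEGER (I-N)
      DIMENSION RCKWRK(*)
      DO 10 K = 1, KK
         WRITE (LOUT,*) ' species ', K, ' weight ', RCKWRK(NCWT+K-1)
   10 CONTINUE
      RETURN
      END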


1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65

C--------------------------------------------------------------------C SUBROUTINE MCINIT (LINKMC, LOUT, LENIMC, LENRMC, IMCWRK, RMCWRK) C C*****double precision IMPLICIT DOUBLE PRECISION (A-H, O-Z), INTEGER (I-N) C*****END double precision C C*****single precision C IMPLICIT REAL (A-H, O-Z), INTEGER (I-N) C*****END single precision C CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC C C SUBROUTINE MCINIT (LINKMC, LOUT, LENIMC, LENRMC, IMCWRK, RMCWRK) C C THIS SUBROUTINE SERVES TO READ THE LINKING FILE FROM C THE FITTING CODE AND TO CREATE THE INTERNAL STORAGE C AND WORK ARRAYS, IMCWRK(*) AND RMCWRK(*). MCINIT C MUST BE CALLED BEFORE ANY OTHER TRANSPORT SUBROUTINE C IS CALLED. IT MUST BE CALLED AFTER THE CHEMKIN C PACKAGE IS INITIALIZED. C C INPUTC LINKMC - LOGICAL UNIT NUMBER OF THE LINKING FILE. C FITTING CODE WRITE TO DEFAULT UNIT 35 C LOUT - LOGICAL UNIT NUMBER FOR PRINTED OUTPUT. C LENIMC - ACTUAL DIMENSION OF THE INTEGER STORAGE AND WORKING C SPACE, ARRAY IMCWRK(*). LENIMC MUST BE AT LEAST: C LENIMC = 4*KK + NLITE C WHERE, KK = NUMBER OF SPECIES. C NLITE = NUMBER OF SPECIES WITH MOLECULAR C WEIGHT LESS THAN 5. C LENRMC - ACTUAL DIMENSION OF THE FLOATING POINT STORAGE AND C WORKING SPACE, ARRAY RMCWRK(*). LENRMC MUST BE AT C LEAST: C LENRMC = KK*(19 + 2*NO + NO*NLITE) + (NO+15)*KK**2 C WHERE, KK = NUMBER OF SPECIES. C NO = ORDER OF THE POLYNOMIAL FITS, C DEFAULT, NO=4. C NLITE = NUMBER OF SPECIES WITH MOLECULAR C WEIGHT LESS THAN 5. C C WORKC IMCWRK - ARRAY OF INTEGER STORAGE AND WORK SPACE. THE STARTING C ADDRESSES FOR THE IMCWRK SPACE ARE STORED IN C COMMON /MCMCMC/. C DIMENSION IMCWRK(*) AT LEAST LENIMC. C RMCWRK - ARRAY OF FLOATING POINT STORAGE AND WORK SPACE. THE C STARTING ADDRESSES FOR THE RMCWRK SPACE ARE STORED IN C COMMON /MCMCMC/. C DIMENSION RMCWRK(*) AT LEAST LENRMC. C CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC C DIMENSION IMCWRK(*), RMCWRK(*) CHARACTER*16 VERS, PREC LOGICAL IOK, ROK C COMMON /MCMCMC/ RU, PATMOS, SMALL, NKK, NO, NLITE, INLIN, 1 IKTDIF, IPVT, NWT, NEPS, NSIG, NDIP, NPOL, 2 NZROT, NLAM, NETA, NDIF, NTDIF, NXX, NVIS, 3 NXI, NCP, NCROT, NCINT, NPARK, NBIND, NEOK, 4 NSGM, NAST, NBST, NCST, NXL, NR, NWRK, K3 C

84

66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128

C C C C C C C C C C

THE FOLLOWING NUMBER SMALL IS USED IN THE MIXTURE DIFFUSION COEFFICIENT CALCULATION. ITS USE ALLOWS A SMOOTH AND WELL DEFINED DIFFUSION COEFFICIENT AS THE MIXTURE APPROACHES A PURE SPECIES, EVEN THOUGH STRICTLY SPEAKING THERE DOES NOT EXIST A DIFFUSION COEFFICIENT IN THIS CASE. THE VALUE OF "SMALL" SHOULD BE SMALL RELATIVE TO ANY SPECIES MOLE FRACTION OF IMPORTANCE, BUT LARGE ENOUGH TO BE REPRESENTED ON THE COMPUTER. SMALL = 1.0E-20 RU = 8.314E+07 PATMOS= 1.01325E+06

C CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC C C WRITE VERSION NUMBER C WRITE (LOUT, 15) 15 FORMAT( 1/' TRANLIB: Multicomponent transport library,', 2/' CHEMKIN-II Version 1.6, November 1990', C*****double precision 3/' DOUBLE PRECISION') C*****END double precision C*****single precision C 3/' SINGLE PRECISION') C*****END single precision C C CHANGES FROM VERSION 1.0: C 1. Changed REAL*8 to DOUBLE PRECISION C CHANGES FROM VERSION 1.1: C 1. Eliminated many GOTO's C CHANGES FOR VERSION 1.3 C 1. SUBROUTINE MCLEN C CHANGES FOR VERSION 1.4 C 1. Linking file has additional record to indicate its C version, machine precision, and error status C 2. Linking file has required lengths for integer, real C work arrays. C 3. New Subroutines MCPNT, MCSAVE read, write linking C file information, work arrays C CHANGES FOR VERSION 1.6 C 1. Linking file versions 1.8 and 1.9 added (TRANLIB V.1.5 C and TRANFIT V.1.8 were intermediate versions which may C not be legitimate; TRANFIT V.1.9 is actually a C correction to V.1.7, and TRANLIB 1.6 is an update of C V.1.4) C CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC C C READ THE PROBLEM SIZE C CALL MCLEN (LINKMC, LOUT, LI, LR) IOK = (LENIMC .GE. LI) ROK = (LENRMC .GE. LR) IF (.NOT. IOK) WRITE (LOUT, 300) LI IF (.NOT. ROK) WRITE (LOUT, 350) LR IF (.NOT.IOK .OR. .NOT.ROK) STOP C REWIND LINKMC READ (LINKMC, ERR=999) VERS, PREC, KERR READ (LINKMC, ERR=999) LI, LR, NO, NKK, NLITE

85

129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200

C

C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C

C

C

C

C

NK NK2 K2 K3 K32 NKT

= = = = = =

NO*NKK NO*NKK*NKK NKK*NKK 3*NKK K3*K3 NO*NKK*NLITE

APPORTION THE REAL WORK SPACE THE POINTERS HAVE THE FOLLOWING MEANINGS: NWT - THE SPECIES MOLECULAR WEIGHTS. NEPS - THE EPSILON/K WELL DEPTH FOR THE SPECIES. NSIG - THE COLLISION DIAMETER FOR THE SPECIES. NDIP - THE DIPOLE MOMENTS FOR THE SPECIES. NPOL - THE POLARIZABILITIES FOR THE SPECIES. NZROT - THE ROTATIONAL RELAXATION COLLISION NUMBERS. NLAM - THE COEFFICIENTS FOR THE CONDUCTIVITY FITS. NETA - THE COEFFICIENTS FOR THE VISCOSITY FITS. NTDIF - THE COEFFICIENTS FOR THE THERMAL DIFFUSION RATIO FITS. NXX - THE MOLE FRACTIONS. NVIS - THE SPECIES VISCOSITIES. NXI - THE ROTATIONAL RELAXATION COLLISION NUMBERS BEFORE THE PARKER COFFECTION. NCP - THE SPECIES SPECIFIC HEATS. NCROT - THE ROTATIONAL PARTS OF THE SPECIFIC HEATS. NCINT - THE INTERNAL PARTS OF THE SPECIFIC HEATS. NPARK - THE ROTATIONAL RELAXATION COLLISION NUMBERS AFTER THE PARKER CORRECTION. NBIND - THE BINARY DIFFUSION COEFFICIENTS. NEOK - THE MATRIX OF REDUCED WELL DEPTHS. NSGM - THE MATRIX OF REDUCED COLLISION DIAMETERS. NAST - THE MATRIX OF A* COLLISION INTEGRALS FOR EACH SPECIES PAIR. NBST - THE MATRIX OF B* COLLISION INTEGRALS FOR EACH SPECIES PAIR. NCST - THE MATRIX OF C* COLLISION INTEGRALS FOR EACH SPECIES PAIR. NXL - THE "L" MATRIX. NR - THE RIGHT HAND SIDES OF THE LINEAR SYSTEM INVOLVING THE "L" MATRIX. NWRK - THE WORK SPACE NEEDED BY LINPACK TO SOLVE THE "L" MATRIX LINEAR SYSTEM. NWT = NEPS = NSIG = NDIP = NPOL = NZROT=

1 NWT + NKK NEPS + NKK NSIG + NKK NDIP + NKK NPOL + NKK

NLAM = NETA = NDIF = NTDIF=

NZROT + NKK NLAM + NK NETA + NK NDIF + NK2

NXX = NVIS = NXI = NCP = NCROT= NCINT= NPARK=

NTDIF + NO*NKK*NLITE NXX + NKK NVIS + NKK NXI + NKK NCP + NKK NCROT + NKK NCINT + NKK

NBIND= NEOK = NSGM = NAST = NBST = NCST =

NPARK + NKK NBIND + K2 NEOK + K2 NSGM + K2 NAST + K2 NBST + K2

NXL

= NCST + K2

86

201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239

C C C C C C C C C

C C C

C C C C

NR = NXL + K32 NWRK = NR + K3 NTOT = NWRK + K3 - 1 APPORTION THE INTEGER WORK SPACE THE POINTERS HAVE THE FOLLOWING MEANING: INLIN - THE INDICATORS FOR THE MOLECULE LINEARITY. IKTDIF- THE SPECIES INDICIES FOR THE "LIGHT" SPECIES. IPVT - THE PIVOT INDICIES FOR LINPACK CALLS. INLIN = IKTDIF= IPVT = ITOT =

1 INLIN + NKK IKTDIF + NLITE IPVT + K3 - 1

READ THE DATA FROM THE LINK FILE READ (LINKMC, ERR=999) PATMOS, (RMCWRK(NWT+N-1), 1 RMCWRK(NEPS+N-1), RMCWRK(NSIG+N-1), 2 RMCWRK(NDIP+N-1), RMCWRK(NPOL+N-1), RMCWRK(NZROT+N-1), 3 IMCWRK(INLIN+N-1), N=1,NKK), 4 (RMCWRK(NLAM+N-1), N=1,NK), (RMCWRK(NETA+N-1), N=1,NK), 5 (RMCWRK(NDIF+N-1), N=1,NK2), 6 (IMCWRK(IKTDIF+N-1), N=1,NLITE), (RMCWRK(NTDIF+N-1), N=1,NKT) SET EPS/K AND SIG FOR ALL I,J PAIRS CALL MCEPSG (NKK, RMCWRK(NEPS), RMCWRK(NSIG), RMCWRK(NDIP), 1 RMCWRK(NPOL), RMCWRK(NEOK), RMCWRK(NSGM) ) 300 FORMAT (10X,'IMCWRK MUST BE DIMENSIONED AT LEAST ', I5) 350 FORMAT (10X,'RMCWRK MUST BE DIMENSIONED AT LEAST ', I5) RETURN 999 WRITE (LOUT, *) ' Error reading Transport linking file...' STOP END
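
The LENIMC and LENRMC requirements quoted in the header comments can be checked against the mechanism of Appendix B. The short program below (written for this appendix, not part of the Transport library) evaluates the two formulas for KK=34 species with the default fit order NO=4, assuming NLITE=2 on the grounds that H and H2 are the only species in Section B.1 with molecular weight below 5:

      PROGRAM MCSIZE
      IMPLICIT INTEGER (A-Z)
C     KK = number of species, NO = polynomial fit order,
C     NLITE = number of species with molecular weight below 5
C     (assumed values for the Appendix B mechanism)
      KK = 34
      NO = 4
      NLITE = 2
C     minimum work-array lengths from MCINIT's header comments
      LENIMC = 4*KK + NLITE
      LENRMC = KK*(19 + 2*NO + NO*NLITE) + (NO+15)*KK**2
      WRITE (*,*) ' LENIMC must be at least ', LENIMC
      WRITE (*,*) ' LENRMC must be at least ', LENRMC
      END

For this mechanism the integer space is small (138 words), while the real space is dominated by the (NO+15)*KK**2 matrix term, roughly 23,000 words.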
