1995, Santa Barbara, California

Generating Parallel Code from Object Oriented Mathematical Models

Niclas Andersson and Peter Fritzson
Department of Computer and Information Science
Linköping University, S-581 83 Linköping, Sweden
E-mail: [email protected], [email protected]
Phone: +46 13 281000

September 21, 1995
Abstract

For a long time, efficient use of parallel computers has been hindered by dependencies introduced in software through low-level implementation practice. In this paper we present a programming environment and language called ObjectMath (Object oriented Mathematical language for scientific computing), which aims at eliminating this problem by allowing the user to represent mathematical equation-based models directly in the system. The system performs analysis of mathematical models to extract parallelism, and automatically generates parallel code for numerical solution. In the context of industrial applications in mechanical analysis, we have so far primarily explored generation of parallel code for solving systems of ordinary differential equations (ODEs), in addition to preliminary work on generating code for solving partial differential equations. Two approaches to extracting parallelism have been implemented and evaluated: extracting parallelism at the equation system level and at the single equation level, respectively. We found that for several applications the corresponding systems of equations do not partition well into subsystems, which means that the equation system level approach is of restricted general applicability. Thus, we focused on the equation-level approach, which yielded significant parallelism for the solution of ODE systems. For the bearing simulation applications presented here, the achieved speedup is, however, critically dependent on low communication latency of the parallel computer.
1 Introduction

Parallel processing has become a widely used approach for increasing the throughput of modern computers. Development in this area has been promoted by applications that need immense computing power, such as weather prediction, chemical modeling, and various mechanical simulation problems. Many of these problems are solved by dedicated application programs, which are hard to understand and even harder to port between different parallel architectures without major restructuring or loss in efficiency. The problems should therefore be expressed at the highest possible level of abstraction, to reduce the dependence between instances of machines and programs.

The work which we describe in this paper is done in ObjectMath [6, 7], a high-level programming environment for scientific computing. It includes mathematical modeling, symbolic computation and object-oriented constructs, which are both good for mathematical modeling and very useful when the problem is transformed into a low-level, executable program (Figures 1 and 2). In this paper, it is assumed that there are processors, operating systems and imperative languages in which parallelism can be expressed. Instead of struggling to find parallelism in low-level Fortran [15, 19, 10], we attempt to extract parallelism at a high level, where more knowledge about the original problem is still available.
1.1 The Ideal Programming Environment: Programming In High-Level Equations
The ideal high-level programming environment would automatically transform systems of equations into efficient symbolic and numerical programs. It would select optimization routines with good convergence properties for the given problem. The environment would also aid in formulating equations given geometrical constraints, and in transforming equations between different coordinate systems. However, fully automatic versions of some of these capabilities will be hard to achieve. It is more realistic to assume that the user will work in dialogue with the interactive system, and that the user can supply valuable hints and information that will guide the system in choosing the right algorithms and transformations. Another important advantage of an equational representation over a Fortran implementation is that more inherent parallelism of the problem is preserved. Thus better results can be expected when generating code for massively parallel machines.

Some desired capabilities of the programming environment are listed below:

- Modeling support in expressing systems of equations, e.g. handling of geometrical constraints and coordinate transformations.
- Integration of object-oriented techniques for better structuring of equational models.
- Algebraic transformations of equations.
- Transformation of equations into programs for computation on massively parallel hardware.
- Evaluation of numerical experiments.
- Graphical presentation and visualization.

The ObjectMath environment provides several of these capabilities. An overview of an early version of this environment can be found in [8], whereas a more recent description appears in [7]. The main topic of this paper, however, is the generation of parallel code from equation-based models.

Force equilibrium:

    $F^{(WN_i)}_{W_i,I} + F^{(WN_i)}_{W_i,E} + F^{(WN_i)}_{W_i,ext} = 0$

Moment equilibrium:

    $M^{(WN_i)}_{W_i,I} + M^{(WN_i)}_{W_i,E} + M^{(WN_i)}_{W_i,ext} + p^{(WN_i)}_{W_i,I} \times F^{(WN_i)}_{W_i,I} + p^{(WN_i)}_{W_i,E} \times F^{(WN_i)}_{W_i,E} + p^{(WN_i)}_{W_i,ext} \times F^{(WN_i)}_{W_i,ext} = 0$

INSTANCE BodyW[i] INHERITS Roller(W[i])
  ...
  (* Equations *)
  Eq[1] := F[W[i]][BodyIr] + F[W[i]][BodyEr] + F[W[i]][Ext] == { 0, 0, 0 };
  Eq[2] := M[W[i]][BodyIr] + ... == { 0, 0, 0 };
  ...
END BodyW[i];

Figure 1: The same equations for Moment and Force equilibrium in both mathematical syntax and in ObjectMath syntax.

2 Potential Parallelism
As mentioned previously, one of the main problems in extracting parallelism from application programs written in languages such as Fortran is the low level of abstraction. During the process of implementing a typical numerical simulation program, many unnecessary dependencies and constraints are introduced because of implementation choices. The program is thus, in a sense, over-specified. This means that many possibilities for parallelization are lost. If instead the problem is represented at the highest level of abstraction, i.e. the mathematical model consisting of equations, it should be possible to extract essentially all parallelism from the application model. Low-level executable source code for solution of systems of equations should be automatically generated from the mathematical model and combined with high-quality parallel numerical algorithms to form the parallel application code.
2.1 Kinds of Parallelism
Parallelism can appear at several levels of a problem description. In this paper we investigate two ways of extracting parallelism from the equation-based model, and how to combine these with parallel solution algorithms.
Parallelism at the system of equations level. If the system of equations can be partitioned into two or more subsystems which can be solved independently of each other, then the computation can be parallelized accordingly. The dependency analysis is based on the standard algorithm for finding strongly connected components in a directed graph [1, pages 222-226]. The equations are partitioned by this algorithm into sets of mutually dependent equations (i.e. separate systems of equations), and the reduced, acyclic dependency graph is built. The reduced graph is then used to schedule the solution of the equation systems, i.e. to determine the order in which the systems have to be solved and which systems can be solved in parallel. An additional possibility is pipeline parallelism between the solution of equation systems: values produced from the solution of one system are continuously passed as input to the solution of another system.
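To make the analysis concrete, the following sketch shows how such a partition and schedule can be computed. This is a minimal illustration in C++, not the ObjectMath implementation: the graph representation is invented for the example, and Kosaraju's two-pass algorithm is used in place of whichever SCC algorithm from [1] the system actually employs.

#include <algorithm>
#include <vector>
using namespace std;

// Equations are nodes; an edge u -> v means "equation u uses a variable
// computed by equation v", i.e. v must be solved before (or together with) u.
struct DepGraph {
    int n;
    vector<vector<int>> adj, radj;
    explicit DepGraph(int n) : n(n), adj(n), radj(n) {}
    void addDependency(int u, int v) { adj[u].push_back(v); radj[v].push_back(u); }
};

static void dfs1(const DepGraph& g, int u, vector<char>& seen, vector<int>& order) {
    seen[u] = 1;
    for (int v : g.adj[u]) if (!seen[v]) dfs1(g, v, seen, order);
    order.push_back(u);                        // post-order finish time
}
static void dfs2(const DepGraph& g, int u, int c, vector<int>& comp) {
    comp[u] = c;
    for (int v : g.radj[u]) if (comp[v] < 0) dfs2(g, v, c, comp);
}

// Kosaraju's two-pass SCC algorithm: comp[u] is the subsystem (set of
// mutually dependent equations) that equation u belongs to. Components are
// numbered in topological order, so a dependency always has a larger index.
vector<int> sccPartition(const DepGraph& g, int& nComp) {
    vector<char> seen(g.n, 0);
    vector<int> order, comp(g.n, -1);
    for (int u = 0; u < g.n; ++u) if (!seen[u]) dfs1(g, u, seen, order);
    nComp = 0;
    for (int i = g.n - 1; i >= 0; --i)
        if (comp[order[i]] < 0) dfs2(g, order[i], nComp++, comp);
    return comp;
}

// Schedule over the reduced acyclic graph: level[c] is the length of the
// longest chain of subsystems that must be solved before subsystem c.
// Subsystems with equal levels are independent and can be solved in parallel.
vector<int> scheduleLevels(const DepGraph& g, const vector<int>& comp, int nComp) {
    vector<vector<int>> members(nComp);
    for (int u = 0; u < g.n; ++u) members[comp[u]].push_back(u);
    vector<int> level(nComp, 0);
    for (int c = nComp - 1; c >= 0; --c)       // dependencies have larger indices
        for (int u : members[c])
            for (int v : g.adj[u])
                if (comp[v] != c) level[c] = max(level[c], level[comp[v]] + 1);
    return level;
}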
Parallelism at the equation level. If, during the solution process, computational parts originating from different equations are independent of each other and can be evaluated in parallel, then we have parallelism at the equation level. For ordinary differential equations in explicit form, the derivatives are defined by the right-hand sides of the equations, which are independent of each other and thus can be evaluated in parallel.
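As a toy illustration of this (our own sketch, not the generated code discussed in Section 3): with the right-hand sides available as side-effect-free functions, each component of the derivative vector can be evaluated by a separate thread.

#include <functional>
#include <thread>
#include <vector>
using namespace std;

// ydot[i] = f[i](y, t): every right-hand side reads the shared state y but
// writes only its own slot of ydot, so the evaluations are independent.
void evalRHSParallel(const vector<function<double(const vector<double>&, double)>>& f,
                     const vector<double>& y, double t, vector<double>& ydot) {
    vector<thread> pool;
    for (size_t i = 0; i < f.size(); ++i)
        pool.emplace_back([&f, &y, &ydot, t, i] { ydot[i] = f[i](y, t); });
    for (thread& th : pool) th.join();
}

One thread per equation is of course far too fine-grained in practice; Section 3.2 describes how equations are instead grouped into tasks whose sizes match the machine.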
Figure 2: The ObjectMath high-level modeling environment. The window to the right is the browser, showing the overall structure of a model. The other windows display the code in various objects.
Parallelism in the solution algorithm. This kind of parallelism originates from the specific numerical methods used by the numerical solver. Depending on the algorithm, various kinds of parallelism can be exploited.
2.2 Initial Value Problems

The methods applicable to extract parallelism from equation-based models will to some extent depend on the class of mathematical model. In this paper we focus on a large group of problems which can be expressed as initial-value problems, giving rise to ordinary differential equations (ODEs) [11, 2]. However, we have recently started work to extend the approach to a wider problem domain by also investigating models based on partial differential equations.

An initial value problem is solved numerically by applying a general, pre-written ODE-solver to the equation system. Existing solvers use inherently sequential methods. The computed solution of an initial value problem consists of a large number of calculated approximations, where every approximation depends on the previous one. To make the final result exact enough, every calculation step must keep a high degree of accuracy, forcing the solver to take very small steps and approach the solution at a very slow pace. This makes this kind of computation very time-consuming. Attempts have been made to restructure and adapt numerical ODE solution algorithms to become partly parallel [3, 16], but with limited success.

2.3 Parallelism for Ordinary Differential Equations

There are several ways to exploit parallelism for an initial value problem computation:

- Parallelism across the method.
- Parallelism across time.
- Parallelism across the system (or problem).

The method and time techniques cannot be exploited without redesigning the ODE solver, which is currently outside the scope of this work. Here we focus on exploiting problem-dependent parallelism in the system of equations. Such parallelism can be found at two levels:

- Parallelism at the equation system level. If the set of ODEs can be partitioned into two or more sets which can be solved independently of each other, the computation can be parallelized accordingly.
- Parallelism at the equation level. The right-hand sides of the equations are independent of each other and can therefore be evaluated in parallel.

When we find two or more sets of ODEs which are independent of each other, as the equation system technique suggests, we split these sets into separate problems. How independent the sets are is determined by dependence analysis. The gains of such a partitioning are:

- We get speedup due to parallelism even if the derivative computation time is short. Otherwise, if the time to compute the derivatives is small compared to the communication latency, distributing the computation of the derivatives according to the equation level approach will instead increase the total time.
- The ODE-solver can, for each ODE system, choose its own step size independently of the others. In each ODE system there are fewer equations and hence fewer bounds on the step size. Consequently, the average step size may increase.
- The ODE-solver's internal computation time decreases due to fewer state variables. If the solver uses an implicit method, we can get quadratic speedup thanks to a smaller Jacobian matrix, as the following cost estimate indicates.
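The quadratic-speedup remark can be made concrete with a rough cost estimate (our own back-of-the-envelope argument, assuming a dense Jacobian and direct linear algebra). Factoring the $n \times n$ Jacobian in one implicit step costs $O(n^3)$, so splitting the system into $k$ independent subsystems of size $n/k$ reduces the total linear-algebra work to

\[ k \cdot O\big((n/k)^3\big) = O\big(n^3 / k^2\big), \]

a $k^2$ reduction even before the $k$ subsystems are solved in parallel.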
Figure 3: Dependencies between equations and strongly connected components (SCCs) in the dependency graph for the hydroelectric power plant example model.
2.4 Structure of an ODE Solver
Figure 4: Geometry for the 2D rolling bearing.

An ODE-solver consists of a general, carefully implemented algorithm which numerically approximates the solution to a given system of ODEs, given an initial state, some controlling parameters, and the points where solutions are desired:

- The initial state (often trivial to calculate) and the control parameters are given once, at the beginning of the computation.
- The desired solution can be either one single point or a whole range of points where solutions are wanted.
- The system of ODEs is a function $\dot{y}(t) = f(y(t), t)$ which calculates the first-order derivatives $\dot{y}(t)$ for each state variable from a given state $y(t)$; thus the equation system consists of only first-order ODEs. The variable t is called the free variable and often represents the time in a simulation. The function should be side-effect free, to allow as much parallelism as possible to be extracted. From now on we call this function RHS (for Right-Hand Side).

The solution process can be described as follows. From a known point $y_n$ on the solution curve, the solver makes an approximation $y_{n+1}$ to $y(t_{n+1})$. This approximation is an extrapolation of either previously calculated points (multi-step methods) or intermediate extrapolations (single-step or Runge-Kutta methods) [16]. Each time an extrapolation is calculated, the function RHS is invoked. Since the solver usually needs to take a large number of steps to reach the desired solution, the communication between the solver and RHS is intense. If the method used by the ODE-solver is implicit, the extrapolation point depends on itself and is calculated by iteration. In that case it can be necessary to calculate the Jacobian matrix, $J = \partial f(y,t) / \partial y$. We can, of course, use the RHS-function to compute an approximation, but it is usually possible to provide the solver with an extra function dedicated to computing the Jacobian.
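To make the solver/RHS interaction concrete, here is a deliberately simple sketch in C++ (not LSODA, the solver actually used later in the paper): an explicit Euler step, and the finite-difference Jacobian approximation a solver falls back on when no dedicated Jacobian function is supplied.

#include <vector>
using namespace std;

using RHSFun = void (*)(double t, const vector<double>& y, vector<double>& ydot);

// One explicit Euler step: a single RHS invocation per step. Real solvers
// (multi-step or Runge-Kutta) invoke RHS several times per step, which is
// why the solver-RHS traffic dominates.
void eulerStep(RHSFun rhs, double& t, vector<double>& y, double h) {
    vector<double> ydot(y.size());
    rhs(t, y, ydot);
    for (size_t i = 0; i < y.size(); ++i) y[i] += h * ydot[i];
    t += h;
}

// Finite-difference approximation of J = df/dy, column by column; this is
// roughly what a solver does internally when no Jacobian function is
// supplied, at the price of n extra RHS calls. J must be pre-sized to n x n.
void fdJacobian(RHSFun rhs, double t, vector<double> y,
                vector<vector<double>>& J, double eps = 1e-8) {
    size_t n = y.size();
    vector<double> f0(n), f1(n);
    rhs(t, y, f0);
    for (size_t j = 0; j < n; ++j) {
        double save = y[j];
        y[j] += eps;
        rhs(t, y, f1);
        for (size_t i = 0; i < n; ++i) J[i][j] = (f1[i] - f0[i]) / eps;
        y[j] = save;
    }
}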
2.5 Examples of Parallelism Using the Systems of Equations Approach

To get an indication of the effectiveness of extracting parallelism using the method of partitioning systems of equations, we have modeled some small applications in ObjectMath: 1) a hydroelectric power plant, and 2) a 2D bearing. The 2D bearing model has some properties in common with the much more complex 3D models that describe realistic bearings which can be manufactured.

An ObjectMath model of a hydroelectric power plant has been created, including objects like turbines, spillways, dams, and regulators. The model is based on an actual Swedish power plant, Älvkarleby Kraftverk, although the faithfulness to the original varies considerably within the model. The focus is on water levels and water flow through the plant. Thus, the model can be used for verifying dam safety margins, for example. The model is described in some detail in [5]. Dependencies between its equations and the strongly connected components are depicted in Figure 3.

The 2D rolling bearing model was designed as a simplified version of the much more complex realistic 3D bearing models. It has some properties in common with the industrially relevant models, but is more amenable to detailed analysis in a limited space than the complex models. Figure 4 shows the geometry of the bearing, consisting of an outer ring, an inner ring and ten rolling elements. The equations describing the motions and interactions of these elements have been represented in an ObjectMath model, whose inheritance hierarchy and composition structure is shown in Figure 5. This model is described in more detail in [5]. The 2D rolling bearing is more complex from a mathematical point of view than the other two examples. All its equations are strongly connected except one, as can be seen in Figure 6.

Figure 5: ObjectMath inheritance hierarchy and composition structure for the 2D rolling bearing model.

Figure 6: Dependencies between equations and SCCs in the 2D rolling bearing model (for one roller).

2.5.1 Discussion of Parallelization at the Equation System Level

Is the dependence analysis we have done useful in our parallelization process? If we assume that it is possible to gain performance by partitioning the ODE-system, then the dependence analysis is always useful. Thus the question we really should ask is whether there are realistic problems that can be partitioned. Looking at the examples we see that there really is more than one strongly connected component (SCC). A closer examination of the SCCs shows that there is often one SCC where the "main" problem is located, and one or more peripheral SCCs which are either trivial or uninteresting to calculate (or both). Researchers in control theory and application modeling confirm this view to some extent, although there also seem to be realistic control applications which partition nicely.

To have this analysis as the only guide for parallelization seems rather unreliable, since using it on real problems in many cases does not yield much parallelism. However, the analysis and the visualization of dependencies are very helpful tools for the model implementor. It is easy to find missing dependencies, or dependencies that should not be there. Also, uninteresting parts of the problem can be removed at an early stage so that no computing power is wasted.

2.5.2 Parallelization at the Equation Level

Since the partitioning into systems of equations does not yield much parallelism for some classes of applications, in particular bearing simulation, we focus our efforts on the equation level approach to generate parallel code, which is described in the rest of this paper. This approach is guaranteed to yield significant amounts of parallelism if there are enough equations, since all equation right-hand sides can be computed in parallel.
The main problem is to get performance out of this parallelism by using appropriate code generation and scheduling techniques.
3 System Architecture and Code Generation
In the introduction we briefly described an ideal programming environment, and some aspects of the ObjectMath environment. Figure 7 shows the structure of the ObjectMath system in some additional detail. An application problem is described as an object-oriented mathematical model. This model can then be inspected, transformed, and used for generation of parallel code, which is combined with library routines, compiled, and run on a parallel MIMD computer. Finally, there is some support for visualization of numerical results.

Figure 7: The overall structure of the ObjectMath environment.
3.1 Implementation of the ObjectMath Compiler and Code Generator
The architecture of the ObjectMath implementation has been changed several times, due to extensions of the system and feedback from the users. Figure 8 shows the structure of both the old and the new compiler. The old compiler is written in Scheme. The intermediate representation for this compiler consists of abstract syntax trees represented as S-expressions, corresponding to Mathematica's internal format. This compiler performed a number of transformations, but left name and scope analysis to be handled by Mathematica's context mechanism. This worked well initially, but turned out to be a serious drawback when ObjectMath was extended to allow modeling of composition [6].
Figure 8: The architecture of two generations of the ObjectMath system (not including the browser). The latest version keeps a symbol table which is used by both the transformer and the code generator.
The old code generator was implemented in the Mathematica language, because of the availability of algebraic transformations and other powerful operations. However, some drawbacks in the Mathematica language and implementation made it inefficient and hard to maintain. Also, some information which is easily available to the compiler had to be derived by extensive analysis in the code generator, which only had access to the Mathematica code generated by the compiler, not the original ObjectMath model. These limitations made it necessary to develop a completely new implementation when adding new capabilities such as generation of parallel code and generation of Fortran 90 code, which required a more general type analysis than the previous C++-oriented mechanism. The new implementation uses the Cocktail toolkit, a set of compiler writing tools [9] for parsing, pattern-matching, tree transformation, attribute evaluation, etc. In the new implementation, the compiler and code generator run in the same address space. This makes it possible for the code generator to directly access the internal representation of both the ObjectMath model and the generated Mathematica code. Sometimes algebraic expressions should be evaluated symbolically in Mathematica before code is generated from them. This can easily be handled, as the code generator communicates with Mathematica via the MathLink [20] protocol.

The input to the code generator consists of a list of abstract syntax trees, compatible with Mathematica's full form internal representation. This list is extracted automatically from the ObjectMath system. The expression transformer in the code generator accepts a list of first order differential equations, where some subexpressions have been annotated with type information. Since the equation part consists of first order differential equations, the left-hand side is always a derivative. Various transformations are done, including removing the derivatives and replacing the equations by assignments, where the right-hand sides are the right-hand sides from the equations. The result represents what really needs to be computed by the generated code when using a specific solver.
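The derivative-removal step can be caricatured as follows. This is a minimal sketch with an invented Expr type, not the real transformer, which works on Mathematica full form trees with om$Type annotations; Derivative[1][x][t] is flattened here to a head "Derivative1" with arguments {x, t}.

#include <memory>
#include <string>
#include <vector>
using namespace std;

// Caricature of a full form tree: a head applied to arguments, so that
// Equal[Derivative[1][x][t], rhs] has head "Equal" with two arguments.
struct Expr {
    string head;
    vector<shared_ptr<Expr>> args;
};

struct Assignment { string lhs; shared_ptr<Expr> rhs; };

// Every equation arrives as Equal[x'(t), rhs]; the rewrite strips the
// derivative and introduces the derivative variable (xdot, ydot, ... as in
// Figure 11) as the assignment target.
Assignment equationToAssignment(const Expr& eq) {
    const Expr& d = *eq.args[0];               // Derivative1[x, t] (flattened)
    const string& stateVar = d.args[0]->head;  // "x"
    return Assignment{stateVar + "dot", eq.args[1]};
}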
3.2 Generating Parallel Code
The parallelization stage of the code generator groups all small assignments into one task, and splits large assignments obtained from the equations into several tasks. The dependence relation between the tasks determines the communication between them. This forms a directed acyclic graph, which is the input to the scheduler. A simple supervisor-worker scheme (Figure 10) is currently used to schedule the computation of the tasks. To minimize the amount of data sent, communication analysis is needed to find out which data should be distributed. Also, common subexpressions are eliminated; no subexpressions are shared between the tasks. By wrapping the generated code with some pre-written parts, which set up the message-passing communication and perform the communication and dynamic scheduling, we get a complete program.

Figure 9: The ObjectMath 4.0 code generator in more detail. Modules drawn with dashed lines are not yet fully implemented.

A very simple example, where x' = y + c and y' = -x, is shown both in prefix intermediate code and as generated parallel Fortran 90 code for a distributed-memory MIMD machine in Figure 11. Note that the generated code for all right-hand sides has been put into the single subroutine RHS. The derivatives have been replaced by the variables xdot and ydot. The right-hand sides are very simple here, but in code generated for bearing applications, for example, they consist of several tens of thousands of floating point operations. Currently, the decision on which processors the separate expressions are computed is postponed until runtime to make load balancing possible. This makes it impractical to generate special code for packing and sending messages, since the contents of the messages are not known at compile time. We therefore go one step further and implement slightly more general communication routines in the runtime system, which are invoked from RHS as needed. Unnecessary assignments in the generated code will be removed by the Fortran compiler by means of optimizations based on data-flow analysis.

In addition to RHS we need some functions to set the starting point for the simulation. To make it possible for the programmer to use the variable names from the ObjectMath model, these functions must also be generated by the code generator. Furthermore, since it is essential that the start values for the simulation can be changed without re-compilation of the application, we generate a function which reads values from a text file and assigns them to the right variables.

At this time, we do not consider data parallelism. This is due to the characteristics of our application: most of the arrays used in the application are of size 1×3 or 3×3, since we are dealing with physical three-dimensional objects. These arrays are too small to benefit from data parallelism.
Figure 10: The supervisor/worker model of task scheduling. The solver is the supervisor, and assigns tasks to the workers, which evaluate the equation right-hand sides of the parallelized RHS.
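A minimal sketch of one such supervisor/worker round, written against MPI purely for illustration: the paper's runtime system uses its own message-passing routines, and the per-equation evaluator and schedule lookup below are hypothetical stand-ins.

#include <mpi.h>
#include <vector>

// Stand-in for the generated per-equation right-hand side (hypothetical).
static double evaluateRhs(int i, const std::vector<double>& y) {
    return -y[i];  // placeholder dynamics
}

// One RHS evaluation round. The supervisor (rank 0, running the solver)
// broadcasts the current state vector; every worker evaluates the equations
// assigned to it, writing into its own slots of a local buffer; the
// supervisor then collects all derivative slots with a sum-reduction
// (each slot is written by exactly one worker, the rest stay 0).
void rhsRound(std::vector<double>& y, std::vector<double>& ydot, int rank,
              const std::vector<int>& myEquations) {
    int n = static_cast<int>(y.size());
    MPI_Bcast(y.data(), n, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    std::vector<double> local(n, 0.0);
    if (rank != 0)
        for (int i : myEquations)
            local[i] = evaluateRhs(i, y);
    MPI_Reduce(local.data(), ydot.data(), n, MPI_DOUBLE, MPI_SUM,
               0, MPI_COMM_WORLD);
}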
3.2.1 The Existing Solver
There are several ODE-solvers available. We have used a solver named LSODA from the ODE-solver package ODEPACK [12]. This solver was written in Fortran 77 by Alan C. Hindmarsh and Linda R. Petzold [14]. It is one of the solvers which implements BDF (backward differentiation formulas) methods, which are usually used to solve stiff ODEs [11]. Before calling the solver, the user has to supply several parameters and start conditions, and provide the solver with a function which computes the derivatives of the current state. This function is the target of the parallelization efforts of the code generator. There is also a possibility for the user to provide the solver with an extra function that computes the Jacobian, instead of having the solver do it internally (which is usually very expensive). If the user can provide this function, the computation time may be reduced drastically.
3.2.2 Target Platforms
Our initial target architectures are MIMD machines. The Parsytec GC/PP has 64 nodes, where each node contains two PowerPC 601 processors and four T805 transputers. The other target machine is a shared-memory MIMD computer, a SPARC Center 2000 with 8 processors.
3.2.3 Scheduling Parallel ODE Solution
Normal form (without type annotations):

{ { x'[t] == y[t], y'[t] == -x[t] },
  { t, tstart, tend } }

Prefix form with type annotations:

List[
  List[
    Equal[
      Derivative[1][om$Type[x, om$Real]][om$Type[t, om$Real]],
      om$Type[y, om$Real]],
    Equal[Derivative[1][y][t], Minus[x[t]]]],
  List[t, om$Type[tstart, om$Real], om$Type[tend, om$Real]] ]

Generated parallel Fortran 90 code:

subroutine RHS(workerid, yin, yout)
  integer workerid
  real(double) yin(2), yout(2)
  select case (workerid)
  case (1)
    y = yin(2)
    xdot = y
    yout(1) = xdot
  case (2)
    x = yin(1)
    ydot = -x
    yout(2) = ydot
  end select
end subroutine

Figure 11: Example of the normal form of equations, the prefix intermediate code form with type annotations, and the generated SPMD Fortran 90 code.

In most cases a real problem contains more differential equations than the target machine has processors. Therefore the scheduler is responsible for clustering tasks into a number of groups, corresponding to the number of processors available on the machine. One method of doing this is to predict the estimated execution time (or weight) of each task, to be able to distribute the load as evenly as possible. As the scheduler has the predicted execution time of each task, and all tasks are currently independent of each other, it can use the very simple largest-processing-time (LPT) scheduling algorithm [4] to construct an efficient schedule, as sketched below.

However, there may be conditional expressions within the right-hand sides. These may be impossible to predict statically, which makes it difficult to estimate the execution time. During simulation, such conditions can cause the load on different processors to vary over time and temporarily reduce the overall performance. This imbalance can be avoided by dynamically adapting the schedule to the varying load. We use the elapsed times for right-hand side evaluations during the previous iteration step to predict the execution times during the next step. This information is used to regularly update the schedule. This semi-dynamic version of the LPT algorithm consumes less than 1% of the execution time for the 2D bearing simulation examples investigated so far.

An additional problem is that the LPT algorithm does not take communication latency into account. As the application, and thus the number of ODEs, grows, larger messages need to be sent between the solver process and all the workers. This must be handled efficiently to make the application scalable. Currently, every variable that might be used is passed to the worker processors, i.e. all variables in the state vector. This scheme is used because of the dynamic scheduling strategy: since the work is redistributed semi-dynamically by the scheduler, we also have to compose the messages semi-dynamically at runtime. Such composition of smaller messages, instead of sending the whole state, will be implemented in the future.
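A compact version of LPT list scheduling is sketched below (our own sketch; the task weights would come from the static size estimates, or, in the semi-dynamic variant described above, from the measured evaluation times of the previous step).

#include <algorithm>
#include <functional>
#include <queue>
#include <vector>
using namespace std;

struct Task { int id; double weight; };  // weight = predicted evaluation time

// LPT list scheduling [4]: consider the tasks in order of decreasing weight
// and always give the next task to the currently least-loaded processor.
vector<vector<int>> lptSchedule(vector<Task> tasks, int nProc) {
    sort(tasks.begin(), tasks.end(),
         [](const Task& a, const Task& b) { return a.weight > b.weight; });
    priority_queue<pair<double, int>, vector<pair<double, int>>,
                   greater<pair<double, int>>> load;   // (total load, processor)
    for (int p = 0; p < nProc; ++p) load.push({0.0, p});
    vector<vector<int>> assignment(nProc);
    for (const Task& t : tasks) {
        auto [w, p] = load.top();
        load.pop();
        assignment[p].push_back(t.id);
        load.push({w + t.weight, p});
    }
    return assignment;
}

Rescheduling then amounts to refreshing the weights with the measured times and re-running lptSchedule.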
3.3 Generating Code For the 2D Bearing Application
Figure 12: Speedup curves for the 2D bearing example computed on the Parsytec GC/PP and the SPARC Center 2000 (#RHS-calls/s versus number of processors).

The chosen bearing simulation application is based on the simple 2D model of a cylindrical rolling bearing described in Section 2.5. The ObjectMath system currently generates serial code from the large 3D models, and will soon be able to generate parallel code from these models as well. From its 560-line representation in the interactive ObjectMath environment, the 2D model expands into 11859 lines of type-annotated Mathematica full form intermediate code. From this, the code generator produces 10913 lines of Fortran 90 code, of which 4709 lines are variable declarations. The common subexpression elimination (CSE) extracts 4642 common subexpressions. The Fortran 90 statements are readable but hard to understand, since many subexpressions have been extracted from their context. If we instead generate serial Fortran 90 code, i.e. allow the CSE to optimize all equation right-hand sides together rather than only equation by equation, we obtain 4301 lines of Fortran 90 code (1840 common subexpressions). This substantial reduction is apparently caused by different equations having several large subexpressions in common. These cannot be shared when the equations are scheduled as separate tasks; hence the larger number of subexpressions in the parallel code. In order to reduce this number and produce more efficient parallel code, we will have to extract some of the larger common subexpressions and compute them in parallel.
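The per-task CSE just described is essentially value numbering; a minimal sketch of the idea follows (our simplification, not the actual eliminator, which operates on the Mathematica expression trees).

#include <map>
#include <string>
#include <vector>
using namespace std;

// Value-numbering flavor of common subexpression elimination: identical
// (operator, operand...) tuples get one number, i.e. one temporary
// variable t<k> in the generated code.
struct CSETable {
    map<string, int> known;           // canonical key -> value number
    vector<string> defs;              // defs[k] = definition of t<k>
    int number(const string& op, const vector<int>& operands) {
        string key = op;
        for (int o : operands) key += "," + to_string(o);
        auto [it, inserted] = known.emplace(key, (int)defs.size());
        if (inserted) defs.push_back(key);   // first occurrence: emit a temp
        return it->second;                   // later occurrences reuse it
    }
};

Applied per task, a subexpression shared between two tasks is numbered once in each, which is exactly the duplication observed above.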
4 Performance Measurements

The performance has been measured on two different computers, one with shared memory and one with distributed memory (described earlier in Section 3.2.2). A message of 1 byte takes 4 µs to propagate to another processor on the shared-memory architecture, and 140 µs on the distributed-memory machine. We are only interested in measuring the computation of the RHS-function, since this is the target of our parallelization. The speed on the different architectures is shown in Figure 12. On the shared-memory architecture (with the low latency of shared memory) we get an almost linear speedup up to seven processors. Since the computer runs a time-sharing operating system (UNIX), we cannot exploit the whole machine; hence the "knee" at the end of the speedup curve. The speed on the distributed-memory machine reaches a peak at four processors. When using more processors, the latency and network contention become too large to gain additional performance. However, the performance is better for a larger problem. To be able to increase the performance, the problem has to have a larger granularity. This can be achieved by using more thorough dependency analysis and task partitioning algorithms.
5 Related Work

Most systems do not generate numeric code from equation-based models. Of those that do generate code, we know of only two that generate parallel code: a recent version of FINGER [17, 18], which generates parallel code for specialized FEM computations, and the Sinapse system [13], which can generate some parallel code for problems in its application domain.
6 Conclusions and Future Work

There is a strong need for efficient high-level tools and languages in scientific computing, to enhance the programming process and to exploit the maximum parallelism of applications. We feel that the ObjectMath approach makes an important contribution towards satisfying part of this need. Complex mathematical equations and functions can be expressed at a high level of abstraction rather than as procedural code. Object-oriented features allow better structuring of models and permit reuse of equations through inheritance.

We have explored two ways of extracting parallelism from equation-based models, and tested this approach on three small applications. The first approach detects groups of inter-dependent equations that form strongly connected components (SCCs) in the dependency graph of the equations. Such groups represent subsystems of equations that can be solved in parallel or in a pipeline. However, there seems to be a high degree of connectivity for many real applications. This makes most equations interdependent, and prevents extracting parallelism in the form of such subsystems of equations. Concerning our three examples, the hydroelectric power station model and the trivial servo example could be reasonably parallelized through such partitioning, whereas the 2D bearing model only yielded two SCCs, where all the computation was embedded in one of them. Thus, the technique of extracting parallelism through subsystems of equations is highly application dependent and cannot in general be expected to pay off. In all cases, though, the analysis and visualization are very useful for the problem implementor, who can easily find missing or incorrect dependencies. Also, unnecessary parts of the model can be removed at an early stage to avoid wasting computation.

The second approach of extracting parallelism from equation-based models is more fine-grained, and yields significant amounts of parallelism. For systems of ordinary differential equations in explicit form, the derivatives need to be computed during each computational step of the solution process. The derivatives are defined by the right-hand sides of the equations, which can be computed in parallel. We have focused on evaluating this approach for rolling bearing simulation applications in cooperation with industry. Some speedups have been obtained for a small 2D rolling bearing application, on a low-latency SPARC Center 2000 and on a medium-latency Parsytec GC/PP. The scalability is, however, dependent on low latency and high bandwidth of the parallel machine, and on computationally heavy right-hand sides of the equations. These conditions can be fulfilled with the larger 3D bearing applications. Preliminary analysis and test runs of subsets of these applications indicate that a potential speedup of 100-300 will be possible for large bearing problems.

Future work will continue to improve the efficiency and scheduling of the automatically generated code. Better integration of different code optimization techniques is needed. This will be evaluated for the large 3D bearing applications mentioned above. We have also started to extend the domain of equation systems for which code can be generated to partial differential equations, where fluid dynamics applications are common.
7 Acknowledgments

Lars Viklund designed the overall architecture of the ObjectMath 4.0 environment. Rickard Westman designed the ObjectMath model of a hydroelectric power station and implemented the type analysis of the system. Patrick Hägglund designed the ObjectMath model for the 2D bearing application. Patrick Nordling provided knowledge in the mathematical field. Dag Fritzson at SKF ERC provided expertise on the application domain.
References

[1] Alfred V. Aho, John E. Hopcroft, and Jeffrey D. Ullman. Data Structures and Algorithms. Addison-Wesley Publishing Company, 1983.

[2] François E. Cellier. Continuous System Modeling. Springer-Verlag, 1991.

[3] Yi-Ling F. Chiang, Ji-Suing Ma, Kuo-Lin Hu, and Chia-Yo Chang. Parallel multischeme computation. Journal of Scientific Computing, 3(3):289-306, 1988.

[4] Edward G. Coffman, Jr and Peter J. Denning. Operating Systems Theory. Prentice Hall, 1973.

[5] The PREPARE Consortium. Dependency analysis and scheduling for ordinary differential equations. Ref. PELAB-2-Scheduling, rel. 1.1+, February 1994.

[6] Peter Fritzson, Vadim Engelson, and Lars Viklund. Variant handling, inheritance and composition in the ObjectMath computer algebra environment. In Alfonso Miola, editor, Design and Implementation of Symbolic Computation Systems, volume 722 of Lecture Notes in Computer Science, pages 145-160. Springer-Verlag, 1993.

[7] Peter Fritzson, Lars Viklund, Dag Fritzson, and Johan Herber. ObjectMath: an environment for high-level mathematical modeling and programming in scientific computing. Accepted for publication in IEEE Software.

[8] Peter Fritzson, Lars Viklund, Johan Herber, and Dag Fritzson. Industrial application of object-oriented mathematical modeling and computer algebra in mechanical analysis. In Georg Heeg, Boris Magnusson, and Bertrand Meyer, editors, Technology of Object-Oriented Languages and Systems - TOOLS 7, pages 167-181. Prentice Hall, 1992.

[9] Josef Grosch and Helmut Emmelmann. A tool box for compiler construction. In Dieter Hammer, editor, Compiler Compilers, volume 477 of Lecture Notes in Computer Science, pages 106-116. Springer-Verlag, 1990.

[10] Manish Gupta and Prithviraj Banerjee. Demonstration of automatic data partitioning techniques for parallelizing compilers on multicomputers. IEEE Transactions on Parallel and Distributed Systems, 3(2):179-193, March 1992.

[11] E. Hairer and G. Wanner. Solving Ordinary Differential Equations II: Stiff and Differential-Algebraic Problems. Springer-Verlag, 1991.

[12] Alan C. Hindmarsh. ODEPACK, a systematized collection of ODE solvers. IMACS Transactions on Scientific Computing, 1:55-64, 1983.

[13] Elaine Kant. Synthesis of mathematical modeling software. IEEE Software, 10(3):30-41, May 1993.

[14] Linda Petzold. Automatic selection of methods for solving stiff and nonstiff systems of ordinary differential equations. SIAM J. Sci. Stat. Comput., 4(1):136-148, March 1983.

[15] Zhiyu Shen, Zhiyuan Li, and Pen-Chung Yew. An empirical study of Fortran programs for parallelizing compilers. IEEE Transactions on Parallel and Distributed Systems, 1(3):356-364, July 1990.

[16] P. J. van der Houwen and B. P. Sommeijer. Iterated Runge-Kutta methods on parallel computers. SIAM J. Sci. Stat. Comput., 12(5):1000-1028, September 1991.

[17] Paul S. Wang. FINGER: A symbolic system for automatic generation of numerical programs in finite element analysis. Journal of Symbolic Computation, 2:305-316, 1986.

[18] Paul S. Wang. Graphical user interfaces and automatic generation of sequential and parallel code for scientific computing. In IEEE CompCon Spring, 1988.

[19] Michael E. Wolf and Monica S. Lam. A loop transformation theory and an algorithm to maximize parallelism. IEEE Transactions on Parallel and Distributed Systems, 2(4):452-471, October 1991.

[20] Wolfram Research, Inc., P.O. Box 6059, Champaign, IL 61826-6059, USA. MathLink Reference Guide, 1993. Version 2.2.