Workstation Clustering: A Powerful Tool for Numerical Simulation in Flight Sciences and Space Research

Michael Resch† & Hans Babovsky‡

† RUS, Allmandring 30, D-70550 Stuttgart, e-mail: [email protected]
‡ Wissenschaftliches Zentrum der IBM Heidelberg, Vangerowstr. 18, D-69115 Heidelberg
Abstract:
Numerical simulation in flight sciences and space research is an extremely compute-intensive task. Within the special research project SFB 259, several sub-projects deal with such numerical simulations in different fields of space research. It was clear from the beginning of the SFB that such simulations can be performed only on powerful computing systems. This paper shows how parallel computing on a cluster of workstations helps to provide that compute power. We present the tools available for programming, discuss the sub-projects involved in the cluster project, and give the results of the parallelization efforts.

Zusammenfassung: Numerical simulations in flight sciences and space research involve a large computational effort. Within the SFB 259, several sub-projects deal with such computations in different areas of space research. From the beginning of the SFB it was clear that such simulations can only be carried out on very powerful systems. This contribution shows how a cluster of workstations can deliver the required performance. We present tools and strategies for parallelization. The main part of the article deals with the parallelization results achieved within the SFB.
1 Introduction

Numerical simulation is part of the research in flight sciences and space research. Mainly in the fields of aerodynamics and thermodynamics, much work is done in numerically simulating processes that are difficult or expensive to study in experiments. The main drawback of numerical simulations so far has been their intensive need for computing power. Currently the most promising approach in high performance computing is parallel computing. Among the different solutions offered to gain performance from concurrently operating computing facilities, the cluster approach is an extremely cost-effective and simple one. The SFB 259 has thus decided to establish a cluster of RS/6000 workstations in a joint project with DLR, RUS and IBM to provide computing power to the members of the SFB.

Usage of parallel systems makes it necessary to migrate or even redesign applications. Several approaches toward parallelism, such as message passing and data-parallel programming, are possible. The paper will focus on these and will discuss strategies to use workstation clusters for numerical simulation even for more sophisticated geometrical and numerical models. One major basic problem in the SFB is the calculation of internal and external flows in aerospace. This requires the solution of the Navier-Stokes equations. Two Navier-Stokes solvers are thus being parallelized in the project. In both cases a domain decomposition approach was chosen. In another sub-project the calculation of optimal flight paths was performed. The computational costs in the optimization are mainly dominated by the computation of gradients in the optimization algorithm. In a first step this part of the program was parallelized.
2 The PARIS-cluster:

In a joint project of IBM, DLR, RUS and the Special Research Project on "High Temperature Problems of Re-Usable Space Transportation Systems" at the University of Stuttgart, a cluster of 6 IBM RS/6000-550 workstations (called PARIS, short for PArallel RISc) was installed. Each of the machines provides a peak performance of 86 Million Floating Point Operations per Second (MFLOPS), has 128 MByte of main memory and 1 GByte of disk space. Only one of the machines allows user login, while the others may be used only as additional nodes in a dedicated environment. This provides the users with a single-system image and thus makes their work easier.
Figure 1: Schematic view of the workstation cluster

All of the machines in the cluster are connected via both standard Ethernet and FDDI. Additionally, three of them are connected via SOCC (Serial Optical Channel Converter), a special high-speed point-to-point interconnect from IBM. Figure 1 shows a schematic view of the cluster.
3 Tools and Performance:

As the main parallelization paradigm, message passing was chosen. The reason is its portability to a very broad range of different hardware platforms. Furthermore, other parallel programming paradigms, such as shared virtual memory and the data-parallel model, are consumer-driven models and therefore place very hard requirements on communication speed and latency. Message passing, however, is a producer-driven model and therefore allows a skilled user to overlap communication and computation to an arbitrary extent. In this way good results can already be achieved on very low-cost hardware such as the cluster described here. The users were provided with several tools for code development in a parallel computing environment.
3.1 Preprocessing:
A prerequisite of any parallelization project is a good understanding of the problem and of the code under consideration. Preprocessing of a code that is to be parallelized means that a detailed analysis of both the data flow and the control flow of the program must be done.
In both cases dependencies may occur that prevent parallelization. Such dependencies may be found in the control flow when several tasks depend on each other and thus cannot run in parallel. In most cases this is due to data dependencies, where the calculation of step A needs data that are calculated in another step B. For analysing Fortran 77 codes, Forge 90 was used [3]. It allows for a global view of both control and data flow and thus enables the user to find bottlenecks and dependencies in the parallelization process easily.
3.2 Message-Passing:
PVM3 (Parallel Virtual Machine) from Oak Ridge National Laboratory [1] has become a de facto standard for message passing during the last two years. It is easy to understand and easy to handle for the user and mainly provides two functionalities: the handling of processes and the exchange of messages. In addition to standard PVM3, a specially designed version from IBM was used (called PVM/6000) that takes full advantage of the high-speed interconnect networks of the cluster [2]. Thus, PVM/6000 yields a much better performance, as we will see later.
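The two functionalities can be illustrated by a minimal PVM3 master program in C that spawns a number of worker tasks, sends each of them a packed message and collects the results. The task name "worker", the message tags and the data sizes are illustrative assumptions, not taken from the codes discussed in this paper.

#include <stdio.h>
#include "pvm3.h"

#define NWORKERS   4     /* illustrative number of worker tasks */
#define TAG_WORK   1     /* illustrative message tags           */
#define TAG_RESULT 2

int main(void)
{
    int tids[NWORKERS];
    double chunk[100], result;
    int i, started;

    printf("master task id: %x\n", pvm_mytid());   /* enroll in the virtual machine */

    /* process handling: spawn the worker tasks somewhere in the virtual machine */
    started = pvm_spawn("worker", (char **)0, PvmTaskDefault, "", NWORKERS, tids);
    if (started < NWORKERS)
        fprintf(stderr, "only %d of %d workers started\n", started, NWORKERS);

    for (i = 0; i < 100; i++)
        chunk[i] = (double) i;                      /* some data to distribute */

    /* message exchange: pack and send one chunk of data to each worker */
    for (i = 0; i < started; i++) {
        pvm_initsend(PvmDataDefault);
        pvm_pkdouble(chunk, 100, 1);
        pvm_send(tids[i], TAG_WORK);
    }

    /* receive one result from each worker, in whatever order they arrive */
    for (i = 0; i < started; i++) {
        pvm_recv(-1, TAG_RESULT);                   /* -1: accept from any task */
        pvm_upkdouble(&result, 1, 1);
    }

    pvm_exit();                                     /* leave the virtual machine */
    return 0;
}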
3.3 Performance Analysis:
To evaluate the performance of a parallel program, a special tool called ParaGraph was used. ParaGraph is a graphical display system for visualizing the behaviour and performance of parallel programs on message-passing multicomputer architectures [4]. It provides the user with a visual animation of the behaviour of the parallel program and thus depicts the computation and communication phases. Idle times and bottlenecks can be found easily, and improvement of the program can be based on these data.
3.4 Performance:
The performance of parallel codes using the message-passing paradigm is mainly determined by the latency of a message-passing operation and the bandwidth of the network system [5].

Latency: the amount of time it takes to prepare and set up a message. This overhead is caused by initialization processes and software overhead. It is assumed here to be constant and is measured in microseconds.

Bandwidth: the amount of data transferred via the network in a given time, normally given in MByte/s.
We therefore measured both latency and bandwidth for the message-passing libraries used on the different networks. Table 1 shows latencies in microseconds measured for three different networks (Ethernet, FDDI, SOCC) using two different message-passing systems (PVM3, PVM/6000). For standard PVM3, latencies are nearly the same for each network and are rather high: in the range of 1700 to 1950 microseconds. PVM/6000 is able to reduce latency by a factor of about 3 to 4, but the values measured are still rather high.

Network     MP System    Latency (microseconds)
Ethernet    PVM3         1920
FDDI        PVM3         1850
FDDI        PVM/6000      960
SOCC        PVM3         1700
SOCC        PVM/6000      535

Table 1: Measured latency of different message-passing systems on different networks

The bandwidth that can be achieved when sending a message is much smaller than the theoretical peak bandwidth due to software overhead. Figure 2 shows the measured bandwidth for both standard PVM3 and the specially tuned PVM/6000. For the Ethernet network (ETH3.DAT) using PVM3, the results look quite satisfactory: a bandwidth of about 1 MB/s is reached, which is about 80 % of the theoretical peak bandwidth (1.25 MB/s). On the FDDI network, the measured bandwidth lags further behind the theoretical values. Using PVM3 (FDDI3.DAT) the user ends up with a measured bandwidth of about 2.25 MB/s, which is only about 18 % of what is theoretically possible (12.5 MB/s). The same network yields much better results when the tuned message-passing library PVM/6000 is used (FDDI6.DAT): a peak measured rate of about 5 MB/s corresponds to 40 % of the theoretical peak rate. This increase in bandwidth is due to the reduced software overhead of the specially tuned library. The same effect, even more dramatic, can be seen for the SOCC network. While standard PVM3 ends up at about 2.4 MB/s, providing only 8.5 % of the peak rate (SOCC3.DAT), PVM/6000 reaches 8.3 MB/s, which represents about 30 % of the peak rate (28.125 MB/s).

Figure 2: Measured bandwidth (in MB/s) versus message size (in bytes) of different message-passing systems on different networks (curves ETH3.DAT, FDDI3.DAT, SOCC3.DAT, FDDI6.DAT, SOCC6.DAT)
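Latency and bandwidth figures such as those in Table 1 and Figure 2 are typically obtained with a simple ping-pong test: a message of a given size is bounced between two processes many times, and half of the average round-trip time gives the one-way transfer time. The following is a minimal sketch of the master side of such a test with PVM3 in C; the companion task "echo" (assumed to simply return every message), the repetition count and the message sizes are illustrative assumptions and not the benchmark actually used for the measurements above.

#include <stdio.h>
#include <sys/time.h>
#include "pvm3.h"

#define REPS 100                       /* illustrative repetition count */

static double wallclock(void)          /* wall-clock time in seconds */
{
    struct timeval tv;
    gettimeofday(&tv, (void *)0);
    return tv.tv_sec + 1.0e-6 * tv.tv_usec;
}

int main(void)
{
    static char buf[32768];            /* messages of up to 32 KByte */
    int sizes[4] = { 0, 1024, 8192, 32768 };
    double t0, t;
    int tid, i, n;

    pvm_mytid();                       /* enroll in the virtual machine */
    /* "echo" is an assumed companion task that returns every message it receives */
    pvm_spawn("echo", (char **)0, PvmTaskDefault, "", 1, &tid);

    for (n = 0; n < 4; n++) {
        t0 = wallclock();
        for (i = 0; i < REPS; i++) {
            pvm_initsend(PvmDataDefault);
            pvm_pkbyte(buf, sizes[n], 1);
            pvm_send(tid, 1);          /* ping */
            pvm_recv(tid, 2);          /* pong */
            pvm_upkbyte(buf, sizes[n], 1);
        }
        t = (wallclock() - t0) / (2.0 * REPS);   /* one-way time per message */
        printf("%6d bytes: %8.1f microseconds, %6.2f MB/s\n",
               sizes[n], t * 1.0e6, sizes[n] / t / 1.0e6);
    }
    pvm_exit();
    return 0;
}

For a zero-length message the measured time is essentially the latency; for large messages the MB/s figure approaches the sustained bandwidth of the network and library.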
4 Strategies of parallelization:

Strategies of parallelization have to take into account the mathematical model and software used as well as the hardware on which the application is intended to run.
4.1 General Comments:
Taking into account the results of the performance tests affects the strategies chosen for parallelizing the codes. For the description of these strategies some basic definitions are used:
Granularity: the number of subproblems into which a problem is split. A fine-grained problem consists of a large number of small tasks; a coarse-grained problem consists of a small number of large tasks.

Ratio of computation to communication: the ratio of the time spent in computation to the time spent in communication.
4.2 Remark on the PARIS-cluster:
As we have seen above, a higher amount of communication results in more time spent in the communication system. Since latencies are in the range of several hundred microseconds, the intensive exchange of messages may add up to a considerable amount of time spent in communication. And since the tools available do not support an overlap of communication and computation, this yields a long idle time for each process involved in the data exchange. So whatever strategy is chosen has to fulfill two important criteria:

Reduce the amount of communication as much as possible, since each message that has to be sent yields a long idle time for the processor in use.

Put as much work as possible on one processor, which means working only with coarse-grained problems.

If those two constraints are taken into consideration, an excellent ratio of computation to communication can be achieved, which will lead to an efficient use of the workstation cluster.
4.3 Basic Strategies:
Looking at the applications used in flight sciences and space research, we find two approaches to parallelization.

Use of inherent parallelism: many of the problems under consideration show some inherent parallelism, which means that already in the sequential code there are tasks that can be processed in parallel without affecting the program itself. An example is an optimization process where several independent optimizations have to be performed for different parameters. This can be done in parallel without a significant change of the code.

Data decomposition: many of the applications in use work on a large grid, which corresponds to a large array of data. Normally, a large amount of work has to be done for each cell of the grid. An easy way to fulfill the criteria above is to simply split the grid (or domain) into a number of subdomains and solve the problem on each of those subdomains. An interchange of information between processors is needed only when data have to be exchanged between neighbouring domains (see the sketch below).

Both approaches were used in some of the projects of the SFB. They will be discussed in more detail below, together with the projects themselves.
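As an illustration of the data decomposition approach, the sketch below splits a one-dimensional grid into equal subdomains and exchanges one ghost cell with each neighbour in every update step. It is written in C with MPI, which is mentioned later in this paper, rather than with the PVM3 library actually used on the cluster; the local grid size, the number of steps and the averaging stencil are illustrative assumptions.

#include <mpi.h>

#define NLOCAL 1000                 /* illustrative local grid size */
#define NSTEPS 100

int main(int argc, char **argv)
{
    double u[NLOCAL + 2];           /* local cells plus two ghost cells */
    double unew[NLOCAL + 2];
    int rank, size, left, right, step, i;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
    right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

    for (i = 0; i < NLOCAL + 2; i++)
        u[i] = (double) rank;       /* some initial data */

    for (step = 0; step < NSTEPS; step++) {
        /* exchange ghost cells with both neighbours */
        MPI_Sendrecv(&u[1],          1, MPI_DOUBLE, left,  0,
                     &u[NLOCAL + 1], 1, MPI_DOUBLE, right, 0,
                     MPI_COMM_WORLD, &status);
        MPI_Sendrecv(&u[NLOCAL],     1, MPI_DOUBLE, right, 1,
                     &u[0],          1, MPI_DOUBLE, left,  1,
                     MPI_COMM_WORLD, &status);

        /* update the subdomain with a simple averaging stencil
           (physical boundary conditions are omitted in this sketch) */
        for (i = 1; i <= NLOCAL; i++)
            unew[i] = 0.5 * (u[i - 1] + u[i + 1]);
        for (i = 1; i <= NLOCAL; i++)
            u[i] = unew[i];
    }

    MPI_Finalize();
    return 0;
}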
4.4 Long Range Considerations:
Strategies as shown above apply mainly to codes based on standard methods which are known to be robust. Within this project, as well as in the framework of further projects in the field of flight sciences and space research, new developments have come about that impose new challenges on the user of parallel systems, challenges that can only partially be met by the approaches described above. Due to increased performance and memory availability, more flexible and sophisticated models are becoming common practice.

On the one hand, the availability of much larger main memory (often because the scaling of parallel systems adds up to much more memory than has been available on single-processor machines) allows for more complex geometric modelling. This results in more complex grids that do not allow the use of a simple domain decomposition strategy. For structured codes such complex geometries may cause problems in evenly partitioning the grid. Rather, standard partitioning methods based on the multi-block approach will end up with an imbalanced load distributed across the cluster of workstations. However, this problem may be overcome to a certain extent by specifying a finer partitioning of the grid into many more blocks than there are workstations available. Such a fine-granular partitioning will allow for a distribution of blocks, and thus workload, that comes close to the optimum.

On the other hand, new numerical methods have emerged in the field of flight sciences and space research. From the point of view of parallel systems, the critical methods are mainly those that add complexity to the overall algorithm in terms of dynamic changes of the compute load.
Among those algorithms, multigrid algorithms constitute a challenge when they have to be parallelized. Such algorithms have been shown to be latency-bound [5], mainly because the amount of work goes down tremendously on the coarser grids, yielding an unfavourable ratio of computation to communication. As has been shown above, latencies on a workstation cluster are in the range of several hundred microseconds. For codes such as those described here, this may impose a strong limitation on the achievable speedup. Strategies have been developed to overcome such problems that may be summarized by the term "latency hiding": the main idea is to overlap communication and computation in the code, thus keeping the processor from sitting idle while messages are being sent (a sketch of this pattern is given at the end of this subsection). When message-passing strategies are used, as is appropriate for clusters of workstations, the message-passing library has to provide the functionality of handling communication and computation simultaneously. This requires that sends and receives may proceed asynchronously. Furthermore, the correct handling of the buffers used in the message exchange has to be ensured. Such asynchronous communication has so far been provided only by the native message-passing systems of massively parallel machines and has not been available for workstation clusters. However, with the emergence of the industrial standard message-passing interface MPI (Message Passing Interface), this feature has become widely available for those users too [7].

Another point to be mentioned here is the problem of network bandwidth. As we have already seen, this is a problem of both the hardware and the software used. For the PARIS-cluster it was shown that specifically designed software on high-performance networking hardware may significantly improve the performance even of complex multigrid codes [6]. Summarizing, one may say that latency hiding, together with improved message-passing software in a cluster of workstations connected via standard high-performance networks such as FDDI or ATM, may help to overcome problems that are known to be latency-bound.

An even more challenging problem for the software engineer, when it comes to parallelizing a code, is that of handling adaptive grid refinement. Such refinement may be a regular one that only slightly alters the grid database. However, if self-adaptation leads to unpredictable changes of the database, the computing load changes as well, and this change is reflected in the underlying grid database. In addition, the problem may become even worse when unstructured grids are used. Strategies must be found to overcome this problem without a significant loss of performance. The typical issues of handling such a problem, and an approach to address them, are described briefly here [8].

After dynamic changes of the grid database are made, the whole database has to be redistributed across the processor address spaces in a balancing step, while taking the initial distribution into account. This is necessary to maintain good load balancing during the entire solution process. Several problems occur in implementing this balancing step, both technical and strategic. The technical problems stem from the representation of the grid by object structures and links (i.e. pointers) between them. As there is no possibility for an efficient implementation of global pointer concepts, complex globalize/localize procedures have to be implemented in order to ensure validity across address space boundaries.
Additionally, the communication aspects of the transfer procedure have to be optimized for latency hiding, minimization of the number of messages, and software overhead. As for the strategic problem, distributions must be found which are nearly optimal in terms of the load per processor. Important strategic constraints are: a minimal number of transferred objects, minimal inter-processor boundaries, a low runtime of the distribution strategy, and an implementation which scales with processor number and problem size. The use of an additional balancing step introduces a point of synchronization, and its feasibility will depend on the cost of redistribution compared to that of calculation. For clusters of workstations, this will require an approach where a balancing step is used only when a considerable amount of time has been spent on calculation. The development of future networking hardware will not fully overcome this problem, but for problems exhibiting a low degree of dynamics the use of workstation clusters will still make a lot of sense.
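A minimal sketch of the latency-hiding pattern discussed in this subsection is given below, using the non-blocking operations of MPI: the ghost-cell exchange is started, the interior cells (which do not depend on the neighbours' data) are updated while the messages are in transit, and only then are the boundary cells treated. The subdomain layout follows the decomposition sketch of Section 4.3; the array size and stencil are illustrative assumptions, not taken from any of the project codes.

#include <mpi.h>

#define NLOCAL 1000                 /* illustrative local grid size */

/* one relaxation step with overlapped communication and computation;
   left/right are the neighbour ranks (or MPI_PROC_NULL at the ends);
   MPI_Init is assumed to have been called elsewhere */
void step_overlapped(double u[NLOCAL + 2], double unew[NLOCAL + 2],
                     int left, int right)
{
    MPI_Request req[4];
    MPI_Status  st[4];
    int i;

    /* start the ghost-cell exchange without waiting for completion */
    MPI_Irecv(&u[0],          1, MPI_DOUBLE, left,  1, MPI_COMM_WORLD, &req[0]);
    MPI_Irecv(&u[NLOCAL + 1], 1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[1]);
    MPI_Isend(&u[1],          1, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[2]);
    MPI_Isend(&u[NLOCAL],     1, MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &req[3]);

    /* update interior cells while the messages travel through the network */
    for (i = 2; i <= NLOCAL - 1; i++)
        unew[i] = 0.5 * (u[i - 1] + u[i + 1]);

    /* wait until the ghost cells have arrived, then update the boundary cells */
    MPI_Waitall(4, req, st);
    unew[1]      = 0.5 * (u[0] + u[2]);
    unew[NLOCAL] = 0.5 * (u[NLOCAL - 1] + u[NLOCAL + 1]);
}

Whether the overlap actually hides the latency depends on the MPI implementation and the network adapter; on a workstation cluster the benefit is largest when the interior update takes at least as long as the message transfer.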
5 Results:

For a better understanding of the results given below, two notions have to be introduced that are normally used to describe the results of parallelization.
Speedup describes how much the computation has been accelerated by parallelization:

Sp_{pr} = \frac{t^{1}_{CPU}}{t^{pr}_{CPU}}     (1)

where Sp_{pr} is the speedup for pr processors and t^{n}_{CPU} is the CPU time spent on n processors.

Efficiency describes how efficiently a parallel system is used:

Eff_{pr} = \frac{Sp_{pr}}{pr}, \qquad 0 \le Eff_{pr} \le 1     (2)

where Eff_{pr} is the efficiency on pr processors. These two figures give an estimate of how well a program has been parallelized.
5.1 URANUS:
The two-dimensional Navier-Stokes solver URANUS (Upwind Relaxation Algorithm for the Navier-Stokes equations of the University of Stuttgart) has been newly developed for the simulation of non-equilibrium flows around re-entry vehicles in a wide altitude-velocity range [9]. The unsteady, compressible Navier-Stokes equations are discretized using a cell-centered finite-volume approach.
Figure 3: Modified domain decomposition to ensure balanced computing

The method results in a block-pentadiagonal linear system of equations that is iteratively solved by the Jacobi line relaxation method, where an underlying subiteration scheme is used to minimize the inversion error. A simple preconditioning technique is used which improves the condition of the linear system and simplifies the LU-decomposition of the systems to be solved in each line relaxation step.

A short analysis of the code showed that the explicit parts exhibit some inherent parallelism that can easily be exploited in a first parallelization attempt. This required a domain decomposition strategy for the grid in use. The grid corresponds to a two-dimensional Fortran array of constant size allocated by the program. Thus, the decomposition of the grid corresponds in a natural way to a decomposition of the array. The decomposition is then mapped onto the parallel machine. The approach is based on the assumption that the grid in use is uniform and will not be refined or changed in any other way during the calculation process. This may become an important restriction when trying to apply modern numerical techniques to the problem solved. For the application under consideration another problem appears: since the grid size does not always fit the array size, an imbalance of work may occur. For the problem presented, Figure 3 shows the final solution of the domain decomposition algorithm. The mapping chosen can easily be realized and ensures a nearly perfect load balance. The problem that remains is the handling of boundary conditions: each processor working on a block that includes boundaries has to do some additional work. The result of this slight imbalance can be seen in Figure 4, where speedup results for only the explicit part of the code are shown. For a reasonably large problem the parallelization yields an efficiency of nearly 90 %.
Figure 4: Speedup results for the explicit part of URANUS.
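The mapping used for URANUS corresponds to cutting a two-dimensional Fortran array into rectangular blocks of nearly equal size. The following is a minimal sketch, in C rather than Fortran, of how such block index ranges can be computed; the grid dimensions and the 3 x 2 process arrangement are illustrative assumptions, and the sketch ignores the additional work on boundary blocks discussed above.

#include <stdio.h>

/* compute the half-open local range [lo, hi) that process p out of np
   owns along one grid direction of n cells; the remainder cells are
   spread over the first processes so blocks differ by at most one cell */
static void block_range(int n, int np, int p, int *lo, int *hi)
{
    int base = n / np, rest = n % np;
    *lo = p * base + (p < rest ? p : rest);
    *hi = *lo + base + (p < rest ? 1 : 0);
}

int main(void)
{
    int ni = 257, nj = 129;          /* illustrative grid dimensions    */
    int pi = 3,  pj = 2;             /* illustrative 3 x 2 process grid */
    int p, q, ilo, ihi, jlo, jhi;

    for (q = 0; q < pj; q++)
        for (p = 0; p < pi; p++) {
            block_range(ni, pi, p, &ilo, &ihi);
            block_range(nj, pj, q, &jlo, &jhi);
            printf("block (%d,%d): i = %d..%d, j = %d..%d  (%d cells)\n",
                   p, q, ilo, ihi - 1, jlo, jhi - 1,
                   (ihi - ilo) * (jhi - jlo));
        }
    return 0;
}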
5.2 IRS-SINA:
In project A1, a code system called IRS-SINA (Sequential-Iterative Non-equilibrium Algorithm) is being developed to compute plasma wind tunnel flows [11, 12, 13]. The code system is divided into three parts: a solver for the flow field, a solver for the species and vibrational energy conservation equations, and an electron temperature solver. The parts of the program system are solved in a sequential-iterative manner until convergence is reached. For the parallelization of the entire code system, each of these program parts needs to be parallelized.

For the flow field solver, a strongly modified Navier-Stokes solver (NSFLEX) from DASA is used, originally parallelized by Michl [14]. Basically the same structure is used, with a few modifications, for the flow field solver of IRS-SINA. The sequential NSFLEX is able to work on block-structured grids, where the blocks are computed one after another. Based on this structure a domain decomposition is done. For each block a 1-D communication table is created. This table contains all information necessary for the communication, such as the neighbouring blocks and the block correlations. Using such a table allows a simple and safe control of all messages that are to be sent or received by each process. At the beginning of a new timestep the messages for each block are packed and sent to the applicable process. Subsequently, the messages from the neighbouring blocks are received. The other parts of the solver first had to be modified and rearranged in order to be treated on a multiblock grid. Then the second 'parallel' module of NSFLEX, as described above, was modified and adapted in a suitable way to work with the data of the species and vibrational energy conservation equation and electron temperature solvers.

A first comparison between the sequential and the parallel code was performed. For each run 12000 iterations were made, and periodically (every 100 iterations) the data were written to an output file on another machine. Thus, there is a time loss for making the connections and transferring the data via NFS over the local network to the other machine. The sequential run needed 3.859 s per iteration; for the parallel run on 3 blocks and, therefore, 3 machines, the time per iteration was reduced to 1.4308 s. The ratio of the times per iteration is thus 2.7, which is close to the theoretical value of 3. The difference is probably caused by the large amount of time spent on the data output via NFS. This first step was meant to prove the applicability of the code system on a parallel machine. The performance of the program has proven to be good. There will be further investigations to improve the code and to extend it to new boundary conditions, which will allow calculations with overlapping grids with irregular common boundaries.
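The per-block communication table described above can be pictured as a small record per block face. The following C declarations are a hypothetical sketch and only indicate the kind of information (neighbouring block, owning process, message tags, index correlation) such a table has to carry; they are not the actual data structure of IRS-SINA.

/* hypothetical entry of a 1-D communication table:
   one entry per block face that touches a neighbouring block */
typedef struct {
    int neighbour_block;   /* global id of the adjacent block            */
    int neighbour_proc;    /* process (PVM task id) that owns that block */
    int send_tag;          /* message tag used when sending to it        */
    int recv_tag;          /* message tag expected from it               */
    int first, last;       /* index range of the common face             */
    int orientation;       /* how the two face index ranges correlate    */
} comm_entry;

typedef struct {
    int nentries;          /* number of neighbouring faces of this block */
    comm_entry *entry;     /* the table itself                           */
} comm_table;

At the beginning of a timestep the solver would loop over the entries of such a table, pack the face data and send them with send_tag, and afterwards receive the matching messages with recv_tag.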
# line search iterations     1       2       4       5       7
Sp_6                         3.336   2.923   2.525   2.321   2.106
Eff_6                        0.56    0.49    0.42    0.39    0.35

Table 2: Measured speedup of the phase 1 optimization for different numbers of line search iteration steps
5.3 Flight Performance and System Optimization:
The prediction of the flight performance capabilities of a proposed space transportation system requires the optimization of trajectories, including both the ascent flight leg and the re-entry part of a reusable vehicle, and the determination of an optimal system design. Optimization of the whole system is a time-consuming task. Therefore, a two-level decomposition approach has been developed to solve both the mission and the system optimization task simultaneously [10].

The approach chosen in the project was a decomposition of the problem by splitting the trajectory to be optimized into subarcs which can be optimized independently on a subproblem level. For this a nonlinear programming optimization algorithm is used, consisting mainly of the calculation of a search direction and an iterative line search. A second-level controller is then used to optimize the entire mission using the partial results of the first step. For a more detailed description of the algorithm see [10]. The most time-consuming part of this algorithm was the calculation of gradients. Since this is nearly data-independent, the gradient calculation was split up to be done in parallel. Following this approach requires only two small communication steps for data exchange. During the parallelization process two main problems were encountered:

Load imbalance: the different trajectory segments considered on the subproblem optimization level depend on different numbers of optimization parameters. Thus, each of the trajectory optimization problems requires a different number of gradients to be evaluated, which may cause processors to stay idle.

Sequential parts: while the calculation of gradients has been fully parallelized, the iterative line search is so far calculated sequentially. Normally this part of the calculation requires only a small amount of time. However, there may be situations where the number of line search steps per iteration increases and thus the time spent in sequential calculations becomes larger.

Table 2 summarizes results measured for the first subproblem optimization phase performed on 6 processors of the cluster. With an increasing number of line search steps, the speedup decreases and so does the efficiency. The relatively small efficiency of 56 %, even for only a small sequential part (1 line search step), is due to the imbalance mentioned above. This effect is further increased when the sequential part becomes more important (7 line search steps); the efficiency for this case is only 35 %. Further steps will therefore be taken to parallelize the line search procedure as well. To solve the problem of imbalance a larger number of processors would be required. This could be achieved by adding some workstations to the cluster or by porting the code to a massively parallel machine.
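Since the gradient components are nearly independent of each other, their evaluation can be distributed over the available processes and collected in one communication step. The following is a minimal sketch of this pattern using MPI and a placeholder objective function; the number of parameters, the cyclic distribution of components and the finite-difference step size are illustrative assumptions and do not reproduce the optimization code of the project.

#include <mpi.h>
#include <stdio.h>

#define N 12                          /* illustrative number of parameters */

/* placeholder for the expensive objective (e.g. a trajectory simulation) */
static double objective(const double x[N])
{
    double s = 0.0;
    int i;
    for (i = 0; i < N; i++)
        s += x[i] * x[i];
    return s;
}

int main(int argc, char **argv)
{
    double x[N], grad[N] = { 0.0 }, mygrad[N] = { 0.0 };
    double h = 1.0e-6, f0;
    int rank, size, i;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (i = 0; i < N; i++)
        x[i] = 1.0;                   /* current iterate (illustrative) */
    f0 = objective(x);

    /* each process evaluates a cyclic subset of the gradient components
       by one-sided finite differences */
    for (i = rank; i < N; i += size) {
        double xi = x[i];
        x[i] = xi + h;
        mygrad[i] = (objective(x) - f0) / h;
        x[i] = xi;
    }

    /* collect the complete gradient on every process */
    MPI_Allreduce(mygrad, grad, N, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        for (i = 0; i < N; i++)
            printf("dF/dx[%d] = %f\n", i, grad[i]);

    MPI_Finalize();
    return 0;
}

The load imbalance mentioned above appears in this picture whenever N is not a multiple of the number of processes, or when the individual objective evaluations take different amounts of time.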
6 Summary and Conclusions:

We have presented the workstation cluster project PARIS and have shown how parallel programming can be done on such a system. Furthermore, we have described programming strategies that allow one to exploit parallel architectures like the one shown. The results presented show that parallel computation is a means to reduce the turn-around times of compute-intensive jobs. These results can be extrapolated to other parallel systems. Taking into account the relatively low bandwidth and high latency of the networks used, compared to those of a modern massively parallel system, we can summarize by saying that a large share of the applications in space research and flight sciences will profit to a high extent from parallel computing.
References

[1] Geist A., et al.: PVM 3 User's Guide and Reference Manual. ORNL/TM-12187, Oak Ridge, 1994.

[2] Bernaschi M., Caliari G., Richelli G.: PVM/6000 User's Guide and Subroutine Reference, Release 2.0, February 21, 1994.

[3] Applied Parallel Research: Forge 90 Baseline System User's Guide. Placerville, 1993.

[4] Heath M.T., Finger J.E.: ParaGraph: A Tool for Visualizing Performance of Parallel Programs. Technical Report ORNL/TM-11813, Oak Ridge National Laboratory, Oak Ridge, 1991.

[5] Resch M., Geiger A., Zikeli J.: Message-Passing Systems on Workstation Clusters and Parallel Computers - The Impact of Software and Network Architectures on Applications. In: Gentzsch W., Harms U. (eds.) High Performance Computing and Networking, Vol. 2, pp. 260-266. Berlin, Heidelberg: Springer, 1994.

[6] Resch M., Geiger A., Sang M.: Developing PVM-Code on Various Hardware Platforms: Portability and Performance. First European PVM User Group Meeting, October 10-11, 1994, Rome, Italy.

[7] Gropp W., Lusk E., Skjellum A.: Using MPI: Portable Parallel Programming with the Message-Passing Interface. The MIT Press, Cambridge, Massachusetts, 1994.

[8] Birken K.: An Efficient Programming Model for Parallel and Adaptive CFD Algorithms. Proceedings of the Parallel CFD Conference 1994, Kyoto, Japan, 1994.

[9] Schöll E., Frühauf H.-H., Feng Z.: Calculation of Turbulent Internal Flows with a New Three-Dimensional Implicit Upwind Navier-Stokes Solver. In: Wagner S., Hirschel E.H., Periaux J., Piva R. (eds.) Computational Fluid Dynamics '94, pp. 933-940. Chichester: John Wiley & Sons, 1994.

[10] Rahn M., Schöttle U.M.: Decomposition Algorithm for Flight Performance and System Optimization of an Airbreathing Launch Vehicle. 45th Congress of the International Astronautical Federation, October 9-14, 1994, Jerusalem, Israel.

[11] Gogel T.H.: Numerische Modellierung von Hochenthalpieströmungen mit Strahlungsverlusten. Dissertation, Universität Stuttgart, 1994.

[12] Grau T., Gogel T.H., Sleziona C., Messerschmid E.W.: Development of a CFD Code for the Computation of Plasma Wind Tunnel Flows. AIAA 93-5024, AIAA/DGLR Fifth Aerospace Planes and Hypersonic Technology Conference, Munich, 1993.

[13] Messerschmid E.W., Fasoulas S., Gogel T.H., Grau T.: Numerical Modeling of Plasma Wind Tunnel Flows. ZFW, this issue.

[14] Michl T., Bode A., Lenke M., Wagner S.: Implicit Euler Solver on Alliant FX/2800 and Intel iPSC/860 Multiprocessors. In: Flow Simulation with High Performance Computers I, DFG Priority Research Programme, Vieweg, Braunschweig, 1993.