Experiences in Parallelising an Aeronautics Code on the KSR1

R. W. Ford
Centre for Novel Computing, Department of Computer Science, Manchester University, Oxford Road, Manchester, M13 9PL, U.K.
Abstract

Virtual Shared Memory (VSM) has been proposed as the solution to scalable shared memory parallel architectures. This paper reports on parallelising a scientific code from aeronautical engineering on a VSM machine, the KSR1. The code predicts the laminar to turbulent transition point of flow over an aerofoil. The experiences of initial porting and successive optimisation to improve efficiency on a KSR1 are discussed in detail and performance results are presented. Comparisons are made with a traditional shared memory Alliant FX/2808. Performance results indicate the suitability of parallel architectures for this class of application.
1: Introduction

Laminar Flow Control (LFC) [1] is steadily becoming accepted as an important industrial tool for increasing the fuel efficiency of aircraft. In January 1993 a major UK aero engine company test flew a nacelle (engine casing) whose shape had been determined by the application of LFC. LFC achieved laminar flow over 60% of the nacelle, giving a 3% saving in specific fuel consumption over the previous shape. To put this saving into perspective, it has been estimated that for a typical aircraft a 3% reduction in specific fuel consumption will save a company $300,000 per year [2], and it has been noted that a 4% difference in specific fuel consumption will enable an aircraft to corner the market [3]. LFC has even greater impact in reducing skin friction for submarines and long distance pipelines, where skin friction constitutes
80-85% and 100% of total drag, respectively, compared with 50% for aircraft [4]. One way to achieve a reduction in skin friction is to delay the onset of turbulence. In this paper, Boundary Layer Stability theory is used to model the turbulent transition point and the features causing it. Flow properties such as velocity and pressure are split into steady components and small oscillatory disturbance terms. The latter can be shown to satisfy the Orr-Sommerfeld equation [5][6]. In this study, flow over an aerofoil is modelled by a 2-D incompressible fluid. The corresponding Orr-Sommerfeld equation reduces to a 4th order boundary value problem (BVP), see Figure 1. When the amplification rate of the instability wave solution of the Orr-Sommerfeld equation (the N-factor) exceeds a critical value, turbulence occurs.

Figure 1: Air velocity profile over an aerofoil (freestream, boundary layer and aerofoil labelled).
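For reference, the Orr-Sommerfeld equation referred to above is not reproduced in the original; stated here in its standard incompressible form following the literature (e.g. [5]), for disturbance amplitude $\varphi(y)$, base velocity profile $U(y)$, wave speed $c$, streamwise wavenumber $\alpha$ and Reynolds number $Re$:

\[ (U - c)\left(\varphi'' - \alpha^{2}\varphi\right) - U''\varphi \;=\; \frac{1}{i\alpha Re}\left(\varphi'''' - 2\alpha^{2}\varphi'' + \alpha^{4}\varphi\right) \]

The fourth derivative $\varphi''''$ is what makes this a 4th order BVP.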
Starting with an initial approximation, the N-factor is computed by the combination of a shooting method and a Newton-Raphson search technique. A judicious choice of the initial approximation guarantees convergence of the iterative method. The user gives an initial approximate solution for the lowest frequency at the position (station) closest to the front of the aerofoil; the algorithm then automatically generates approximations for each subsequent frequency specified at that station. The solution for the lowest frequency is then used as an initial guess for the next station. The method is therefore self-propagating, see figure 2.
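As a concrete illustration, a minimal sketch of this self-propagating search is given below. The routine names (GUESS0, APPROX, NEWTON) and array shapes are hypothetical, introduced only to show the guess-propagation structure; they are not the routines of the actual code.

      PROGRAM SRCH
c     sketch only: illustrative names, not the authors' code
      INTEGER NSTAT, NFREQ
      PARAMETER (NSTAT=100, NFREQ=4)
      DOUBLE PRECISION SOL(NFREQ), EN(NSTAT,NFREQ)
      INTEGER IS, JF
c     user-supplied approximation: lowest frequency, first station
      CALL GUESS0(SOL(1))
      DO 20 IS = 1, NSTAT
        DO 10 JF = 1, NFREQ
c         seed each higher frequency from the one below it
          IF (JF .GT. 1) CALL APPROX(SOL(JF-1), SOL(JF))
c         shooting method plus Newton-Raphson refines SOL(JF)
c         and returns the N-factor for this station and frequency
          CALL NEWTON(IS, JF, SOL(JF), EN(IS,JF))
   10   CONTINUE
c       SOL(1) now seeds the lowest frequency at station IS+1
   20 CONTINUE
      END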
Figure 2: Stations along the aerofoil.

To determine where turbulence starts for a given aerofoil, a low resolution (1-4 frequencies and approximately 100 stations) search is made from the start of the aerofoil and stopped when the turbulent features are observed. This first step may be repeated if the frequency range does not give the required information. When the instability features have been located, a high resolution (20-40 frequencies and 150-300 stations) profile is made. For some applications this second step may be repeated to obtain even finer resolution.

2: The KSR1

The scalability of shared memory multiprocessors has traditionally been limited to tens of processors due to memory access contention. As a result it has been widely accepted that distributed memory is the key to scalable parallel machines; these machines have, however, been notoriously difficult to program. The KSR1 is a distributed memory machine that provides a single address space, supported by proprietary hardware [7]; the advantage is that the user sees a shared memory programming model. This technique has been termed virtual shared memory (VSM). The term can cause confusion as the KSR1 also supports virtual memory (VM) with an address space of 2^40 bytes (one million megabytes). Gordon Bell cites the KSR1 architecture as most likely to be "... the blueprint for future scalable, massively parallel computers ..." [8]. Each KSR1 processor is a 20MHz RISC-style superscalar 64-bit unit operating at 20 MIP/s and 40 MFLOP/s (peak). A KSR1 system contains from 8 to 1088 processors with a peak performance range from 320 to 43,520 MFLOP/s. Each processor has 0.5Mbyte of subcache, split equally between instructions and data, and 32Mbyte of cache. It is therefore a non-uniform memory access (NUMA) style memory system, see table 1. In this system instructions and data are not bound to specific physical locations; rather, they migrate to where they are being referenced. The KSR1 is a cache-only memory architecture (COMA).

2.1: ALLCACHE memory

In most of today's distributed memory parallel systems, the job of managing the movement of code and data among memory units belongs to the programmer. ALLCACHE memory provides programmers with a uniform address space for instructions and data, called system virtual address space or SVA space. SVA space is physically mapped to a set of memories arranged as local caches, each capable of storing 32Mbyte; there is one local cache for each processor in the system. When an address is referenced that is not in the local cache, the KSR1's search engine (described later) causes that address and its data to be fetched to the local cache. The address and data will remain in that local cache until the space is required for something else. Consistency of data between caches is maintained by distinguishing the type of reference made: if the data in the location is to be modified, the local cache will receive the one and only instance of an address and its data; if, however, the data is read but not modified, the local cache will receive a copy of the address and its data.
subpage location | cycle latency | data fetched (64-bit units)
subcache         | 2             | 1
local cache      | 18            | 8
SE:0             | 175           | 16
SE:1             | 600           | 16
disk             | 400,000       | 16,384

Table 1: Memory Access Latency
The ALLCACHE memory also supports low level synchronisation through instructions which lock and unlock subpages. These instructions can be used to implement multiprocessor synchronisation functions such as data locks, barriers, critical regions, and condition variables (these are available via the KSR1 compiler, libraries, and OS calls).
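A minimal sketch of a data lock built this way follows. GETSP and RELSP are hypothetical wrapper names standing in for the subpage lock/unlock facilities, which in practice are reached through the compiler and library calls mentioned above; they are not the real KSR1 routine names.

      SUBROUTINE ACCUM(TOTAL, X)
c     sketch only: GETSP/RELSP are illustrative wrappers,
c     not the actual KSR1 library calls
      DOUBLE PRECISION TOTAL, X
c     gain exclusive ownership of the subpage holding TOTAL
      CALL GETSP(TOTAL)
      TOTAL = TOTAL + X
c     release the subpage so other threads may lock it
      CALL RELSP(TOTAL)
      END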
The KSR1 search engine is implemented as a two-level hierarchy of uni-directional rings. Each ring is a sequence of point-to-point connections among a set of units, with the last unit in the set being connected back to the first. Each unit is a combination of a router for request/response packets and a directory. The router can move a packet farther along the ring or send it up or down in the hierarchy. All of the units on all rings can operate simultaneously. The lowest level rings are called Search Engine:0's (or SE:0). Each SE:0 can be configured to contain from 8 to 32 processor/local cache pairs. In the KSR1 the packet passing speed of an SE:0 is 8 million packets per second. SE:1's can be configured to handle 8, 16, or 32 million packets per second. Each packet contains 128 bytes of data (the unit of consistency); hence the SE:0 bandwidth is 1 Gbyte/s and the SE:1 bandwidth ranges from 1 to 4 Gbyte/s.
2.2: Parallel Programming with Pthreads

A pthread is a sequential flow of control within a process (conforming to the IEEE draft POSIX standard P1003.4a), which may cooperate with other pthreads to solve a problem. Pthreads are the underlying mechanism used to execute the parallel constructs available to Fortran programmers. These constructs - parallel regions, parallel sections and tile families - form a high level interface to pthreads. The user inserts parallel region and parallel section directives around appropriate blocks of code; these directives are seen as comments by other compilers. Parallel regions enable threads to concurrently execute the same fragments of code, whilst parallel sections enable threads to concurrently execute separate fragments of code.

Tiling is the name given to the parallelisation of DO loops on the KSR1. There are a number of tiling facilities available to the user, varying from fully automatic to fully manual. The fully automatic route to parallelism involves running code through the automatic parallelising pre-processor, KSR KAP, developed by Kuck and Associates; KAP is the most widely accepted automatic parallelising tool and is implemented on most parallel single address space machines. KAP examines DO loops within the code and, with the aid of dependency analysis and dependency elimination, determines which can be executed in parallel; it also performs scalar optimisation. KAP inserts the relevant KSR1 directives around appropriate DO loops, telling the compiler to parallelise them, and provides output telling the user what it has done and why it has failed to tile certain loops, i.e. it tells the user the dependencies it finds. It has no concept of granularity, leaving decisions such as tile size and strategy to the run time system, PRESTO. KAP ensures correctness, not parallelising if there is any potential dependency; it is, however, necessarily cautious.

Semi-automatic tiling allows the user to influence the KSR1's tiling decisions. This involves inserting a ptile ('please tile') directive before the loop. KAP can then either be run in its normal mode, or with a 'noautotile' option which makes it select only loops carrying a ptile statement. If KAP knows it can parallelise a loop it inserts the relevant KSR1 directives. Assertions are available to tell KAP that certain potential hazards, such as subroutine calls, are safe to execute in parallel. Options can be added as parameters to ptile specifying various efficiency related parameters. Manual tiling is achieved by the user inserting the full tiling specification for a DO loop; the user is then entirely responsible for correctness. As in semi-automatic tiling, the user may insert efficiency related parameters. A sketch of a semi-automatically tiled loop is given below.

The parallel programming environment of the KSR1 is based on the run time system PRESTO. PRESTO makes run-time decisions based on compiler-generated or programmer-specified directives: it dynamically decides the level of resources it will devote to a particular parallel task based on the amount of calculation required and the resources available.
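The sketch below shows what semi-automatic tiling of a simple loop might look like. The directive spelling is illustrative only, reconstructed from the description above rather than taken from the KSR manuals [7], so the exact syntax should be checked there.

c     sketch only: directive spelling is illustrative
c*ksr* ptile (i)
      DO 10 I = 1, N
c       independent iterations: a candidate for tiling
        Y(I) = A*X(I) + Y(I)
   10 CONTINUE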
3: Parallelisation

In this work, the objective was to reduce the run time of low resolution searches to an acceptable interactive level without significant algorithmic changes. The sequential code was written in standard Fortran 77. It compiled and ran on a single cell of the KSR1. The KSR1 compiler has two levels of optimisation: -O1 performs scalar optimisation, while -O2 combines this with loop unrolling. Sequential performance may also be enhanced by running the code through KAP with only scalar optimisation switched on.

Due to the nature of the dependencies and KAP's limited inter procedural analysis capability, no worthwhile parallelism was detected. Finding parallelism in this case requires knowledge of the algorithm on the part of the user (although full inter procedural analysis would detect some of the inherent parallelism in the program). The parallelism in this code is highly nested. Parallelism at the innermost level involves two parallel subroutine calls; a simple dependence, due to a variable being updated in one of the subroutines, was easily removed. These calls are located inside the shooting method integration loop, which is sequential, see figure 3. KAP does not currently parallelise subroutine calls. Parallel sections were manually inserted at the appropriate positions. The next level of parallelism is over an independent loop. KAP failed to find this due to the current limitations of its inter procedural analysis. A simple dependence was eliminated by scalar expansion. At the next level a small change in the algorithm was needed to extract parallelism. This is beyond the capabilities of parallelising compilers and probably always will be (hence the need for interactive parallelisation tools).

Figure 3:
      DO I=1,NN
        CALL INTEG(AZ, .. )
        CALL INTEG(CZ, .. )
        CALL ORTHO(AZ, CZ, .. )
        ...
      ENDDO

At the outermost level there is a loop with an ordered critical region. This gives a parallel pipelining effect. Because of a loop carried dependency, the subsequent iteration cannot proceed until a value (the generated approximate solution) has been written by the previous iteration. Moreover there is I/O within this loop which needs to be ordered so as to maintain the form of the sequential output. KAP cannot cope with ordered critical regions. This was implemented on the KSR1 using a parallel region to control the appropriate number of threads, a lock to limit access to a single thread at a time, and a condition variable to order the threads' access; a sketch is given below. Scalar expansion was also necessary in order to make the generated approximate solution available to the next thread. The parallelism available here is slightly greater than the number of frequencies in the simulation, although further increasing the number of threads can be advantageous due to load imbalance.
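A hedged sketch of this ordered critical region follows. LOCKON, LOCKOFF, CONDWT and CONDSG are hypothetical wrappers over the lock and condition variable facilities, TURN is a shared counter, and the generated approximate solution has been scalar-expanded into the shared array GUESS; none of these names come from the actual code, and declarations and initialisation are omitted.

c     executed inside the parallel region by the thread
c     handling frequency J (sketch only: names illustrative)
      CALL LOCKON(LCK)
c     condition wait releases LCK and re-acquires it on wake-up
   10 IF (TURN .NE. J) THEN
        CALL CONDWT(CV, LCK)
        GOTO 10
      ENDIF
c     ordered work: read the approximation left by thread J-1,
c     write GUESS(J+1) for the next thread, do the ordered I/O
      GUESS(J+1) = GUESS(J)
      TURN = TURN + 1
c     signal so that thread J+1 can re-test TURN and proceed
      CALL CONDSG(CV)
      CALL LOCKOFF(LCK)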
3.1: Performance tuning

When a parallel construct is encountered, PRESTO will create a team of threads, of appropriate size, to execute it collectively. On finishing the parallel construct the team is disbanded. Creating a team manually means that the team is never disbanded, so the overhead of team creation is removed for any subsequent executions of the parallel construct, at the expense of run time control over the number of threads used. If the parallel section construct is used inside a sequential loop (e.g. for each call to INTEG in figure 3), replicating the loop and synchronising via user controlled shared variables may increase efficiency.

A pre-fetch enables a thread to request either ownership of, or a copy of, a subpage before it is needed, reducing memory latency. Currently only 3 pre-fetches are allowed per processor before the processor stalls. There is a small (few cycle) set up overhead in using pre-fetch. Post-store allows a thread which has written to a variable, or set of variables on the same subpage, to broadcast this update to any processor which, before the update, had a valid copy. This can reduce the latency of subsequent reads of the variable and may spread out the communication between cells. There is also a small (few cycle) set up overhead in using post-store.

Subpage alignment is available to eliminate false sharing: independent variables or arrays may lie on the same subpage and cause thrashing. To eliminate this, variables and arrays can be separated onto different subpages, as sketched below.
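A minimal sketch of such a separation, assuming the common block itself starts on a subpage boundary (an assumption; in practice the KSR1's alignment facilities would be used), pads two independently updated flags so that each occupies its own 128-byte subpage (sixteen 64-bit words):

c     sketch only: padding keeps FLAG1 and FLAG2, which are
c     written by different threads, on separate 128-byte subpages
      DOUBLE PRECISION FLAG1, PAD1(15), FLAG2, PAD2(15)
      COMMON /SHARED/ FLAG1, PAD1, FLAG2, PAD2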
4: Results

The computation was performed on a 32 processor KSR1 at the University of Manchester. Single processor performance has increased by approximately 30% over the lifetime of the KSR1, due to improvements in software and a hardware upgrade. Table 2 shows the KSR1 single cell performance for a typical search of 4 frequencies and 40 stations, with varying optimisation. Comparisons are given with an Intel i860 and an HP Snake processor.
processor  | clock rate (MHz) | optimisation level | peak Mflop/s | time taken (sec)
KSR1 early | 20               | -O2                | 40           | 255
KSR1       | 20               | none               | 40           | 333
KSR1       | 20               | -O                 | 40           | 206
KSR1       | 20               | -O2                | 40           | 196
KSR1       | 20               | KAP -O2            | 40           | 251
i860       | 40               | -uniproc -Og       | 80           | 183
HP Snake   | 66               | none               | 22           | 273
HP Snake   | 66               | +O3                | 22           | 96

Table 2: Single processor performance
This 4 frequency and 40 station search was used for varying combinations of the levels of parallelism discussed in section 3, see figure 4. The fastest solve time was reduced from 196 seconds to 15 seconds for 24 threads on a KSR1. This represents an overall speed-up of 12.1. In comparison with the best time on the HP Snake this constitutes a speed-up of 6.4. The Alliant FX/2808 has no run time control of threads; it cannot, therefore, initialise multiple (imperfectly nested) levels of parallelism. Two of these levels were separately parallelised on the Alliant, see table 3. Despite having better single cell performance than the KSR1, the Alliant achieves poor parallel results; this is due to its cache not being used when code is run in parallel.
For this code, optimisations such as subpage alignment, pre-fetch and post-store were implemented. These, however, gave no improvement in performance.
Figure 4: Speed-up against number of processors (0-25) for varying combinations of the levels of parallelism.
The parallel performance of the code at the innermost level, see figure 3, was examined. When this level was first parallelised it produced a significant slow-down due to the run time creation of teams. Creating the team manually turns this serious slow-down into a modest speed-up.
level               | KSR1 (sec) | Alliant (sec)
single processor    | 196        | 183
level 1 (2 threads) | 122        | 282
level 2 (3 threads) | 70         | 180

Table 3: Comparison with an Alliant
Replicating the outer loop inside each parallel section and adding user supplied shared variable synchronisation gives an even greater speed-up; this implies that the KSR's synchronisation calls are currently non-optimal, see table 4.

optimisation     | time taken (sec)
single processor | 196
parallel         | 762
team             | 173
replicated loop  | 125

Table 4: Inner level parallelism
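The replicated-loop variant in table 4 can be sketched as follows. Each section executes the whole integration loop from figure 3, the two INTEG calls are split between the sections, and shared counters (DONE2 and ODONE, both illustrative names, assumed initialised to zero) provide the user controlled synchronisation via spin-waits; the directive spelling is likewise illustrative.

c     sketch only: names and directive spelling are illustrative
c*ksr* parallel sections
c*ksr* section
      DO 10 I = 1, NN
        CALL INTEG(AZ, .. )
c       wait until the other section has integrated CZ for step I
    5   IF (DONE2 .LT. I) GOTO 5
        CALL ORTHO(AZ, CZ, .. )
        ODONE = I
   10 CONTINUE
c*ksr* section
      DO 20 I = 1, NN
        CALL INTEG(CZ, .. )
        DONE2 = I
c       wait until ORTHO for step I is complete before step I+1
   15   IF (ODONE .LT. I) GOTO 15
   20 CONTINUE
c*ksr* end parallel sections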
5: Conclusions

An aeronautics code has been parallelised on a VSM machine, the KSR1. The code ran on a single cell of a KSR1 without modification. The single cell performance is currently below that of similar superscalar processors, due to its limited clock speed.

To parallelise codes the KSR1 provides KSR KAP. This can automatically parallelise loops with simple dependencies by methods such as scalar expansion. It can also perform limited inter procedural analysis to determine if subroutines are side effect free. If, however, the loops are too complex for KAP to examine exhaustively, or the parallelism available is of a type not recognised by KAP, as is the case in this code, the code must be manually parallelised. As well as providing facilities for parallelising loops, the KSR1 provides parallel sections and regions which the user can insert manually around appropriate segments of code. Parallel sections provide a powerful and simple way to implement functional parallelism; they were used in this code to parallelise two independent subroutine calls.

The most complex parallelism found in this code involved an ordered critical region. Implementing an ordered critical region in a loop is non-trivial; the use of a parallel region directive, a lock and an array of condition variables is not intuitive. A higher level construct is needed to deal with this type of available parallelism.

The parallelism in this problem is highly nested. Due to the run time control of threads, parallelism could be examined independently at each level and later combined. This was not possible on the Alliant. Performance results indicate the suitability of parallel architectures for this class of application. Run times for low resolution searches have been reduced from minutes to an acceptable interactive level, without significant algorithm changes.
Acknowledgments
This research was supported by ESPRIT as part of project 2716-AMUS. The author would like to thank Professor Ian Poll and Mark Gallagher from the Aeronautics department of the University of Manchester for access to the code and members of the CNC for invaluable discussion and revision of this manuscript.
References

[1] A.J. Mullender, A.L. Bergin, D.I.A. Poll, Boundary Layer Transition and Control, Royal Aeronautical Society Conference, Cambridge, 8-12 April 1991.
[2] T. Cebeci, Application of CFD By Reduction of Skin Friction Drag, Progress in Astronautics and Aeronautics, Vol. 123, 1990.
[3] D.I.A. Poll, private communication.
[4] D.M. Bushnell, Application of CFD By Reduction of Skin Friction Drag, Progress in Astronautics and Aeronautics, Vol. 123, 1990.
[5] L.M. Mack, Boundary Layer Linear Stability Theory, Special Course on Stability and Transition of Laminar Flow, Advisory Group for Aerospace Research and Development, AGARD Report No. 709, March 1984.
[6] R. Ford, CNC Application Analysis Report, Laminar To Turbulent Transition, December 1991.
[7] KSR1 Programming Manuals, Kendall Square Research Corporation, October 1991.
[8] G. Bell, Ultracomputers: a teraflop before its time, Communications of the ACM, 35(8):26-47, August 1992.