Experiences in Parallelising an Aeronautics Code on the KSR1

R. W. Ford
Centre for Novel Computing, Department of Computer Science, Manchester University, Oxford Road, Manchester, M13 9PL, U.K.
Abstract

Virtual Shared Memory (VSM) has been proposed as the solution to scalable shared memory parallel architectures. This paper reports on parallelising a scientific code from aeronautical engineering on a VSM machine, the KSR1. The code predicts the laminar to turbulent transition point of flow over an aerofoil. The experiences of initial porting and successive optimisation to improve efficiency on a KSR1 are discussed in detail and performance results are presented. Comparisons are made with a traditional shared memory Alliant FX/2808. Performance results indicate the suitability of parallel architectures for this class of application.
1: Introduction

Laminar Flow Control (LFC) [1] is steadily becoming accepted as an important industrial tool for increasing the fuel efficiency of aircraft. In January 1993 a major UK aero engine company test flew a nacelle (engine casing) whose shape had been determined by the application of LFC. LFC achieved laminar flow over 60% of the nacelle, giving a 3% saving in specific fuel consumption over the previous shape. To put this saving into perspective, it has been estimated that for a typical aircraft a 3% reduction in specific fuel consumption will save a company $300,000 per year [2], and it has been noted that a 4% difference in specific fuel consumption will enable an aircraft to corner the market [3]. LFC has even greater impact in reducing skin friction for submarines and long distance pipelines, where skin friction constitutes
80-85% and 100% of total drag, respectively, compared with 50% for aircraft [4]. One way to achieve a reduction in skin friction is to delay the onset of turbulence. In this paper, Boundary Layer Stability theory is used to model the turbulent transition point and the features causing it. Flow properties such as velocity and pressure are split into steady components and small oscillatory disturbance terms. The latter can be shown to satisfy the Orr-Sommerfeld equation [5][6]. In this study, flow over an aerofoil is modelled by a 2-D incompressible fluid. The corresponding Orr-Sommerfeld equation reduces to a 4th order boundary value problem (BVP), see Figure 1. When the amplification rate of the instability wave solution of the Orr-Sommerfeld equation (the N-factor) exceeds a critical value, turbulence occurs.

Figure 1: Air velocity profile over an aerofoil (freestream, boundary layer and aerofoil labelled).
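For reference, the Orr-Sommerfeld equation referred to above is not reproduced in the original; stated here in its standard incompressible form following the literature (e.g. [5]), for disturbance amplitude $\varphi(y)$, base velocity profile $U(y)$, wave speed $c$, streamwise wavenumber $\alpha$ and Reynolds number $Re$:

\[ (U - c)\left(\varphi'' - \alpha^{2}\varphi\right) - U''\varphi \;=\; \frac{1}{i\alpha Re}\left(\varphi'''' - 2\alpha^{2}\varphi'' + \alpha^{4}\varphi\right) \]

The fourth derivative $\varphi''''$ is what makes this a 4th order BVP.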
Starting with an initial approximation, the N-factor is computed by the combination of a shooting method and a Newton-Raphson search technique. A judicious choice of the initial approximation guarantees convergence of the iterative method. The user gives an initial approximate solution for the lowest frequency at the position (station) closest to the front of the aerofoil; the algorithm then automatically generates approximations for each subsequent frequency specified at that station. The solution for the lowest frequency is then used as an initial guess for the next station. The method is therefore self-propagating, see figure 2.
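As a concrete illustration, a minimal sketch of this self-propagating search is given below. The routine names (GUESS0, APPROX, NEWTON) and array shapes are hypothetical, introduced only to show the guess-propagation structure; they are not the routines of the actual code.

      PROGRAM SRCH
c     sketch only: illustrative names, not the authors' code
      INTEGER NSTAT, NFREQ
      PARAMETER (NSTAT=100, NFREQ=4)
      DOUBLE PRECISION SOL(NFREQ), EN(NSTAT,NFREQ)
      INTEGER IS, JF
c     user-supplied approximation: lowest frequency, first station
      CALL GUESS0(SOL(1))
      DO 20 IS = 1, NSTAT
        DO 10 JF = 1, NFREQ
c         seed each higher frequency from the one below it
          IF (JF .GT. 1) CALL APPROX(SOL(JF-1), SOL(JF))
c         shooting method plus Newton-Raphson refines SOL(JF)
c         and returns the N-factor for this station and frequency
          CALL NEWTON(IS, JF, SOL(JF), EN(IS,JF))
   10   CONTINUE
c       SOL(1) now seeds the lowest frequency at station IS+1
   20 CONTINUE
      END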
Figure 2: Stations along the aerofoil.

To determine where turbulence starts for a given aerofoil, a low resolution (1-4 frequencies and approximately 100 stations) search is made from the start of the aerofoil and stopped when the turbulent features are observed. This first step may be repeated if the frequency range does not give the required information. When the instability features have been located, a high resolution (20-40 frequencies and 150-300 stations) profile is made. For some applications this second step may be repeated to obtain even finer resolution.

2: The KSR1

The scalability of shared memory multiprocessors has traditionally been limited to tens of processors due to memory access contention. As a result it has been widely accepted that distributed memory is the key to scalable parallel machines; these machines have, however, been notoriously difficult to program. The KSR1 is a distributed memory machine that provides a single address space, supported by proprietary hardware [7]; the advantage is that the user sees a shared memory programming model. This technique has been termed virtual shared memory (VSM). The term can cause confusion as the KSR1 also supports virtual memory (VM) with an address space of 2^40 bytes (one million megabytes). Gordon Bell cites the KSR1 architecture as most likely to be "... the blueprint for future scalable, massively parallel computers ..." [8]. Each KSR1 processor is a 20MHz RISC-style superscalar 64-bit unit operating at 20 MIP/s and 40 MFLOP/s (peak). A KSR1 system contains from 8 to 1088 processors with a peak performance range from 320 to 43,520 MFLOP/s. Each processor has 0.5Mbyte of subcache, split equally between instructions and data, and 32Mbyte of cache. It is therefore a non-uniform memory access (NUMA) style memory system, see table 1. In this system instructions and data are not bound to specific physical locations; rather, they migrate to where they are being referenced. The KSR1 is a cache-only memory architecture (COMA).

2.1: ALLCACHE memory

In most of today's distributed memory parallel systems, the job of managing the movement of code and data among memory units belongs to the programmer. ALLCACHE memory provides programmers with a uniform address space for instructions and data, called system virtual address space or SVA space. SVA space is physically mapped to a set of memories arranged as local caches, each capable of storing 32Mbyte; there is one local cache for each processor in the system. When an address is referenced that is not in the local cache, the KSR1's search engine (described later) causes that address and its data to be fetched to the local cache. The address and data will remain in that local cache until the space is required for something else. Consistency of data between caches is maintained by distinguishing the type of reference made: if the data in the location is to be modified, the local cache will receive the one and only instance of an address and its data; if, however, the data is read but not modified, the local cache will receive a copy of the address and its data.
subpage location | cycle latency | data fetched (64-bit units)
subcache         | 2             | 1
local cache      | 18            | 8
SE:0             | 175           | 16
SE:1             | 600           | 16
disk             | 400,000       | 16,384

Table 1: Memory Access Latency
The ALLCACHE memory also supports low level synchronisation through instructions which lock and unlock subpages. These instructions can be used to implement multiprocessor synchronisation functions such as data locks, barriers, critical regions, and condition variables (these are available via the KSR1 compiler, libraries, and OS calls).
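A minimal sketch of a data lock built this way follows. GETSP and RELSP are hypothetical wrapper names standing in for the subpage lock/unlock facilities, which in practice are reached through the compiler and library calls mentioned above; they are not the real KSR1 routine names.

      SUBROUTINE ACCUM(TOTAL, X)
c     sketch only: GETSP/RELSP are illustrative wrappers,
c     not the actual KSR1 library calls
      DOUBLE PRECISION TOTAL, X
c     gain exclusive ownership of the subpage holding TOTAL
      CALL GETSP(TOTAL)
      TOTAL = TOTAL + X
c     release the subpage so other threads may lock it
      CALL RELSP(TOTAL)
      END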
The KSR1 search engine is implemented as a two-level hierarchy of uni-directional rings. Each ring is a sequence of point-to-point connections among a set of units, with the last unit in the set being connected back to the first. Each unit is a combination of a router for request/response packets and a directory. The router can move a packet farther along the ring or send it up or down in the hierarchy. All of the units on all rings can operate simultaneously. The lowest level rings are called Search Engine:0's (or SE:0). Each SE:0 can be configured to contain from 8 to 32 processor/local cache pairs. In the KSR1 the packet passing speed of an SE:0 is 8 million packets per second. SE:1's can be configured to handle 8, 16, or 32 million packets per second. Each packet contains 128 bytes of data (the unit of consistency); hence the SE:0 bandwidth is 1 Gbyte/s and the SE:1 bandwidth ranges from 1 to 4 Gbyte/s.
2.2: Parallel Programming with Pthreads

A pthread is a sequential flow of control within a process (conforming to the IEEE draft POSIX standard P1003.4a), which may cooperate with other pthreads to solve a problem. Pthreads are the underlying mechanism used to execute the parallel constructs available to Fortran programmers. These constructs - parallel regions, parallel sections and tile families - form a high level interface to pthreads. The user inserts parallel region and parallel section directives around appropriate blocks of code; these directives are seen as comments by other compilers. Parallel regions enable threads to concurrently execute the same fragments of code, whilst parallel sections enable threads to concurrently execute separate fragments of code.

Tiling is the name given to the parallelisation of DO loops on the KSR1. There are a number of tiling facilities available to the user, varying from fully automatic to fully manual. The fully automatic route to parallelism involves running code through the automatic parallelising pre-processor, KSR KAP, developed by Kuck and Associates; KAP is the most widely accepted automatic parallelising tool and is implemented on most parallel single address space machines. KAP examines DO loops within the code and, with the aid of dependency analysis and dependency elimination, determines which can be executed in parallel; it also performs scalar optimisation. KAP inserts the relevant KSR1 directives around appropriate DO loops, telling the compiler to parallelise them, and provides output telling the user what it has done and why it has failed to tile certain loops, i.e. it tells the user the dependencies it finds. It has no concept of granularity, leaving decisions such as tile size and strategy to the run time system, PRESTO. KAP ensures correctness, not parallelising if there is any potential dependency; it is, however, necessarily cautious.

Semi-automatic tiling allows the user to influence the KSR1's tiling decisions. This involves inserting a ptile ('please tile') directive before the loop. KAP can then either be run in its normal mode, or with a 'noautotile' option which makes it select only loops carrying a ptile statement. If KAP knows it can parallelise a loop it inserts the relevant KSR1 directives. Assertions are available to tell KAP that certain potential hazards, such as subroutine calls, are safe to execute in parallel. Options can be added as parameters to ptile specifying various efficiency related parameters. Manual tiling is achieved by the user inserting the full tiling specification for a DO loop; the user is then entirely responsible for correctness. As in semi-automatic tiling, the user may insert efficiency related parameters. A sketch of a semi-automatically tiled loop is given below.

The parallel programming environment of the KSR1 is based on the run time system PRESTO. PRESTO makes run-time decisions based on compiler-generated or programmer-specified directives: it dynamically decides the level of resources it will devote to a particular parallel task based on the amount of calculation required and the resources available.
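The sketch below shows what semi-automatic tiling of a simple loop might look like. The directive spelling is illustrative only, reconstructed from the description above rather than taken from the KSR manuals [7], so the exact syntax should be checked there.

c     sketch only: directive spelling is illustrative
c*ksr* ptile (i)
      DO 10 I = 1, N
c       independent iterations: a candidate for tiling
        Y(I) = A*X(I) + Y(I)
   10 CONTINUE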
3: Parallelisation

In this work, the objective was to reduce the run time of low resolution searches to an acceptable interactive level without significant algorithmic changes. The sequential code was written in standard Fortran 77. It compiled and ran on a single cell of the KSR1. The KSR1 compiler has two levels of optimisation: -O1 performs scalar optimisation, while -O2 combines this with loop unrolling. Sequential performance may also be enhanced by running the code through KAP with only scalar optimisation switched on.

Due to the nature of the dependencies and KAP's limited inter procedural analysis capability, no worthwhile parallelism was detected. Finding parallelism in this case requires knowledge of the algorithm on the part of the user (although full inter procedural analysis would detect some of the inherent parallelism in the program). The parallelism in this code is highly nested. Parallelism at the innermost level involves two parallel subroutine calls; a simple dependence, due to a variable being updated in one of the subroutines, was easily removed. These calls are located inside the shooting method integration loop, which is sequential, see figure 3. KAP does not currently parallelise subroutine calls. Parallel sections were manually inserted at the appropriate positions. The next level of parallelism is over an independent loop. KAP failed to find this due to the current limitations of its inter procedural analysis. A simple dependence was eliminated by scalar expansion. At the next level a small change in the algorithm was needed to extract parallelism. This is beyond the capabilities of parallelising compilers and probably always will be (hence the need for interactive parallelisation tools).

Figure 3:
      DO I=1,NN
        CALL INTEG(AZ, .. )
        CALL INTEG(CZ, .. )
        CALL ORTHO(AZ, CZ, .. )
        ...
      ENDDO

At the outermost level there is a loop with an ordered critical region. This gives a parallel pipelining effect. Because of a loop carried dependency, the subsequent iteration cannot proceed until a value (the generated approximate solution) has been written by the previous iteration. Moreover there is I/O within this loop which needs to be ordered so as to maintain the form of the sequential output. KAP cannot cope with ordered critical regions. This was implemented on the KSR1 using a parallel region to control the appropriate number of threads, a lock to limit access to a single thread at a time, and a condition variable to order the threads' access; a sketch is given below. Scalar expansion was also necessary in order to make the generated approximate solution available to the next thread. The parallelism available here is slightly greater than the number of frequencies in the simulation, although further increasing the number of threads can be advantageous due to load imbalance.
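A hedged sketch of this ordered critical region follows. LOCKON, LOCKOFF, CONDWT and CONDSG are hypothetical wrappers over the lock and condition variable facilities, TURN is a shared counter, and the generated approximate solution has been scalar-expanded into the shared array GUESS; none of these names come from the actual code, and declarations and initialisation are omitted.

c     executed inside the parallel region by the thread
c     handling frequency J (sketch only: names illustrative)
      CALL LOCKON(LCK)
c     condition wait releases LCK and re-acquires it on wake-up
   10 IF (TURN .NE. J) THEN
        CALL CONDWT(CV, LCK)
        GOTO 10
      ENDIF
c     ordered work: read the approximation left by thread J-1,
c     write GUESS(J+1) for the next thread, do the ordered I/O
      GUESS(J+1) = GUESS(J)
      TURN = TURN + 1
c     signal so that thread J+1 can re-test TURN and proceed
      CALL CONDSG(CV)
      CALL LOCKOFF(LCK)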
3.1: Performance tuning

When a parallel construct is encountered, PRESTO will create a team of threads, of appropriate size, to execute it collectively. On finishing the parallel construct the team is disbanded. Creating a team manually means that the team is never disbanded, so the overhead of team creation is removed for any subsequent executions of the parallel construct, at the expense of run time control over the number of threads used. If the parallel section construct is used inside a sequential loop (e.g. for each call to INTEG in figure 3), replicating the loop and synchronising via user controlled shared variables may increase efficiency.

A pre-fetch enables a thread to request either ownership of, or a copy of, a subpage before it is needed, reducing memory latency. Currently only 3 pre-fetches are allowed per processor before the processor stalls. There is a small (few cycle) set up overhead in using pre-fetch. Post-store allows a thread which has written to a variable, or set of variables on the same subpage, to broadcast this update to any processor which, before the update, had a valid copy. This can reduce the latency of subsequent reads of the variable and may spread out the communication between cells. There is also a small (few cycle) set up overhead in using post-store.

Subpage alignment is available to eliminate false sharing: independent variables or arrays may lie on the same subpage and cause thrashing. To eliminate this, variables and arrays can be separated onto different subpages, as sketched below.
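A minimal sketch of such a separation, assuming the common block itself starts on a subpage boundary (an assumption; in practice the KSR1's alignment facilities would be used), pads two independently updated flags so that each occupies its own 128-byte subpage (sixteen 64-bit words):

c     sketch only: padding keeps FLAG1 and FLAG2, which are
c     written by different threads, on separate 128-byte subpages
      DOUBLE PRECISION FLAG1, PAD1(15), FLAG2, PAD2(15)
      COMMON /SHARED/ FLAG1, PAD1, FLAG2, PAD2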
4: Results

The computation was performed on a 32 processor KSR1 at the University of Manchester. Single processor performance has increased by approximately 30% over the lifetime of the KSR1, due to improvements in software and a hardware upgrade. Table 2 shows the KSR1 single cell performance for a typical search of 4 frequencies and 40 stations, with varying optimisation. Comparisons are given with an Intel i860 and an HP Snake processor.
processor  | clock rate (MHz) | optimisation level | peak Mflop/s | time taken (sec)
KSR1 early | 20               | -O2                | 40           | 255
KSR1       | 20               | none               | 40           | 333
KSR1       | 20               | -O                 | 40           | 206
KSR1       | 20               | -O2                | 40           | 196
KSR1       | 20               | KAP -O2            | 40           | 251
i860       | 40               | -uniproc -Og       | 80           | 183
HP Snake   | 66               | none               | 22           | 273
HP Snake   | 66               | +O3                | 22           | 96

Table 2: Single processor performance
This 4 frequency and 40 station search was used for varying combinations of the levels of parallelism discussed in section 3, see figure 4. The fastest solve time was reduced from 196 seconds to 15 seconds for 24 threads on a KSR1. This represents an overall speed-up of 12.1. In comparison with the best time on the HP Snake this constitutes a speed-up of 6.4. The Alliant FX/2808 has no run time control of threads; it cannot, therefore, initialise multiple (imperfectly nested) levels of parallelism. Two of these levels were separately parallelised on the Alliant, see table 3. Despite having better single cell performance than the KSR1, the Alliant achieves poor parallel results; this is due to its cache not being used when code is run in parallel.
For this code, optimisations such as subpage alignment, pre-fetch and post-store were implemented. These, however, gave no improvement in performance.
Figure 4: Speed-up against number of processors (0-25) for varying combinations of the levels of parallelism.
The parallel performance of the code at the innermost level, see figure 3, was examined. When this level was first parallelised it produced a significant slow-down due to the run time creation of teams. Creating the team manually turns this serious slow-down into a modest speed-up.
level               | KSR1 (sec) | Alliant (sec)
single processor    | 196        | 183
level 1 (2 threads) | 122        | 282
level 2 (3 threads) | 70         | 180

Table 3: Comparison with an Alliant
Replicating the outer loop inside each parallel section and adding user supplied shared variable synchronisation gives an even greater speed-up; this implies that the KSR's synchronisation calls are currently non-optimal, see table 4.

optimisation     | time taken (sec)
single processor | 196
parallel         | 762
team             | 173
replicated loop  | 125

Table 4: Inner level parallelism
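The replicated-loop variant in table 4 can be sketched as follows. Each section executes the whole integration loop from figure 3, the two INTEG calls are split between the sections, and shared counters (DONE2 and ODONE, both illustrative names, assumed initialised to zero) provide the user controlled synchronisation via spin-waits; the directive spelling is likewise illustrative.

c     sketch only: names and directive spelling are illustrative
c*ksr* parallel sections
c*ksr* section
      DO 10 I = 1, NN
        CALL INTEG(AZ, .. )
c       wait until the other section has integrated CZ for step I
    5   IF (DONE2 .LT. I) GOTO 5
        CALL ORTHO(AZ, CZ, .. )
        ODONE = I
   10 CONTINUE
c*ksr* section
      DO 20 I = 1, NN
        CALL INTEG(CZ, .. )
        DONE2 = I
c       wait until ORTHO for step I is complete before step I+1
   15   IF (ODONE .LT. I) GOTO 15
   20 CONTINUE
c*ksr* end parallel sections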
5: Conclusions

An aeronautics code has been parallelised on a VSM machine, the KSR1. The code ran on a single cell of a KSR1 without modification. The single cell performance is currently below that of similar superscalar processors, due to its limited clock speed.

To parallelise codes the KSR1 provides KSR KAP. This can automatically parallelise loops with simple dependencies by methods such as scalar expansion. It can also perform limited inter procedural analysis to determine if subroutines are side effect free. If, however, the loops are too complex for KAP to examine exhaustively, or the parallelism available is of a type not recognised by KAP, as is the case in this code, the code must be manually parallelised. As well as providing facilities for parallelising loops, the KSR1 provides parallel sections and regions which the user can insert manually around appropriate segments of code. Parallel sections provide a powerful and simple way to implement functional parallelism; they were used in this code to parallelise two independent subroutine calls.

The most complex parallelism found in this code involved an ordered critical region. Implementing an ordered critical region in a loop is non-trivial; the use of a parallel region directive, a lock and an array of condition variables is not intuitive. A higher level construct is needed to deal with this type of available parallelism.

The parallelism in this problem is highly nested. Due to the run time control of threads, parallelism could be examined independently at each level and later combined. This was not possible on the Alliant. Performance results indicate the suitability of parallel architectures for this class of application. Run times for low resolution searches have been reduced from minutes to an acceptable interactive level, without significant algorithm changes.
Acknowledgments
This research was supported by ESPRIT as part of project 2716-AMUS. The author would like to thank Professor Ian Poll and Mark Gallagher from the Aeronautics department of the University of Manchester for access to the code and members of the CNC for invaluable discussion and revision of this manuscript.
References

[1] A.J. Mullender, A.L. Bergin, D.I.A. Poll, Boundary Layer Transition and Control, Royal Aeronautical Society Conference, Cambridge, 8-12 April 1991.
[2] T. Cebeci, Application of CFD By Reduction of Skin Friction Drag, Progress in Astronautics and Aeronautics, Vol. 123, 1990.
[3] D.I.A. Poll, private communication.
[4] D.M. Bushnell, Application of CFD By Reduction of Skin Friction Drag, Progress in Astronautics and Aeronautics, Vol. 123, 1990.
[5] L.M. Mack, Boundary Layer Linear Stability Theory, Special Course on Stability and Transition of Laminar Flow, Advisory Group for Aerospace Research and Development, AGARD Report No. 709, March 1984.
[6] R. Ford, CNC Application Analysis Report, Laminar To Turbulent Transition, December 1991.
[7] KSR1 Programming Manuals, Kendall Square Research Corporation, October 1991.
[8] G. Bell, Ultracomputers: a teraflop before its time, Communications of the ACM, 35(8):26-47, August 1992.