Centre for Novel Computing Application Analysis Report
Numerical Weather Prediction at ECMWF CNC/NC/AR/1991/044.001
Mark Bull and Rupert Ford December 1991
Contents

1 Introduction
2 The Application
  2.1 Overview and Context
  2.2 Physical Description
  2.3 Mathematical Description
3 Current Implementation
  3.1 The Spectral Transform Method
  3.2 The ECMWF model
  3.3 Recent developments
  3.4 Execution
4 Parallel methods
  4.1 Reasons for parallelism
  4.2 Potential for parallelism
  4.3 Other Methods
5 Conclusions
1 Introduction

The CNC series of Application Analysis Reports are top-down reviews of computationally intensive applications. By taking a top-down approach we can study methods to efficiently implement them on massively parallel machines. This both aids the user and provides CNC with a study of the difficulties encountered in parallelising a real problem. This report is based on the weather prediction model developed and run by the European Centre for Medium-Range Weather Forecasts (ECMWF). As their name suggests, ECMWF are an international centre funded by a number of European countries with the aim of developing and running computer weather forecasts. We present the conclusions and recommendations of three person-months spent studying their problem, the way in which they solve it, and the possible opportunities for increasing the accuracy of their forecasts by using a massively parallel computer. The work here has been based on papers describing the model, the algorithm, weather modelling in general, and so on. The code itself was not examined; this was a conscious decision, as the relevant information was available at a higher level, underlining the centre's belief in top-down analysis and programming. Whilst writing this report various documents and information have been obtained from ECMWF; in particular we would like to thank Tuomo Kauranne for his assistance.
2 The Application

2.1 Overview and Context

The applications of weather forecasting are many and diverse. They include both military and civil aviation, shipping, agriculture and the construction industry, as well as national and local media which, although publicly very visible, are a relatively minor activity. (For example, the UK Meteorological Office employs over 2000 staff, of whom about half a dozen appear on national television and radio.) Although all forecasts are based at least in part on the output from numerical weather prediction models (computer simulations of the atmosphere), there is still substantial human input in the process of providing a forecast to a user or customer. One reason for this, as we shall see, is that the numerical forecasts cannot predict local variations in weather. The production of weather forecasts is normally funded by governments, as forecast information is of strategic importance for defence purposes, and supplemented by revenue generated from commercial sales. ECMWF receives its funding from the governments of a number of European countries, and is solely concerned with the production of numerical forecast data, without any
subsequent human interpretation, or providing any peripheral services. This includes a substantial research effort directed at improving the quality of the forecasts, as well as daily production runs. ECMWF specialises in longer range forecasts, from about two days (after which time a simulation needs to have at least hemispherical coverage) out to ten days, which is approximately the current time limit beyond which numerical forecasts are of insufficient accuracy to provide useful information. Clearly there is a strong real-time constraint on the time which they can allow to elapse from data time (when observations are made) to the time when the forecast is available for dissemination (for ECMWF this is of the order of 18 hours). This constraint, coupled with the desire for accuracy, means that numerical weather prediction has always been an application which has made use of machines at the leading edge of computer technology. Nowadays most weather forecast models are run on a dedicated vector supercomputer, and increasingly on machines with some coarse-grained parallel capability. The ECMWF model is a typical example, being currently implemented on an 8 processor Cray Y-MP.

Predictions are made not only of the basic meteorological variables such as wind velocity and temperature, but also of quantities such as rainfall, cloud cover and sea state. The initial conditions for the forecast are obtained from a world-wide network of observations which includes ground stations, radiosondes (balloons), satellites, aircraft, ships and drifting ocean buoys. The raw data is collected via an extensive global telecommunications network. Strict quality control procedures are employed when the data is combined with a previous forecast, using a complicated statistically-based assimilation technique, to provide the initial conditions for the forecast model in a suitable form. When the forecast is complete, extensive post-processing of the results is carried out, largely to convert numerical output to graphical form. The complete cycle from observations to forecast product must be fitted into a period of eighteen hours or so, and both hardware and software need to meet high standards of reliability. The data assimilation and post-processing are substantial computational tasks in their own right, but in this report we shall be concerned only with the forecast model itself.
2.2 Physical Description

Numerical weather prediction (NWP) requires the solution of a fluid dynamics problem in a thin shell (the atmosphere) surrounding a rotating sphere (the earth). The effects of temperature and water vapour on the flow are important (for example, the energy of a North Atlantic low pressure system comes in almost equal parts from the potential energy of a north-south temperature gradient and the latent heat released by condensation of water vapour). There are also a variety of sources and sinks of momentum, temperature and humidity, which must be included. These include surface friction, radiation (both incoming solar and outgoing infra-red from ground and clouds), evaporation, condensation and precipitation. The lower boundary condition is complicated: the surface of the earth is not flat, and many of the source and sink processes occur here. The upper boundary condition is less crucial, because there is only weak interaction between the troposphere (the lowest layer of the atmosphere, typically 6 to 12 km deep, where most weather occurs) and the upper layers of the atmosphere. It nevertheless requires some care, because the numerical boundary condition has to be imposed at some finite height, even though there is no physical boundary at any finite height.

The physical phenomena in the atmosphere take place on a wide variety of scales. The way in which they are modelled is governed by the available computational power. Small scale processes cannot be modelled directly; their effects on the larger scale processes must be modelled in terms of larger scale variables. In meteorology this is referred to as parametrisation. The fluid flow takes place on all scales, from the largest weather systems, which may span several thousand kilometres, to the smallest turbulent eddies, which have a scale of about 1 mm. To discretise the entire fluid flow problem would require of the order of 10^22 grid points, which is many orders of magnitude more than can be handled by any existing computer. Thus the smaller scale flows must be parametrised. The effects of the smaller scales can be thought of as (and appear in the equations as) sources and sinks of the large scale variables. The processes that constitute true physical sources and sinks also take place almost exclusively on small scales, so they too must be parametrised. It is therefore convenient to divide the modelling process into two categories: the resolved part and the unresolved or parametrised part. Meteorologists frequently (and somewhat confusingly) refer to these as `the dynamics' and `the physics' respectively. Exactly what is modelled in each part is determined by the available computational power. With current supercomputers, the length scale dividing the resolved and unresolved parts of the model is in the range of 50 to 100 km.

The resolved part of the model contains the large scale fluid dynamics and thermodynamics. As we shall see in the following section, the large scale variables are the wind velocity, temperature, humidity and surface pressure. The unresolved processes include the following:
Radiation  This includes both incoming radiation from the sun in the short infra-red and outgoing, longer infra-red from the ground and clouds. On a very simple level, the incoming radiation is absorbed by the ground. This then warms the air close to the surface, and the heat is transported upwards by turbulent mixing (see below). The ground and clouds lose heat by radiating into space. In practice, however, there are complicated interactions involving absorption, reflection and transmission at different levels in the atmosphere; modelling these sufficiently accurately results in radiation being the most computationally expensive of the parametrised processes.

Condensation, evaporation and precipitation  These are major sources and sinks of moisture in the atmosphere.

Convection  This is a process by which heating of the lower layers of the atmosphere causes the air to become unstably stratified, and results in overturning which can extend throughout the whole depth of the troposphere. This is the process that produces showery rain and thunderstorms. (N.B. The term convection is widely used in fluid dynamics simply to mean the transport of variables by the fluid flow. In meteorology it is used to refer to the specific process described above; the term advection is used for the general transport of variables by the flow.)

Gravity wave drag  Air flowing over mountains can result in wave-like motions on a scale of a few kilometres. These waves can transport momentum vertically in the atmosphere, and slow down the flow at levels much higher than the mountain peaks.

Turbulent mixing  The atmosphere has a turbulent boundary layer which can vary from about 200 to 2000 metres deep. The turbulent eddies transport momentum, heat and moisture throughout the boundary layer, and are an important mechanism in the transfer of these quantities between the surface and the bulk of the atmosphere.

Surface processes  The amounts of momentum, heat and moisture transferred between the surface and the boundary layer are governed by surface characteristics such as surface temperature, surface moisture, surface roughness, vegetative cover, snow cover and so on. These can all change in both time and space, and must be estimated.
The construction and testing of algorithms for the parametrised processes consumes a large part of the research effort in numerical weather prediction. It is a difficult area, both because good observational data is hard (and expensive) to obtain, and because the existence of a wide variety of feedback mechanisms makes evaluating the effect of new algorithms on the whole model very tricky. (To illustrate this: during its development, the U.K. Met. Office's mesoscale model, covering the British Isles at a resolution of 15 km, contained a sign error in the transfer of moisture from surface to atmosphere which had the effect of predicting too much fog and low cloud. The bug went undetected for over two years.)

In this report we will not be concerned with the details of the unresolved part of the model. There appears to be far less inherent parallelism in the parametrisation algorithms than in the resolved part of the model, and (as we shall see in §4.2) even if one were considering a machine at some future date with tens of thousands of processors, it would not be necessary to attempt to exploit parallelism at the level of the unresolved processes. It is sufficient for our purposes simply to know what resolved data is required for the modelling of these processes.
2.3 Mathematical Description

As a preliminary to presenting the equations governing the resolved part of the model, we describe the somewhat unusual coordinate system which is now popular in NWP. The horizontal coordinates are the usual spherical coordinates: longitude $\lambda$ and $\mu = \sin\theta$, where $\theta$ is the latitude. The vertical coordinate is a pressure-based terrain-following coordinate $\eta$. This must be a monotonic function of pressure $p$ and will also depend on the surface pressure $p_s$, so that $\eta = \eta(p, p_s)$ and is such that

$$\eta(0, p_s) = 0 \quad \text{and} \quad \eta(p_s, p_s) = 1.$$

We shall adopt the most commonly used vertical coordinate,

$$\sigma = \frac{p}{p_s}.$$

Use of this type of vertical coordinate has the advantage that the lowest coordinate surface is the earth's surface, making the lower boundary condition easy to implement. Conversely, its use does make the governing equations more complex. We will use $\nabla$ to denote the del operator acting in the two horizontal coordinate directions only, so that $\nabla^2$ is the corresponding horizontal Laplacian. Most, if not all, current NWP models use the so-called primitive equations. These are derived from Newton's laws of motion, the principle of mass continuity and the first law of thermodynamics, with the following assumptions:
- The atmosphere is shallow; that is, the depth of the atmosphere is small compared to the radius of the earth.
- The atmosphere behaves as an inviscid ideal gas.
- Surfaces of constant apparent gravitational potential are spherical.
- Some of the apparent accelerations resulting from the curvature of the coordinate system and the rotation of the earth are sufficiently small as to be negligible.
- Vertical accelerations are negligible (the hydrostatic approximation).

For more details see, for example, [5]. The approximations have two effects. They reduce the number of equations required, and they eliminate terms in the remaining equations which turn out to be numerically several orders of magnitude smaller than other terms. For example, there is no equation for the rate of change of the vertical component of velocity. This is convenient, since vertical velocities in the atmosphere are small, vary considerably over short distances and are consequently hard to measure. These approximations result in prognostic equations for $U$ and $V$, the components of the wind field, the temperature $T$ and the humidity $q$. (A prognostic equation is one which relates the rate of change of a variable in time to the current state of the model atmosphere.) These are

$$\frac{\partial U}{\partial t} - \zeta V + \dot\sigma \frac{\partial U}{\partial \sigma} + \frac{R_d T_v}{a} \frac{\partial \ln p}{\partial \lambda} + \frac{1}{a} \frac{\partial}{\partial \lambda}(\phi + E) = P_U \qquad (1)$$

$$\frac{\partial V}{\partial t} + \zeta U + \dot\sigma \frac{\partial V}{\partial \sigma} + \frac{R_d T_v (1-\mu^2)}{a} \frac{\partial \ln p}{\partial \mu} + \frac{1-\mu^2}{a} \frac{\partial}{\partial \mu}(\phi + E) = P_V \qquad (2)$$

$$\frac{\partial T}{\partial t} + \frac{U}{a(1-\mu^2)} \frac{\partial T}{\partial \lambda} + \frac{V}{a} \frac{\partial T}{\partial \mu} + \dot\sigma \frac{\partial T}{\partial \sigma} - \frac{\kappa T_v \omega}{(1 + (\delta - 1)q)p} = P_T \qquad (3)$$

$$\frac{\partial q}{\partial t} + \frac{U}{a(1-\mu^2)} \frac{\partial q}{\partial \lambda} + \frac{V}{a} \frac{\partial q}{\partial \mu} + \dot\sigma \frac{\partial q}{\partial \sigma} = P_q \qquad (4)$$

In these equations, $\zeta$ is the absolute vorticity, given by

$$\zeta = 2\Omega\mu + \xi$$

where $\Omega$ is the earth's rotation rate and $\xi$ is the relative vorticity,

$$\xi = \nabla \times \mathbf{U}$$

and $\mathbf{U} = (U, V)^T$. (For those unfamiliar with fluid dynamics, vorticity is a tricky concept. Informally it can be thought of as the rate at which a small paddle wheel placed in the flow would rotate. In this case the axis of the paddle wheel is vertical, and since we are dealing with large scale flow, the `small' paddle wheel would actually have to be quite large. Relative vorticity is the rate at which the wheel would be seen to rotate by an observer on the earth, while absolute vorticity would be the rate seen by an observer in space, stationary with respect to the earth's axis.)

The geopotential $\phi$ is given by a form of the hydrostatic equation

$$\phi = \phi_s + \int_\sigma^1 R_d T_v \, \frac{d\sigma}{\sigma} \qquad (5)$$

where $\phi_s$ is the surface geopotential. (The geopotential is the gravitational potential energy per unit mass, which, with the assumptions we are making, is just the product of the acceleration due to gravity with the distance from the earth's centre. It becomes a variable because we are working on sigma coordinate surfaces, whose physical height changes in space and time.) $E$ is defined by

$$E = \frac{U^2 + V^2}{2(1 - \mu^2)}$$

and $D$ is the divergence,

$$D = \nabla \cdot \mathbf{U}$$

which, roughly, is the amount of spreading out of the flow on a sigma surface. $a$ is the radius of the earth, $p$ is the pressure, and $T_v$ is the virtual temperature, given by

$$T_v = T \left( 1 + \left( \frac{1}{\epsilon} - 1 \right) q \right).$$

(This is the temperature that moist air would have if all the water vapour were condensed out of it.) $R_d$, $\kappa$, $\delta$ and $\epsilon$ are thermodynamic constants. $P_U$, $P_V$, $P_T$ and $P_q$ represent contributions from the unresolved processes. We also have a mass continuity equation

$$\frac{\partial}{\partial t}\left(\frac{\partial p}{\partial \sigma}\right) + p_s \tilde{D} + p_s \frac{\partial \dot\sigma}{\partial \sigma} = 0 \qquad (6)$$

(note that for $\sigma = p/p_s$ we have $\partial p/\partial \sigma = p_s$), where

$$\tilde{D} = D + \frac{U}{a(1-\mu^2)} \frac{\partial \ln p_s}{\partial \lambda} + \frac{V}{a} \frac{\partial \ln p_s}{\partial \mu}. \qquad (7)$$

$\omega$ is the pressure-coordinate vertical velocity, given by

$$\omega = -\int_0^\sigma p_s \tilde{D} \, d\sigma + \frac{U}{a(1-\mu^2)} \frac{\partial p}{\partial \lambda} + \frac{V}{a} \frac{\partial p}{\partial \mu}.$$

Explicit prognostic equations for surface pressure and $\dot\sigma$ can be obtained by integrating (6), using the boundary conditions $\dot\sigma = 0$ at $\sigma = 0$ and $\sigma = 1$, to give

$$\frac{\partial \ln p_s}{\partial t} = -\int_0^1 \tilde{D} \, d\sigma \qquad (8)$$

and

$$\dot\sigma = -\sigma \frac{\partial \ln p_s}{\partial t} - \int_0^\sigma \tilde{D} \, d\sigma. \qquad (9)$$

We now have a closed set of equations describing the evolution of the five basic variables $U$, $V$, $T$, $q$ and $\ln p_s$. The next stage in the modelling process is the development of a suitable numerical method for the solution of these equations. Historically there have been two fundamentally different approaches here. The first approach is to use finite difference methods; this was done in all early forecast models and is still employed successfully by the U.K. Meteorological Office. Other major forecast centres, including ECMWF, who converted in 1983, now run models based on the second approach, the spectral transform method. In the following section we describe this approach in general, and then present the main features of the ECMWF implementation.
3 Current Implementation

3.1 The Spectral Transform Method

The equations of §2.3 can be written in the general form

$$\frac{\partial F_i}{\partial t} = A_i(F_1, \ldots, F_r) \qquad i = 1, \ldots, r$$

where $F_1, \ldots, F_r$ are the prognostic variables and the $A_i$ are operators which include non-linear terms, partial spatial derivatives and some spatial integrals. As we have mentioned, these equations can be solved by quite standard finite difference methods. There are, however, some attractions in representing the fields $F_i$ in spectral form (that is, by a sum of suitable basis functions) in one or more of the three spatial dimensions. The basis functions are chosen to have `nice' mathematical properties; in particular it must be possible to approximate functions accurately by a weighted sum of basis functions. Provided the fields are suitably smooth (which is a reasonable assumption for meteorological variables on large scales), greater accuracy can be obtained using the same number of degrees of freedom than is possible with finite difference methods. With the right choice of basis functions some of the sub-operators of the $A_i$, notably partial spatial derivatives, which can be difficult to handle well using finite difference methods, become almost trivial to apply to the transformed fields. Furthermore there is a restriction on the timestep that can be used in any model if numerical stability is to be maintained (see §3.3). This restriction is generally less strict for spectral models than for finite difference models of the same spatial accuracy.

If we denote the approximation to a field $F$ in terms of the basis functions by $\bar{F}$, we would like to solve the equations

$$\frac{\partial \bar{F}_i}{\partial t} = A_i(\bar{F}_1, \ldots, \bar{F}_r) \qquad i = 1, \ldots, r$$

This is not possible, however, since $A_i(\bar{F}_1, \ldots, \bar{F}_r)$ will not usually be in the span of the basis functions (i.e. it cannot be written exactly as a weighted sum of basis functions). To obtain a soluble system, we approximate each $A_i(\bar{F}_1, \ldots, \bar{F}_r)$ by a sum of the basis functions, $\bar{A}_i$, so we actually solve

$$\frac{\partial \bar{F}_i}{\partial t} = \bar{A}_i(\bar{F}_1, \ldots, \bar{F}_r) \qquad i = 1, \ldots, r \qquad (10)$$
Unfortunately it is not practicable to compute all the terms of the RHS of (10). The cost of evaluating non-linear terms rises rapidly with increasing resolution (number of basis functions), and the inclusion of the unresolved contributions is completely intractable. A solution to this problem is to store fields both in terms of basis functions (that is, in spectral space) and on a grid, as one would in a finite difference model. The key concept is now to transform the fields from spectral space to grid space, compute the non-linear and parametrised terms of the operators $A_i$ in grid space, transform these tendencies back to spectral space and compute the spatial derivative terms in spectral space. This is the spectral transform method, and despite the cost of performing the transform operations, it is now well recognised that this method is competitive with finite difference models in terms of computational cost incurred to achieve a given accuracy in the forecast.

It is usual to use the spectral method for the two horizontal spatial dimensions and to employ finite difference (or possibly finite element) methods in the vertical direction. Large scale vertical velocities in the atmosphere are two or three orders of magnitude smaller than horizontal velocities, so the accurate treatment of vertical derivatives is not as important as that of horizontal derivatives, and using gridpoint methods avoids difficulties with the boundary conditions.

The natural choice of basis functions are the so-called spherical harmonics. These consist of a Fourier series of sines and cosines in the East/West direction and associated Legendre functions in the North/South direction. The horizontal variation of a variable $F$ is given by

$$F(\lambda, \mu) = \sum_{m=-M}^{M} \sum_{n=|m|}^{N} F_n^m P_n^m(\mu) e^{im\lambda} \qquad (11)$$

where the $P_n^m$ are given by
$$P_n^m(\mu) = \sqrt{(2n+1)\frac{(n-m)!}{(n+m)!}} \; \frac{1}{2^n n!} (1-\mu^2)^{m/2} \frac{d^{n+m}}{d\mu^{n+m}} (\mu^2 - 1)^n \qquad m \ge 0$$

and

$$P_n^{-m}(\mu) = P_n^m(\mu).$$

Since $F$ is real, $F_n^{-m}$ is just the complex conjugate of $F_n^m$, so the $F_n^m$ need only be stored for $m \ge 0$. The spatial derivatives of $F$ have simple forms:

$$\frac{\partial F}{\partial \lambda} = \sum_{m=-M}^{M} \sum_{n=|m|}^{N} im \, F_n^m P_n^m(\mu) e^{im\lambda}$$

and

$$\frac{\partial F}{\partial \mu} = \sum_{m=-M}^{M} \sum_{n=|m|}^{N} F_n^m \frac{dP_n^m}{d\mu} e^{im\lambda}$$

where the $dP_n^m/d\mu$ are known analytically. The horizontal Laplacian also has a very neat representation:

$$\nabla^2 F = -\sum_{m=-M}^{M} \sum_{n=|m|}^{N} \frac{n(n+1)}{a^2} F_n^m P_n^m(\mu) e^{im\lambda}$$
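Seen in one dimension, the `almost trivial' derivative property is just the fact that differentiating $e^{im\lambda}$ multiplies it by $im$. The following is a minimal sketch of this (our own illustration in Python, not part of the ECMWF code; the test field is invented):

```python
# 1-D illustration (ours) of differentiation in spectral space: the derivative
# of a periodic field is obtained by multiplying each Fourier coefficient by
# i*m, with no finite differencing involved.
import numpy as np

nlon = 64
lam = 2 * np.pi * np.arange(nlon) / nlon
F = np.exp(np.sin(lam))                     # a smooth periodic test field

Fm = np.fft.fft(F)
m = np.fft.fftfreq(nlon, d=1.0 / nlon)      # integer wavenumbers 0,1,...,-1
dF = np.real(np.fft.ifft(1j * m * Fm))      # derivative computed spectrally

exact = np.cos(lam) * np.exp(np.sin(lam))
print("max error:", np.max(np.abs(dF - exact)))  # round-off level here
```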
The values of $M$ and $N$ determine the effective spatial resolution of the model. The two values should be related so that resolution is not concentrated in one particular direction. There are two main candidates for $N$, given a choice of $M$: either $N = M$ (triangular truncation) or $N = M + |m|$ (rhomboidal truncation). A series of experiments (including some at ECMWF) were carried out in the late 1970's to compare the two, and the triangular truncation turned out to be preferable both from a computational and from a meteorological point of view.

Equation (11) gives the basis for the inverse transform from spectral space to grid space. The transform from grid to spectral space is derived from the following identity:

$$F_n^m = \frac{1}{4\pi} \int_{-1}^{1} \int_0^{2\pi} F(\lambda, \mu) P_n^m(\mu) e^{-im\lambda} \, d\lambda \, d\mu \qquad (12)$$

For the integral over $\lambda$ (the Fourier transform) a minimum of $3N + 1$ points round each latitude circle in the East/West direction is required for the quadrature (numerical integration) to be exact. It is best to choose $N$ so that $3N + 1$ is equal to or just less than a number whose prime factors are all small ($\le 5$, say). This allows efficient use of Fast Fourier Transform (FFT) methods for computing the integral over $\lambda$. The integral over $\mu$ (the Legendre transform) in (12) is calculated by Gaussian quadrature. If the correct (Gaussian) distribution of gridpoints (which is almost regular for sufficiently large $N$) is used, then the quadrature rule can be made exact for quadratic terms. To achieve this requires a minimum of $(3N+1)/2$ points from pole to pole in the North/South direction.
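Both transforms are straightforward to demonstrate numerically. The sketch below (our own illustration, not ECMWF code; the toy truncation and the choice of scipy routines are ours) synthesises a field from a few spectral coefficients using (11), then recovers the coefficients using (12), with an FFT in $\lambda$ and Gaussian quadrature in $\mu$. Any Condon-Shortley phase convention buried in the library Legendre routine cancels in the round trip.

```python
# Round-trip spectral transform sketch (illustrative only, not ECMWF code):
# synthesis by eq. (11), analysis by eq. (12) using an FFT in longitude and
# Gaussian quadrature in mu = sin(latitude).
import math
import numpy as np
from scipy.special import lpmv

N = 21                        # triangular truncation (illustrative, not T213)
nlon = 3 * N + 1              # minimum number of points per latitude circle
nlat = (3 * N + 1) // 2 + 1   # at least (3N+1)/2 Gaussian latitudes
mu, w = np.polynomial.legendre.leggauss(nlat)  # Gaussian points and weights
lam = 2 * np.pi * np.arange(nlon) / nlon       # equally spaced longitudes

def pbar(n, m, mu):
    """Normalised associated Legendre function P_n^m as in section 3.1.
    Any Condon-Shortley phase inside lpmv cancels in the round trip."""
    norm = math.sqrt((2 * n + 1) * math.factorial(n - m) / math.factorial(n + m))
    return norm * lpmv(m, n, mu)

# Synthesis (11): coefficients stored for m >= 0 only; the m < 0 terms follow
# from conjugate symmetry because the field is real.
coeffs = {(3, 0): 1.0 + 0.0j, (2, 1): 0.7 - 0.3j, (5, 4): 0.2 + 0.1j}
F = np.zeros((nlat, nlon))
for (n, m), c in coeffs.items():
    mode = c * np.outer(pbar(n, m, mu), np.exp(1j * m * lam))
    F += np.real(mode if m == 0 else 2 * mode)

# Analysis (12): the FFT gives the Fourier coefficients on each latitude
# circle; the mu-integral is a weighted sum over the Gaussian latitudes.
Fm = np.fft.fft(F, axis=1) / nlon
for (n, m), c in coeffs.items():
    cnm = 0.5 * np.sum(w * Fm[:, m] * pbar(n, m, mu))
    print((n, m), "original", c, "recovered", np.round(cnm, 12))
```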
3.2 The ECMWF model

In this section we show (roughly) how the spectral transform technique is currently applied to the primitive equations in the ECMWF model. It is worth emphasising that there is some freedom of choice here. This is partly because it is possible to form linear combinations of terms either in grid space or in spectral space. There are complicated trade-offs between the amount of storage required, the number of fields passed through each transform process, memory access times and the need to make certain variables available for use in the parametrisation algorithms. The optimal choice is strongly dependent on the hardware (and memory management software) being used, and the method has been, and no doubt will continue to be, subject to revision along these lines. The following description is therefore just one way of organising the computation, and should not be thought of as a definitive description of the algorithm. For further details see [4]. These reorganisations do not have a significant effect on the structure of the algorithm with regard to parallelisation.
It turns out to be slightly more convenient in a spectral model to use the relative vorticity $\xi$ and the divergence $D$ as basic variables instead of $U$ and $V$. Prognostic equations for $\xi$ and $D$ can be derived from (1) and (2):

$$\frac{\partial \xi}{\partial t} = \frac{1}{a(1-\mu^2)} \frac{\partial F_v}{\partial \lambda} - \frac{1}{a} \frac{\partial F_u}{\partial \mu} \qquad (13)$$

$$\frac{\partial D}{\partial t} = \frac{1}{a(1-\mu^2)} \frac{\partial F_u}{\partial \lambda} + \frac{1}{a} \frac{\partial F_v}{\partial \mu} - \nabla^2 G \qquad (14)$$

where

$$F_u = \zeta V - \dot\sigma \frac{\partial U}{\partial \sigma} - \frac{R_d T_v}{a} \frac{\partial \ln p}{\partial \lambda} + P_U,$$

$$F_v = -\zeta U - \dot\sigma \frac{\partial V}{\partial \sigma} - \frac{R_d T_v (1-\mu^2)}{a} \frac{\partial \ln p}{\partial \mu} + P_V$$

and

$$G = \phi + E.$$

From (13), (14), (3), (4) and (8), prognostic equations for the spectral coefficients of $\xi$, $D$, $T$, $q$ and $\ln p_s$ at each level in the vertical can be derived. Integration by parts and the spatial derivative formulae are used to obtain numerically convenient forms of these equations. These are

$$\frac{\partial \xi_n^m}{\partial t} = \frac{1}{4\pi a} \int_{-1}^{1} \int_0^{2\pi} \left( \frac{im}{1-\mu^2} F_v P_n^m + F_u \frac{dP_n^m}{d\mu} \right) e^{-im\lambda} \, d\lambda \, d\mu \qquad (15)$$

$$\frac{\partial D_n^m}{\partial t} = \frac{1}{4\pi a} \int_{-1}^{1} \int_0^{2\pi} \left( \frac{im}{1-\mu^2} F_u P_n^m - F_v \frac{dP_n^m}{d\mu} + \frac{n(n+1)}{a} G P_n^m \right) e^{-im\lambda} \, d\lambda \, d\mu \qquad (16)$$

$$\frac{\partial T_n^m}{\partial t} = \frac{1}{4\pi} \int_{-1}^{1} \int_0^{2\pi} \left( -\frac{U}{a(1-\mu^2)} \frac{\partial T}{\partial \lambda} - \frac{V}{a} \frac{\partial T}{\partial \mu} + R_T \right) P_n^m e^{-im\lambda} \, d\lambda \, d\mu \qquad (17)$$

$$\frac{\partial q_n^m}{\partial t} = \frac{1}{4\pi} \int_{-1}^{1} \int_0^{2\pi} \left( -\frac{U}{a(1-\mu^2)} \frac{\partial q}{\partial \lambda} - \frac{V}{a} \frac{\partial q}{\partial \mu} + R_q \right) P_n^m e^{-im\lambda} \, d\lambda \, d\mu \qquad (18)$$

$$\frac{\partial (\ln p_s)_n^m}{\partial t} = \frac{1}{4\pi} \int_{-1}^{1} \int_0^{2\pi} F_p P_n^m e^{-im\lambda} \, d\lambda \, d\mu \qquad (19)$$

where

$$R_T = P_T - \dot\sigma \frac{\partial T}{\partial \sigma} + \frac{\kappa T_v \omega}{(1 + (\delta - 1)q)p},$$

$$R_q = P_q - \dot\sigma \frac{\partial q}{\partial \sigma}$$

and

$$F_p = -\int_0^1 \tilde{D} \, d\sigma.$$

Note that equations (15)-(18) hold on each of the discrete levels in the vertical. The integrals in equations (15)-(19) can all be calculated by the quadrature techniques of §3.1
once the variables $F_u$, $F_v$, $G$, $R_T$, $R_q$, $U\,\partial T/\partial\lambda$, $V\,\partial T/\partial\mu$, $U\,\partial q/\partial\lambda$, $V\,\partial q/\partial\mu$ and $F_p$ are known in grid space. To compute these quantities requires the evaluation of parametrised terms, vertical derivatives and vertical integrals. The latter two are effected by a finite difference method and a standard quadrature method respectively. In addition to this we require the five basic variables and some horizontal derivatives. These are obtained by transforming them from spectral space. Finally we require gridpoint values of $U$ and $V$. These may be obtained from the spectral representation of $\xi$ and $D$ by the following relationships:

$$U = a \sum_{m=-M}^{M} \sum_{n=|m|,\, n \ne 0}^{N} \frac{1}{n(n+1)} \left( (1-\mu^2)\, \xi_n^m \frac{dP_n^m}{d\mu} - im\, D_n^m P_n^m \right) e^{im\lambda} \qquad (20)$$

$$V = -a \sum_{m=-M}^{M} \sum_{n=|m|,\, n \ne 0}^{N} \frac{1}{n(n+1)} \left( im\, \xi_n^m P_n^m + (1-\mu^2)\, D_n^m \frac{dP_n^m}{d\mu} \right) e^{im\lambda} \qquad (21)$$

which are of a similar form to the transforms required for the basic variables, and can be computed in the same way.

In terms of transforms we can view the method in this way: we are transforming from full spectral space to grid space via the intermediate Fourier-grid space, and then back again via the same path, performing the appropriate calculations in each space. This is illustrated schematically in Figure 1.

Once their time tendencies have been computed, the spectral coefficients of the basic variables can be integrated forward in time. This is done using a scheme which is semi-implicit for gravity wave terms (essentially the $\nabla^2 G$ term in the divergence equation) and horizontal diffusion (this has some physical basis but is largely a numerical fix to keep fields smooth). It is possible to exploit the ease of computing $\nabla^2$ in spectral space to produce a method which makes little difference to the overall cost of a timestep compared with an explicit method, but whose stability properties permit a significantly larger timestep to be used. See [3] for further details of this. The current model uses an $N = 213$ triangular truncation (often referred to as T213 in the literature), 31 levels in the vertical and a timestep of 20 minutes.
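The stability gain from implicit treatment of the fast terms can be illustrated on a single oscillation equation $dy/dt = i\omega y$, standing in for one gravity wave mode. The sketch below is our own toy example, not the scheme of [3]: the explicit leapfrog scheme commonly used for the slow terms in models of this kind is neutrally stable only for $\omega\,\Delta t \le 1$, while a trapezoidal (implicit) treatment is neutrally stable for any timestep.

```python
# Toy illustration (ours, not ECMWF code) of why implicit treatment of fast
# gravity-wave terms permits a longer timestep.  dy/dt = i*omega*y stands in
# for one gravity-wave mode; we compare the magnitude of the amplification
# factor for leapfrog (explicit) and trapezoidal (implicit) stepping.
import numpy as np

def leapfrog_amplification(omega_dt):
    # y_{n+1} = y_{n-1} + 2i*(omega*dt)*y_n; the amplification factors are
    # the roots of r**2 - 2i*(omega*dt)*r - 1 = 0.
    roots = np.roots([1.0, -2j * omega_dt, -1.0])
    return max(abs(r) for r in roots)

def trapezoidal_amplification(omega_dt):
    # y_{n+1} = y_n + i*(omega*dt)*(y_n + y_{n+1})/2
    return abs((1 + 0.5j * omega_dt) / (1 - 0.5j * omega_dt))

for omega_dt in [0.5, 1.0, 2.0, 6.0]:
    print(f"omega*dt = {omega_dt}: leapfrog |r| = "
          f"{leapfrog_amplification(omega_dt):.3f}, "
          f"trapezoidal |r| = {trapezoidal_amplification(omega_dt):.3f}")
```

For $\omega\,\Delta t \le 1$ both schemes give $|r| = 1$ (neutral); beyond that the leapfrog factor grows while the trapezoidal factor stays exactly 1, which is the property the semi-implicit scheme exploits for the gravity wave terms.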
    Spectral space
        | 1. Inverse Legendre transform        ^ 4. Legendre transform
        v                                      |
    Fourier-grid space
        | 2. Inverse Fourier transform         ^ 3. Fourier transform
        v                                      |
    Gridpoint space

Figure 1: Schematic illustration of spaces and transforms
The algorithm may be summarised in pseudocode as follows:

    for each timestep
        for each vertical level
            for each field
                for each latitude circle
                    for each Fourier coefficient
                        compute inverse Legendre transform sum to form new fields
                    end
                end
            end
            for each new field
                for each latitude circle
                    compute inverse Fourier transform (FFT) to form new fields
                end
            end
        end
        for each latitude circle
            for each longitude line
                for each vertical level
                    compute non-linear terms and vertical derivatives
                end
                compute vertical integrals and parametrised terms to form new fields
            end
        end
        for each vertical level
            for each new field
                for each latitude circle
                    compute Fourier transform (FFT) to form new fields
                end
            end
            for each new field
                for each Fourier coefficient
                    for each Legendre coefficient
                        compute Legendre transform integral to form new fields
                    end
                end
            end
        end
        for each Fourier coefficient
            for each Legendre coefficient
                compute fields at next timestep in vertical columns (semi-implicit)
            end
        end
    end
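The same structure can be written as a runnable skeleton. The sketch below is our own schematic, with invented array shapes and stub transform bodies, not the ECMWF Fortran; it only makes explicit which dimensions each phase traverses and the order in which the three spaces are visited.

```python
# Schematic one-timestep driver (ours; shapes, names and stubs are invented).
#   spec : triangular set of (m, n) coefficients per field and level
#   four : Fourier coefficients on each latitude circle
#   grid : latitude x longitude points
import numpy as np

N, nlev, nfld = 21, 31, 5                   # truncation, levels, fields (toy)
nlat, nlon = (3 * N + 1) // 2, 3 * N + 1

spec = np.zeros((nfld, nlev, N + 1, N + 1), dtype=complex)  # [m, n] triangle

def inv_legendre(spec):    # A: spectral -> Fourier-grid space (stub)
    return np.zeros((nfld, nlev, nlat, N + 1), dtype=complex)

def inv_fourier(four):     # B: Fourier-grid -> gridpoint (FFT per circle)
    return np.fft.irfft(four, n=nlon, axis=-1)

def grid_calculations(grid):   # C: non-linear terms, parametrisations (stub)
    return grid                # works column by column in the real model

def fourier(grid):         # D: gridpoint -> Fourier-grid (FFT per circle)
    return np.fft.rfft(grid, axis=-1)[..., :N + 1]

def legendre(four):        # E: Gaussian quadrature over latitude (stub)
    return np.zeros_like(spec)

def timestep(spec, tend):  # F: semi-implicit step in vertical columns (stub)
    return spec + tend

tend = legendre(fourier(grid_calculations(inv_fourier(inv_legendre(spec)))))
spec = timestep(spec, tend)
```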
3.3 Recent developments

ECMWF have made two recent changes to their algorithm, and are in the process of implementing them. The first of these is to reduce the number of gridpoints used. This is possible because in the original algorithm the number of gridpoints on each latitude circle is the same. In a recent study [2], it was found that the number of gridpoints can be reduced so that the East-West separation of gridpoints is more or less uniform over the whole sphere. This reduces the total number of points in grid space by about a third, with a very small impact on forecast quality. However, while this modification is clearly advantageous in an implementation on a small number of processors, it poses extra difficulties with load balancing for a large number of processors (see §4.2).

The second change is to implement a semi-Lagrangian treatment of the advection terms in the primitive equations. There is a stability constraint on the timestep which can be employed using the standard (Eulerian) treatment of the advection (fluid transport) terms. This is known as the Courant-Friedrichs-Lewy (CFL) condition and takes the form

$$\Delta t \le k \, \frac{\Delta x}{U_{max}}$$

where $\Delta t$ is the timestep, $\Delta x$ is the grid spacing (or effective grid spacing in the case of spectral models), $U_{max}$ is the maximum value of the velocity field (in modulus) and $k$ is some constant which depends on the numerical method being employed. Although the value of $k$ is larger for a semi-implicit method than an explicit one, the CFL condition still determines the maximum permissible timestep.

Eulerian methods for the advection term (in either gridpoint or spectral models) are based on the estimation of spatial derivatives at gridpoints. The local rate of change of a variable is then calculated as the sum of the products of these derivatives with the components of the wind field. Semi-Lagrangian methods adopt a different approach. For each gridpoint a fluid trajectory is estimated for the current timestep, ending up at the gridpoint. The change in the variable over the timestep is then given by the difference in the value of the variable between the ends of the trajectory. Usually the start of the trajectory does not lie on a gridpoint, so the value of the variable must be estimated by interpolation. In a spectral model the spectral representations of fields give readily available interpolants. Adoption of this technique has little impact on the accuracy of the solution, and costs more to implement. However the major gain is in the maximum permissible timestep, which can be as much as six times longer for the semi-Lagrangian method, representing a considerable overall saving in computational cost. The method adopted by ECMWF is based on that described in [8]. This method simply changes the way in which some terms are computed in grid space; it does not affect the overall structure of the algorithm.
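As a rough worked example, with our own illustrative numbers (taking $k = 1$ and a jet-stream wind of $U_{max} \approx 100\,\mathrm{m\,s^{-1}}$; neither figure is from the ECMWF documentation): at T213 there are $3N + 1 = 640$ points round the equator, so

$$\Delta x \approx \frac{2\pi a}{640} \approx 63\,\mathrm{km}, \qquad \Delta t \le k\,\frac{\Delta x}{U_{max}} \approx \frac{6.3 \times 10^4\,\mathrm{m}}{100\,\mathrm{m\,s^{-1}}} \approx 600\,\mathrm{s},$$

that is, an Eulerian timestep of the order of ten minutes; a semi-Lagrangian gain of up to six times then brings the 20 minute operational timestep comfortably within reach.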
3.4 Execution

The model is currently run on an 8 processor Cray Y-MP. Each of these processors is a high performance vector processor with a peak rate of 330 MFlops, giving a full machine peak performance of 2.66 GFlops. The current ECMWF model runs at a sustained 1.5 GFlops with a T213 resolution and 31 levels. The code is approximately 100,000 lines of Fortran. Assembler is used for a few important routines such as the fast Fourier transforms and matrix multiplications. The Fortran is highly vectorised (to the order of 99% of all floating point operations).

There is no real difficulty with the model in terms of available parallelism. The algorithm splits easily into a number of very large, evenly distributed, computationally intensive tasks (currently 320) which can be efficiently distributed on any standard shared memory machine with a small (even) number of processors. The algorithm in fact splits into many more large grained tasks (see §4.2). The problem in this case is more to do with data access, which is exacerbated by the limited `core' memory size of their Y-MP (512 Mbytes). In this algorithm most of the data must be held out of core, and thus access to it becomes the main problem in trying to keep the processors busy. ECMWF bought Cray's solid state device, the SSD (fast secondary storage), to help with this, but secondary storage is always slower than core memory. Achieving the resolution of T213 with 31 levels pushed the limit of the machine. The way ECMWF tackled the problem was to keep the spectral data in core and hold data from the other spaces (Fourier and grid point), along with pre-calculated data, in files. This in itself causes data access problems. They produced a multiple scanning technique; this technique and the organisation of the data were originally designed to take advantage of the multi-tasking Cray X-MP/22 and Cray X-MP/48. When moving to the Cray Y-MP the same organisation was used originally, but the algorithm has recently changed to incorporate the semi-Lagrangian treatment of advection. This strategy is described in detail in [1].

ECMWF have recently announced, jointly with Cray Research Inc., that they are upgrading their Y-MP to a C90 (this is effectively a Y-MP/16 with `souped up' processors). They will be receiving the first commercial C90 in the world, the delivery date being sometime in July 1992. As with the Y-MP, each processor is a high performance vector processor, but with a peak rate of 1 GFlops (one of the reasons for this is that the clock period has been reduced from 6 ns for the Y-MP to 4 ns for the C90). There are 16 processors in all, giving a peak performance of 16 GFlops. There is also more memory than the Y-MP (2 Gbytes), although the maximum amount per processor (which is what affects the ECMWF model) is the same. The total cost of the system is $30.5M. It is interesting to note that even Cray, the bastion of vector supercomputers, are very much looking at the potential of massively parallel machines for their future generations of supercomputers.
4 Parallel methods

4.1 Reasons for parallelism

Operational weather forecast centres such as ECMWF are constantly striving to improve the accuracy of their forecasts. At the present time the two major factors influencing the accuracy of the forecast are the resolution of the numerical model and the quality of the initial conditions. Since there is a real-time constraint on the duration of the forecast cycle, the resolution which can be used is limited by the power of the execution engine. Reducing the forecast execution time allows more time to be spent on the data analysis which produces the initial conditions. This would permit the use of more sophisticated algorithms, which would give more accurate initial conditions. Such centres are generally in the position of being able to dedicate to this application the most powerful machines available. However, since they are running essentially a single application, they will purchase a machine which has been shown to be the fastest reliable executor of their particular application. At the present time the machines which meet this specification are the shared memory vector processor machines with a modest (though increasing) number of processors. Although some massively parallel machines have a higher peak Megaflop rating, there are a number of reasons why these machines are not yet serious contenders for use at operational weather forecast centres. Among these are:
- Current NWP algorithms do not map well onto massively parallel SIMD machines, despite the high degree of data parallelism (see §4.2).

- On a MIMD machine the execution time of the model may be limited by factors other than the floating point performance of the processors themselves. These factors include communication bandwidth between processors, the amount of memory attached to each processor and the local memory access time. Although (as we shall see in §4.2) the spectral model exhibits a high degree of potential parallelism, most, if not all, current massively parallel machines cannot, for one reason or another, even approach their peak performance on a spectral NWP model.

- Programming massively parallel machines to obtain good performance is still relatively difficult, with the result that porting codes to them is rarely a simple task.

- Implementing 100,000 line Fortran codes efficiently on a massively parallel machine still requires considerable human effort and time.

- Many massively parallel machines are not yet sufficiently reliable, either in terms of hardware or system software, to be run on an operational basis.

- The average lifetime of manufacturers of massively parallel machines is not yet long enough for the financial risks involved in purchasing one to be acceptable.
Nevertheless it is generally felt that there are good prospects that the above difficulties will be overcome in the near future, and that implementing an operational weather forecast on a massively parallel machine will become a possibility. Thus considerable effort is being expended on the study of how to port NWP codes to these machines, in the anticipation that more robust, more easily programmable, and more powerful massively parallel machines will soon be available. For further discussion see [6]. In particular, the advent of virtual shared memory machines offers the possibility of achieving a very fast implementation without excessive programming effort. Kendall Square Research (KSR) produced a feasibility study for implementing the ECMWF code on such a physically distributed, single address space machine (the KSR1). This study confirms the authors' opinion that the ECMWF model is well suited to efficient implementation on such massively parallel systems (see §4.2). In any case it seems almost certain that NWP codes will continue to be run on parallel machines of one sort or another, and that the number of processors employed will continue to increase.
4.2 Potential for parallelism

Referring to the pseudocode description of the algorithm in §3.2, we see that the outermost loop is over timesteps. No parallelism can be exploited here, since the calculations for each timestep require knowledge of the current values of all the fields, which are not available until the completion of the previous timestep. Within each timestep we can identify six sections. These are:
A  Inverse Legendre transforms
B  Inverse Fourier transforms
C  Grid point calculations (including parametrisations)
D  Fourier transforms
E  Legendre transforms
F  Time stepping

These sections are sequentially dependent. Within each of these sections we can identify a large number of tasks which are independent apart from common read access to data, and can thus be computed in parallel. We can also identify six one-dimensional domains over which parallelism is possible in one or more of the sections A to F. These are:
i   Latitude circles
ii  Longitude lines
iii Fourier coefficients
iv  Legendre coefficients
v   Vertical levels
vi  Fields

Each section consists of a nest of loops over several of the above domains. Within each loop iterations are independent, and may be executed in parallel. Thus each section consists of a set of independent and similar tasks, the total number of which is given by the product of the number of iterations in each loop. The following table shows (to leading order) the number of iterations which are executed in each domain for each of the sections A to F. It also shows the total number of iterations and the type of task being executed. Here $N$ is the truncation number (as in §3.1), $n_v$ is the number of vertical levels, and $F_X$ is the number of fields being handled in section X. (Recall from §3.2 that these numbers are liable to change, but are in the region of 5 to 15.) A dash indicates that there is no loop over this domain in this section.

    Section  i     ii   iii   iv   v     vi    Total               Type of task
    A        3N/4  -    N     -    n_v   F_A   3N^2 n_v F_A / 4    Vector dot products
    B        3N/2  -    -     -    n_v   F_B   3N n_v F_B / 2      3N-point FFTs
    C        3N/2  3N   -     -    -     -     9N^2 / 2            Grid point calculations and parametrisations
    D        3N/2  -    -     -    n_v   F_D   3N n_v F_D / 2      3N-point FFTs
    E        -     -    N/2   N    n_v   F_E   N^2 n_v F_E / 2     Vector dot products
    F        -     -    N     N    -     -     N^2                 Matrix-vector products
Note that the numbers of iterations in section A over latitude circles and in section E over Fourier coefficients are a factor of 2 smaller than might be expected. This is because symmetry properties of the Legendre transform and its inverse allow pairs of coefficients to be calculated together (see [9]).

Each section can be seen to be composed of an iteration space with between two and four dimensions. The tasks which comprise these iteration spaces are themselves, in principle, parallelisable. However there are very good reasons for not exploiting parallelism down at this level. Tasks such as vector dot products and one-dimensional FFTs are relatively small, and do not exhibit such a high degree of data parallelism as is present in the sections at a higher level. This means that the overheads (for example communication in a distributed memory system, or task set-up and synchronisation in a shared memory system) incurred in parallelising them would be much larger than those for parallelising at a higher level. The gridpoint calculations in section C are more computationally expensive tasks. However the algorithms used do not have any particularly obvious parallelism within them, so again it will be preferable to exploit parallelism at the highest level in this section, since there are $O(N^2)$ independent and very similar tasks to be performed.

The sizes of the iteration spaces are all large: with $N = 213$ and $n_v = 31$ they range from about $4 \times 10^4$ (Section F) to $10^7$ (Section A). Without increases in $N$ and $n_v$ it will be a few years before the number of floating point processors in a massively parallel MIMD machine will approach the smaller figure. Until this occurs we can say that there is ample readily exploitable parallelism at these higher levels, and there is no advantage to be gained from looking at finer grained parallelism in this algorithm.

Solving the problem of how to parallelise the algorithm therefore reduces to deciding how best to decompose the iteration space for each of the six sections. This is not a straightforward task. It is important to note here that the code requires a very large amount of memory (about 600 Mbytes for $N = 213$, $n_v = 31$), with the result that the current implementation does not fit in core on the Cray Y-MP. In this case the issue of choosing the best decomposition is complicated by the need to minimise the latencies involved in accessing out-of-core data. However, provided that the amount of main memory of future machines increases linearly with their floating point performance, this should cease to be so much of a problem, since the total number of flops required to run a forecast is $O(N^3)$ while the memory requirement is only $O(N^2)$.

Aside from out-of-core access difficulties, there are two main considerations when choosing a decomposition of the iteration space. The first is load balancing: we would like as far as possible to divide the iteration space into units that require the same amount of time to execute, otherwise some processes will be idle while others are completing. We can summarise the load balancing properties of the different one-dimensional domains for each section as follows:
Section A (inverse Legendre transforms)
  i (latitude circles): Good.
  iii (Fourier coefficients): Poor; the length of the vector dot products is N - |m|, where m is the Fourier coefficient index.
  v (vertical levels): Fairly good; there are some extra fields carried at the lowest vertical level.
  vi (fields): Fair; the 5 basic variables require less work than other fields.

Section B (inverse Fourier transforms)
  i (latitude circles): Good.
  v (vertical levels): As for section A.
  vi (fields): As for section A.

Section C (grid point calculations)
  i (latitude circles): Fair; parametrisation algorithms take about 15% longer near the Equator than they do near the poles. (Poor if the reduced grid is used.)
  ii (longitude lines): Good.

Section D (Fourier transforms)
  As for section B.

Section E (Legendre transforms)
  iii (Fourier coefficients): As for section A.
  iv (Legendre coefficients): Good.
  v (vertical levels): As for section A.
  vi (fields): As for section A.

Section F (time stepping)
  iii (Fourier coefficients): Good.
  iv (Legendre coefficients): Good.

The above considerations assume that the number of processes divides the number of iterations in the domain in a reasonably `nice' way; of course this isn't always going to be the case. Poor load balancing properties don't prevent parallelisation over a particular domain; they just mean that the decomposition of iteration space might have to be irregular.

The second consideration is parallel overheads. The form these take depends on the type of machine. For shared memory machines it is the task start-up and synchronisation costs, while on a distributed memory machine it is the communication costs. (Note: on a virtual shared memory machine both sorts of overhead are present, even though the communication is hidden from the programmer.) These costs cannot be assessed independently for each section, as they will depend on how the previous section has been decomposed. Determining an optimal decomposition also depends on the number of processors and on other machine properties, such as the interconnection topology of a distributed memory machine. This makes the issue a difficult one, where the availability of suitable software tools would be of great assistance. For a small number of processors (as on a Cray) load balancing considerations will usually dominate, but parallel overheads become more important as the number of processors increases. The algorithm does not appear to admit any obvious decompositions which clearly minimise overheads. This can be attributed mainly to the fact that there is no locality of reference present in the algorithm. Since we are continually switching between spectral space and grid space, there is no natural geometric decomposition for the whole algorithm. On a distributed memory machine, for example, matrix transpose type communication between some sections is unavoidable. In summary we can say that the algorithm exhibits a high degree of parallelism, but that exploiting this with maximum efficiency is not straightforward.
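To put concrete numbers on the task counts discussed in this section, the following sketch (ours; the per-section field counts $F_X$ are set to 10 for illustration, within the 5 to 15 range quoted above) evaluates the `Total' column of the table at the operational resolution:

```python
# Iteration-space sizes from the table in section 4.2, evaluated at the
# operational resolution N = 213, nv = 31.  F = 10 is our illustrative choice
# for the per-section field counts F_X (the text says 5 to 15).
N, nv, F = 213, 31, 10

totals = {
    "A (inverse Legendre transforms)": 3 * N**2 * nv * F / 4,
    "B (inverse Fourier transforms)":  3 * N * nv * F / 2,
    "C (grid point calculations)":     9 * N**2 / 2,
    "D (Fourier transforms)":          3 * N * nv * F / 2,
    "E (Legendre transforms)":         N**2 * nv * F / 2,
    "F (time stepping)":               N**2,
}
for section, tasks in totals.items():
    print(f"{section:34s} {tasks:10.2e} independent tasks")
# Section F gives about 4.5e4 tasks and section A about 1.1e7, matching the
# range of 4 x 10^4 to 10^7 quoted in the text.
```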
4.3 Other Methods

Although at present the spectral technique seems to be winning out over gridpoint methods in the meteorological community, the situation in the future is not clear. In the asymptotic limit of a large number of processors, gridpoint methods retain certain advantages over spectral methods, such as optimal parallel complexity and the possibility of using adaptive grids. For further discussion see [7]. It is therefore possible that a return to gridpoint methods might be considered worthwhile at some time in the future.

Another technique currently of interest is the idea of running a number of forecasts, each with initial conditions slightly perturbed from those originally assimilated. The similarity (or lack of it) between these forecasts gives an indication of the likely accuracy of that particular forecast, and when the atmosphere is in an unstable state may also give possible alternative evolutions. This is a version of the so-called Monte Carlo technique. Separate forecasts in this method are completely independent and could therefore be executed in parallel. With the limited maximum number of processors on traditional vector supercomputers, new machines would have to be bought to implement this technique, whereas with a massively parallel machine more processors could be added to the original system at a relatively small cost.
5 Conclusions

Numerical weather forecasting is, and will continue to be, an important supercomputer application. So far, modestly parallel high-speed vector machines have proved to be the most efficient execution engines for this type of problem. Nevertheless it is still generally expected that this role will pass to massively parallel machines in the future. We have seen that the present spectral technique is indeed suitable for implementation on such machines, with the caveat that considerable care must be taken to avoid unnecessary overheads. The ability of massively parallel machines to execute an in-core solution will alleviate a major proportion of the programming difficulties currently encountered. At present ECMWF are directing their efforts at their current machine and its successor, the Cray Y-MP C90. However it seems likely that a massively parallel implementation will eventually prove attractive, given a machine that will meet the requirements for operational use, and that at least in the first instance they will attempt a fairly direct port of their code to such a machine, since the current algorithm is suitable for this purpose.
References

[1] Dent, D. (1990) The ECMWF model on the Cray Y-MP 8. ECMWF Internal Document.
[2] Hortal, M. and A.J. Simmons (1990) Use of reduced Gaussian grids in spectral models. ECMWF Technical Memo. No. 168.
[3] Hoskins, B.J. and A.J. Simmons (1975) A multi-layer spectral model and the semi-implicit method. Quart. J. R. Met. Soc. 101, 637-655.
[4] Jarraud, M. and A.J. Simmons (1984) The spectral technique. Proceedings of ECMWF Seminar on Numerical Methods for Weather Prediction, September 1983, Vol. 2, 1-59.
[5] Kasahara, A. (1977) Computational aspects of numerical models for weather prediction and climate simulation. Methods in Computational Physics 17, 2-60.
[6] Kauranne, T. (1990) An introduction to parallel processing in meteorology. In Topics in Atmospheric and Oceanic Sciences, Springer-Verlag, Berlin.
[7] Kauranne, T. (1990) Asymptotic parallelism of weather models. In Topics in Atmospheric and Oceanic Sciences, Springer-Verlag, Berlin.
[8] Ritchie, H. (1988) Application of the semi-Lagrangian method to a spectral model of the shallow water equations. Monthly Weather Review 116, 1587-1598.
[9] Simmons, A.J. and D. Dent (1989) The ECMWF multi-tasking weather prediction model. Computer Physics Reports 11, 165-194.