USING PPP TO PARALLELIZE OPERATIONAL WEATHER FORECAST MODELS FOR MPPS

Mark W. Govett, Adwait Sathye, James P. Edwards, Leslie B. Hart
NOAA Forecast Systems Laboratory
325 Broadway, Boulder, Colorado 80303
[email protected]

ABSTRACT

The Parallelizing Preprocessor (PPP) is being developed at the Forecast Systems Laboratory (FSL) to simplify the process of parallelizing operational weather prediction models for Massively Parallel Processors (MPPs). PPP, a component of FSL's Scalable Modeling System, is a Fortran 77 text analysis and translation tool. PPP directives, implemented as Fortran comments, are inserted into the source code. This code can be run as the legacy software or be processed by PPP whenever a parallel version of the code is required. PPP is being used to parallelize the United States Naval Research Laboratory's (NRL's) Coupled Ocean/Atmosphere Mesoscale Prediction System (COAMPS). This paper describes the PPP tool and presents several code examples highlighting its use.

1. INTRODUCTION

The Scalable Modeling System (SMS) was designed and built by NOAA's Forecast Systems Laboratory (FSL) to simplify the parallelization of numerical weather prediction (NWP) models. SMS is composed of a layered set of three high level software libraries that provide portability, offer good performance, and improve the programmability of NWP models targeted for Massively Parallel Processors (MPPs): the Nearest Neighbor Tool (NNT), the Scalable Run-time System (SRS), and the Parallelizing Preprocessor (PPP). At the lowest layer, the SRS library provides the low level I/O services necessary to support distributed memory processing (Hart et al., 1995). The NNT library, layered above the Message Passing Interface (MPI) communications software (Message Passing Interface Forum, 1995) and SRS, encapsulates many of the complexities associated with parallelizing finite difference approximation models, including data decomposition, interprocessor communication, and I/O (Rodriguez et al., 1995). Models parallelized using the SMS/NNT library, including the Regional Atmospheric Modeling System (RAMS) and the Rapid Update Cycle (RUC), are completely portable and have demonstrated good performance and speedup on a variety of shared and distributed-memory MPPs, including the Intel Paragon, SGI Origin, Cray T3E, IBM SP2, nCUBE, and workstation clusters (Baillie and MacDonald, 1993; Rodriguez et al., 1995; Edwards et al., 1996). Because the code is portable across MPPs, development costs are reduced: only one version of the code must be maintained.
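To make concrete what NNT hides from the model developer, the sketch below shows a hand-coded MPI halo (ghost-cell) exchange for a one-dimensionally decomposed array. It is an illustration written for this discussion, not FSL or NNT code; the array name, sizes, and message tags are arbitrary.

      PROGRAM HALO
C     Illustrative only: the kind of hand-written MPI halo exchange
C     that a single NNT exchange call replaces.
      INCLUDE 'mpif.h'
      INTEGER NLOC
      PARAMETER (NLOC = 100)
      REAL A(0:NLOC+1)
      INTEGER IERR, RANK, NPROC, LEFT, RIGHT
      INTEGER STATUS(MPI_STATUS_SIZE)
      CALL MPI_INIT(IERR)
      CALL MPI_COMM_RANK(MPI_COMM_WORLD, RANK, IERR)
      CALL MPI_COMM_SIZE(MPI_COMM_WORLD, NPROC, IERR)
      LEFT  = RANK - 1
      RIGHT = RANK + 1
      IF (LEFT .LT. 0)      LEFT  = MPI_PROC_NULL
      IF (RIGHT .GE. NPROC) RIGHT = MPI_PROC_NULL
C     ... compute the interior points A(1:NLOC) ...
C     Send the last interior point right; receive the left ghost point.
      CALL MPI_SENDRECV(A(NLOC), 1, MPI_REAL, RIGHT, 1,
     &                  A(0),    1, MPI_REAL, LEFT,  1,
     &                  MPI_COMM_WORLD, STATUS, IERR)
C     Send the first interior point left; receive the right ghost point.
      CALL MPI_SENDRECV(A(1),      1, MPI_REAL, LEFT,  2,
     &                  A(NLOC+1), 1, MPI_REAL, RIGHT, 2,
     &                  MPI_COMM_WORLD, STATUS, IERR)
      CALL MPI_FINALIZE(IERR)
      END

Every decomposed array that participates in a stencil computation needs this bookkeeping at every exchange point, which is why encapsulating it behind NNT, and ultimately behind PPP directives, pays off.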

Although the development of SRS and NNT has reduced the effort required to parallelize forecast models, the process is still tedious, time consuming, and error prone. Further, code changes resulting from continuing model development are frequently made by developers on the sequential version. Integrating these changes into an MPP ready version can require a significant re-parallelization effort. These factors have motivated the development of the third component of SMS, a high level language preprocessor called PPP.

2. PPP

PPP is being developed at FSL to further simplify and speed the process of parallelizing weather prediction models for Massively Parallel Processors by automating many of the programming tasks that parallelization requires. Directives, in the form of Fortran comments, are inserted into the sequential code by the user and subsequently processed by PPP whenever a parallel version is required. Because the directives are comments, the code can still be run in its sequential form or be easily converted into its parallel counterpart using PPP. Parallelization using PPP is a two step process. First, the user must carefully study the code to determine where directives should be placed for efficient execution. Once directives are in place, PPP can be run on this code to generate MPP ready Fortran source. As illustrated in Figure 2.1, only a single source for both sequential and parallel execution needs to be maintained. This version can continue to be developed and tested by the modeler on a local workstation, yet can easily be parallelized by PPP and run on a high performance MPP when needed.
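The reason a single source suffices is purely syntactic: a PPP directive begins with "C" in column 1, so under Fortran 77 fixed-form rules it is an ordinary comment. The fragment below is a minimal sketch of our own (not taken from any FSL model) showing an annotated routine that a standard compiler accepts unchanged while PPP recognizes the directive and expands it.

      SUBROUTINE STEP(D, N, M)
C     Minimal illustration: the CSMS$ line below starts with "C" in
C     column 1, so a sequential Fortran 77 compiler ignores it, while
C     PPP expands it into the corresponding SMS/NNT library calls.
      INTEGER N, M
      REAL D(N,M)
CSMS$EXCHANGE (D)
      RETURN
      END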

[Figure 2.1: PPP parallelization. The original source code, annotated with PPP directives, either compiles directly into a sequential executable or is processed by PPP into an MPP ready executable; only a single source must be maintained for sequential or parallel execution.]

PPP generated code relies on calls to the SMS/NNT library to accomplish common tasks required to operate in a distributed memory environment. PPP directives have been developed to address data movement, data decomposition, address translation, memory management, nesting, and reduction operations, and to define areas of parallel computation. Directives to address I/O code transformations are in progress. A complete listing of the PPP directives can be found at the FSL Advanced Computing Group's web site (http://www-ad.fsl.noaa.gov/ac).
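As a concrete illustration of one of these categories, the fragment below shows the kind of hand-coded global reduction that a reduction directive spares the developer from writing: each processor sums its local portion of a decomposed field and the partial sums are combined across processors with MPI. The routine and its names are ours, written for illustration; they are not PPP output or part of the SMS libraries.

      SUBROUTINE GLOBAL_SUM(X, NLOC, TOTAL)
C     Illustrative only: a hand-coded global sum of the sort a PPP
C     reduction directive is intended to generate automatically.
      INCLUDE 'mpif.h'
      INTEGER NLOC, I, IERR
      REAL X(NLOC), PART, TOTAL
      PART = 0.0
      DO 10 I = 1, NLOC
        PART = PART + X(I)
   10 CONTINUE
      CALL MPI_ALLREDUCE(PART, TOTAL, 1, MPI_REAL, MPI_SUM,
     &                   MPI_COMM_WORLD, IERR)
      RETURN
      END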

3. PPP CODE EXAMPLES

Three examples will be presented illustrating code transformations using PPP. In Example 3.1a we present a code segment highlighting interprocessor communication (data exchange) and data decomposition.

 1  CSMS$DECLARE_DECOMP(decomp, LX, LY, LZ) [1]
 2  CSMS$CREATE_DECOMP(decomp, LX, LY, LZ) [1]
 3  CSMS$PAR_REGION(decomp, I, J)
 4        DO I = 2, N-1
 5          DO J = 2, M-1
 6            D(I,J) = B(I+1,J) + B(I,J-1) + C(I,J)
 7          ENDDO
 8        ENDDO
 9  CSMS$END_PAR_REGION
10  CSMS$EXCHANGE (D)

Example 3.1a: Loop nesting code segment

The PPP directives CSMS$DECLARE_DECOMP and CSMS$CREATE_DECOMP are used in Example 3.1a to define and initialize all of the data structures associated with data decomposition, including address translation arrays, structures to accomplish efficient interprocessor communication, and structures to support parallel I/O. Lines 1 and 2 declare the variable decomp, which is used to access the decomposed data dimensioned LX, LY, LZ. The bulk of the code (lines 4 through 8), bounded by the directives CSMS$PAR_REGION and CSMS$END_PAR_REGION, describes the region where parallel computations are done. PPP output results in the following code:

 1  CCSMS$DECLARE_DECOMP(decomp, LX, LY, LZ) [2]
 2  CCSMS$CREATE_DECOMP(decomp, LX, LY, LZ) [2]
 3  CCSMS$PAR_REGION(decomp, I, J)
 4        DO I = decomp_s1(2, 0, 0), decomp_e1(N-1, 0, 0)
 5          DO J = decomp_s2(2, 0, 0), decomp_e2(M-1, 0, 0)
 6            D(I,J) = B(I+1,J) + B(I,J-1) + C(I,J)
 7          ENDDO
 8        ENDDO
 9  CCSMS$END_PAR_REGION
10  CCSMS$EXCHANGE (D)
11        call nnt_define_exch(decomp, decomp_xh, nnt_real, 1, decomp_status)
12        call nnt_clear_all_vars(decomp_xh, decomp_status)
13        call nnt_assign_var(decomp_xh, 1, d, nnt_do_all, decomp_status)
14        call nnt_exchange(decomp_xh, decomp_status)

Example 3.1b: PPP processed loop nesting code

[1] Some parameters for the decomposition directives have been omitted.
[2] Output from this directive contains 30+ lines of source and has been omitted.
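The generated loops obtain their bounds from translation references such as decomp_s1 and decomp_e1. As a rough sketch of what such a translation accomplishes for a simple one-dimensional block decomposition, the routine below clips a global loop range to the portion owned by one processor; it is written for illustration only and does not reproduce the argument conventions or halo handling of the real decomp_s1/decomp_e1 structures.

      SUBROUTINE LOCAL_RANGE(GSTART, GEND, NPTS, NPROC, RANK,
     &                       LSTART, LEND)
C     Illustrative only: clip the global loop range GSTART..GEND to the
C     block of a 1-D decomposition owned by processor RANK (numbered
C     from 0).  NPTS global points are divided as evenly as possible
C     over NPROC processors; the result is in global index numbering.
      INTEGER GSTART, GEND, NPTS, NPROC, RANK, LSTART, LEND
      INTEGER CHUNK, LO, HI
      CHUNK  = (NPTS + NPROC - 1) / NPROC
      LO     = RANK*CHUNK + 1
      HI     = MIN(NPTS, LO + CHUNK - 1)
      LSTART = MAX(GSTART, LO)
      LEND   = MIN(GEND, HI)
      RETURN
      END

A loop written sequentially as DO I = 2, N-1 then runs over the clipped range on each processor, which is the effect of the decomp_s1/decomp_e1 references in Example 3.1b.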

Notice that the code changes in Example 3.1b are limited to replacing do-loop indices with references to index translation vectors (decomp_s1, decomp_e1 in the i loop) and inserting SMS/NNT library calls for the exchange required to update D (line 10). Automatic translation of the do-loop start and stop fields for the i and j loops (lines 4 and 5) is required for those dimensions in which arrays B, C, and D are decomposed (in this case dimensions 1 and 2), so that each processor can operate on its local copy of these arrays. The next two examples highlight common tasks in the COAMPS model that can be tedious and error-prone to parallelize by hand but are easily transformed using PPP. The first uses CSMS$FIXED_INDEX, a directive designed for global boundary initialization on decomposed arrays. Example 3.2a initializes wflux in the diffusion routine of COAMPS:

 1      do k=1,kk
 2        do i=2, len
 3          wflux(i,2,k) = ((u(i-1,2,k-1)+u(i,2,k-1)-u(i-1,2,k)-u(i,2,k))*work1(i,2)
 4   +                    + (v(i,1,k-1)+v(i,2,k-1)-v(i,1,k)-v(i,2,k))*work2(i,2))*factor
 5        enddo
 6      enddo

Example 3.2a: COAMPS boundary initialization for a diffusion routine

Parallelization of this code involves defining a parallel region over the i loop and using the CSMS$FIXED_INDEX directive to treat the second array index of the wflux, u, v, and work arrays as a global value.

 1      do k=1,kk
 2  CCSMS$PAR_REGION (dh, i, j)
 3  CCSMS$FIXED_INDEX (2)
 4        do i = dh__S1(2,0,1), dh__E1(len,0,1)
 5          do j1 = dh__S2(2,0,1), dh__E2(2,0,1)
 6            wflux(i,j1,k) = ((u(i-1,j1+2-(2),k-1)+u(i,j1+2-(2),k-1)-u(i-1,j1+2-(2),k)
 7   +                        -u(i,j1+2-(2),k))*work1(i,j1+2-(2)) + (v(i,j1+1-(2),k-1)
 8   +                        +v(i,j1+2-(2),k-1)-v(i,j1+1-(2),k)-v(i,j1+2-(2),k))*work2(i,j1+2-(2)))*factor
 9          enddo
10        enddo
11  CCSMS$END_PAR_REGION
12      enddo

Example 3.2b: PPP parallelized diffusion computation

The PPP parallelization, shown in Example 3.2b, replaces the start and stop fields on the i loop (line 4) with translation arrays and generates a second do-loop nest on line 5 using the generated variable j1. The j1 loop start and stop parameters permit execution of lines 6 through 8 only for the processors that contain the second column of wflux. Notice how the right hand side array index references in the second dimension can become complex, since they must use offsets relative to the do-loop variable j1.
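The subscript pattern can be read as follows (this is our reading of the generated code above, not a statement of the PPP specification): within the CSMS$FIXED_INDEX(2) region the global column 2 is reached through the generated loop variable j1, so a reference whose second subscript is the global constant c is rewritten as j1 + c - (2). In particular:

C     Sequential reference        generated reference (our reading)
C       u(i, 2, k)        --->      u(i, j1+2-(2), k)   i.e. u(i, j1,   k)
C       v(i, 1, k)        --->      v(i, j1+1-(2), k)   i.e. v(i, j1-1, k)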

 1      PARAMETER (MN = M*N)
 2      DIMENSION wflux(M,N,KK)
 3      do 10 k = 1, KK
 4        do 10 i = 1, MN
 5          wflux(i,1,k) = 0.0
 6   10 continue

Example 3.3a: Loop collapsing is used to improve performance on vector machines.

Example 3.3a illustrates a common technique for improving performance on vector machines: collapsing loop nests to create longer vectors. While some programmers will redefine these arrays to reflect the reduced number of array dimensions, it is not required.

 1      do 10 k=1,kk
 2  CCSMS$PAR_REGION (dh, i, j)
 3  CCSMS$UNCOLLAPSE (, )
 4        do 10 j1 = dh__S2(1,0,1), dh__E2(n,0,1)
 5          do 10 i = dh__S1(1,0,1), dh__E1(m,0,1)
 6            wflux(i,j1+1-1,k) = 0.0
 7   10 continue
 8  CCSMS$END_PAR_REGION

Example 3.3b: PPP processing resolves loop collapsing in COAMPS.

Fortran allows wflux, dimensioned (M,N,KK), to be accessed as wflux(i,1,k), where i ranges from 1 to M*N. While these loops offer good performance on vector computers, they must be converted back to the original references for efficient parallelization. The CSMS$UNCOLLAPSE directive, shown in Example 3.3b, creates a new do-loop (line 4) operated on by the generated variable j1 and modifies the start and stop indices of the original i loop (line 5). Assignment statements within the scope of this directive are also modified.
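The correspondence that uncollapsing must recover follows from Fortran's column-major storage order: the collapsed reference wflux(i,1,k), with i running from 1 to M*N, names the same storage as wflux(i2,j2,k) under the index mapping sketched below. The routine is our own illustration of that arithmetic, not PPP or SMS code.

      SUBROUTINE UNCOLLAPSE_IX(I, M, I2, J2)
C     Illustrative only: map a collapsed index I (1..M*N) over the first
C     two dimensions of an array dimensioned (M,N,...) back to the 2-D
C     subscripts (I2,J2), using Fortran column-major storage order.
      INTEGER I, M, I2, J2
      I2 = MOD(I-1, M) + 1
      J2 = (I-1)/M + 1
      RETURN
      END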

4. CONCLUSION

We are developing a directive based tool (PPP) to simplify the process of parallelizing Fortran 77 code by automating many of the programming tasks required for code parallelization. PPP improves the programmability of the SMS system while continuing to provide fully portable, high performance code for use on a wide variety of MPP systems. We have described a number of directives that have been developed and presented three examples of their use in parallelizing the COAMPS model. PPP processing of these directives produces the Fortran 77 and NNT code required for parallelization, including data movement, data decomposition, local and global address translation, memory management, and reduction operations. In addition, directives are being developed to support the parallel I/O routines provided by the NNT library.

5. REFERENCES

Baillie, C.F., and A.E. MacDonald, 1993: Porting the Well Posed Topographical Meteorological Model to the KSR Parallel Supercomputer. Parallel Supercomputing in Atmospheric Sciences, pp. 26-35.

Edwards, J., J. Snook, and Z. Christidis, 1996: Use of a Parallel Mesoscale Model in Support of the 1996 Summer Olympic Games. Proceedings of the Seventh ECMWF Workshop on the Use of Parallel Processors in Meteorology.

Hart, L., T. Henderson, and B. Rodriguez, 1995: An MPI Based Scalable Runtime System: I/O Support for a Grid Library. Proceedings of the MPI Developers Conference, University of Notre Dame.

Message Passing Interface Forum, 1995: MPI: A Message-Passing Interface Standard. Journal of Supercomputing Applications, Vol. 8, No. 3/4.

Rodriguez, B., L. Hart, and T. Henderson, 1995: Parallelizing Operational Weather Forecast Models for Portable and Fast Execution. Journal of Parallel and Distributed Computing, Vol. 37, 159-170.