Modernization of Legacy Application Software

Jeffrey Howe (1), Scott B. Baden (1), Tamara Grimmett (2), and Keiko Nomura (2)

(1) Department of Computer Science and Engineering, University of California, San Diego,
    9500 Gilman Drive, La Jolla CA 92093-0114, USA
    {baden, [email protected]}
    http://www-cse.ucsd.edu/users/baden

(2) Department of Applied Mechanics and Engineering Sciences, University of California, San Diego,
    9500 Gilman Drive, La Jolla CA 92093-0411, USA
    {tgrimmet, [email protected]}
Abstract. Legacy application software is typically written in a dialect of Fortran and must be reprogrammed to run on today's microprocessor-based multicomputer architectures. We describe our experiences in modernizing a legacy direct numerical simulation (DNS) code with the KeLP software infrastructure. The resultant code runs on the IBM SP2 at higher numerical resolutions than were possible with the legacy code running on a vector mainframe.

1 Introduction

Legacy application software is typically written in a dialect of Fortran and optimized for execution on a vector-class architecture like the Cray C90 or T94. Such codes often run on just one processor and must be reprogrammed to run on today's microprocessor-based multicomputer architectures like the IBM SP2 or Cray T3E. A principal motivation for modernizing legacy code is increased memory capacity. For example, the Cray T94 at the San Diego Supercomputer Center has only about 2 Gigabytes of memory, but just 8 nodes of SDSC's 128-node IBM SP2 provide this same memory capacity (see http://www.npaci.edu/Resources/Systems/compute.html).

We will discuss our experiences in modernizing a legacy application code, called DISTUF [13, 7], using the KeLP software infrastructure developed by Fink and Baden [5]. The resultant code, called KDISTUF, can run at a higher numerical resolution than the legacy code, by virtue of increased memory and computational capacity. We discuss our methodology for parallelizing this application and present performance measurements on the IBM SP2.

2 Preliminaries

We first describe the starting point for our study: the DISTUF code. We then give a brief overview of the KeLP infrastructure, which we used to convert DISTUF to KDISTUF.


2.1 DISTUF

DISTUF performs direct numerical simulations (DNS) of incompressible homogeneous sheared and unsheared turbulent flows. The significance of DNS is the elimination of turbulence closure models, thus allowing recovery of the fundamental physics directly from the simulation results. Complete resolution of all relevant scales is therefore required, and the total number of computational grid points needed is proportional to the 9/4 power of the Reynolds number (a flow parameter which effectively describes the ratio of the large scales to the small scales of turbulence).

DISTUF solves the three-dimensional, time-dependent Navier-Stokes, continuity, and energy (passive scalar) equations. The equations are discretized in an Eulerian framework using a second-order finite-difference method on a staggered grid. The Adams-Bashforth scheme is used to integrate the equations in time. Pressure is treated implicitly; DISTUF employs a fast Poisson solver [14] which combines fast Fourier transforms and Gaussian elimination to solve for the pressure field. Boundary conditions are periodic.

The data structure for DISTUF consists of a single logical 3-dimensional grid. Each point on the grid carries 9 double-precision field variables: the vector-valued velocity, scalar temperature and pressure, and 4 temporaries. DISTUF is divided into two computational phases: (1) a finite-difference computation to solve advection and diffusion equations for the velocity and temperature terms, and (2) a Fourier method to solve the Poisson equation. The computed velocity values from the first phase are then updated by the pressure solution to give the final velocities at the new time step.

DISTUF is written in Fortran 77 and has evolved over a period of 14 years. The original code was roughly 10,000 lines long, including comments written in both English and German. We extracted a subset of this code, removing a post-processing phase that computed various statistics of the flow. The code we worked with was about 5,300 lines long excluding comments.

Nomura and co-workers have previously employed DISTUF in studies of the structure and dynamics of small-scale turbulence in sheared and unsheared flows [10-12]. A typical "large-scale run" employs a grid resolution of 128^3 and runs for 5.0 to 10.0 dimensionless time units (at 128 time steps per time unit), consuming 1.5 to 3.0 CPU hours on a Cray YMP. However, resolution is limited. Of critical importance to studies of small-scale turbulence is the ability to analyze flows at higher Reynolds number Re. As mentioned, the number of computational points required for DNS is proportional to Re^(9/4); any significant increase in Reynolds number would cause computation and memory costs to increase beyond the capacity of current vector-class supercomputers. For example, the Cray T94 only has sufficient memory to support 256^3 resolution, whereas an IBM SP2 with 256 Megabytes of memory per node could run such a problem on just a small fraction of the machine (16 processors). Only the SP2 could run at 512^3 resolution.
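To make this memory argument concrete, the back-of-the-envelope sketch below (ours, written in C++) estimates the storage needed at several resolutions. It counts only the nine double-precision field variables per grid point mentioned above and ignores temporaries and solver workspace, so actual usage is somewhat higher.

#include <cstdio>

// Rough memory demand of the DISTUF data structure: 9 double-precision
// field variables at every point of an N x N x N grid (temporaries and
// solver workspace are ignored, so real usage is somewhat higher).
static double gigabytes(long n) {
    const double fields = 9.0, bytes_per_double = 8.0;
    return fields * bytes_per_double * n * n * n / (1024.0 * 1024.0 * 1024.0);
}

int main() {
    const long resolutions[] = {128, 256, 384, 512};
    for (long n : resolutions)
        std::printf("N = %4ld  ->  about %6.2f GB for the field variables\n",
                    n, gigabytes(n));
    // N = 128 needs roughly 0.14 GB, N = 256 roughly 1.1 GB (already near the
    // T94's ~2 GB once workspace is added), and N = 512 roughly 9 GB, which is
    // only feasible by aggregating the memory of many multicomputer nodes.
    return 0;
}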

2.2 The KeLP Infrastructure

KeLP is a publicly-available C++ class library that enables the rapid development of irregular and block-structured applications on MIMD multicomputers. It provides a middle-level layer sitting on top of MPI that simplifies global-to-local mappings and communication optimizations. KeLP applications have been shown to deliver performance comparable with equivalent MPI codes, including stencil-based methods and the FFT, both of which are employed in DISTUF [4]. KeLP provides a simple machine-independent model of execution. It runs on multicomputers such as the IBM SP2, Cray T3E, and SGI-Cray Origin 2000; on workstation clusters; and on single- and multiple-processor workstations. KeLP defines only 7 new data types plus a small number of primitives to manipulate them, and the casual programmer requires only basic knowledge of most of them. Space limits us to only a brief discussion of KeLP; additional details may be found at the KeLP web page at http://www-cse.ucsd.edu/groups/hpcl/scg/kelp.

KeLP supports a task-parallel model which isolates parallel coordination activities from numerical computation. This simplifies the design of application software and promotes software re-use. In particular, the KeLP programmer may employ existing serial numerical kernels with known numerical properties, and leverage mature compiler technology. These kernels may be written in Fortran 90, Fortran 77, or any other language the programmer chooses.

KeLP provides two types of coordination abstractions: structural abstraction and communication orchestration. Structural abstraction permits the programmer to manipulate geometric sets of points as first-class language objects and includes a calculus of geometric operations. Data layouts across processors may be managed at run time, which facilitates performance tuning. Structural abstraction simplifies the expression of global-to-local mappings and is also used to coordinate and optimize communication, a process known as communication orchestration. Communication orchestration permits the programmer to encapsulate details concerning data motion; data dependencies are expressed in high-level geometric terms, freeing the user from having to manage the details of message-passing activity (a small illustrative sketch appears at the end of this subsection).

KeLP has a minimalist design, and does not provide automated data decomposition facilities. The KeLP philosophy is to provide such capabilities through layering: KeLP applications usually employ one or more Domain Specific Libraries (DSLs), application programmer interfaces providing common facilities for a particular problem class. The DSL encapsulates or hides the expert's knowledge, freeing the user from having to manage low-level implementation details. The KeLP programmer may therefore develop complicated applications in a fraction of the time required to code the application with explicit message passing, i.e. MPI, and, as previously noted, at comparable performance.

We used a DSL called DOCK to facilitate the implementation of KDISTUF. DOCK is written in KeLP and is packaged with the public distribution. DOCK supports HPF-style BLOCK decompositions [8] that were used to parallelize DISTUF. (The details may be found in the KeLP User's Guide [3].) We may view the DOCK abstractions as specialized versions of the basic KeLP classes, constructed using the C++ inheritance mechanism. The organization of a typical KeLP application is as follows.

1. Top-level KeLP/DOCK code that manages the data structures and the parallelism.
2. Lower-level code written in C++, C, or especially Fortran to handle numerical computation.
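The sketch below conveys the flavor of structural abstraction: a ghost-cell dependence is expressed geometrically by growing one processor's block by the ghost depth and intersecting it with a neighbor's block. The Region1 class and its grow and intersect methods are our own illustrative stand-ins, not the actual KeLP API.

#include <algorithm>
#include <cstdio>

// Illustrative stand-ins for KeLP-style "structural abstraction": the class
// and method names below are ours, not the actual KeLP API, but they convey
// the idea of treating index sets as first-class geometric objects.
struct Region1 {                       // a closed interval of plane indices
    int lo, hi;
    bool empty() const { return lo > hi; }
    Region1 grow(int g) const { return {lo - g, hi + g}; }     // pad by ghost depth
    Region1 intersect(const Region1& r) const {                // geometric "calculus"
        return {std::max(lo, r.lo), std::min(hi, r.hi)};
    }
};

int main() {
    // Two neighboring BLOCK-distributed slabs of planes and a ghost depth of 3,
    // as in the convection-diffusion phase described in Sect. 3.
    Region1 mine{0, 31}, neighbor{32, 63};
    Region1 needed = mine.grow(3).intersect(neighbor);  // data this processor must receive
    if (!needed.empty())
        std::printf("receive planes %d..%d from the neighbor\n", needed.lo, needed.hi);
    return 0;
}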

3 Modernization of DISTUF

As mentioned previously, we simplified the DISTUF code prior to parallelizing it, removing a post-processor and modifying the initial conditions. The resultant code was 5,300 lines long. Howe and Grimmett began the code conversion in mid-June 1997. Most of the 5,300 lines of code were converted by the end of September 1997, after approximately four man-months of effort. An additional three man-months of effort was required to tune performance. The conversion was completed in June 1998; this was a part-time effort. In the process of converting the code, the 5,300-line Fortran 77 program shrank by nearly 40% to 3,336 lines. A 492-line KeLP wrapper, written in C++, was added to handle parallelization. Other than an FFT computation, all parallelization was expressed at the KeLP level, which enhanced the modularity of the software.

The original DISTUF source code was divided into 52 separate files. The code conversion process left 42 of these files unchanged. The remaining ten files contained a total of 864 lines of code, or 26% of the total source code. However, not all of this code actually changed, and we believe that a more accurate estimate of the fraction of code that changed is closer to 20%. The conversion process for the ten modules was facilitated by good coding style in the legacy code: these modules access data only within a plane, and therefore were not subject to the dependence analysis, which was carried out by hand.

We represented the main computational data structure as a 4D array, adding a 4th index to specify the field variable number. As mentioned, most modules access the array sequentially in 2D slices. Therefore, the most logical way to distribute the data was to employ a 1D [*,*,BLOCK,*] partition. The KeLP code managed the data structures, including data distribution and the ghost cell communication required by the advection-diffusion phase. The KeLP wrapper made calls to Fortran, converting KeLP data descriptors into a form that Fortran can understand. As discussed below, the KeLP library includes an external interface to Fortran 77 to facilitate this inter-operation.

Parallel control flow is expressed using a KeLP nodeIterator, as shown in Fig. 1. The XArray4<Grid4> represents a 4-dimensional distributed array of doubles. This array, distributed outside the pprep01( ) routine, contains information about the distribution of the data across processors. Each element of the distributed array is a local 4-dimensional array living on one processor, and includes information about the local bounds of the array. The nodeIterator loop causes each processor to execute the Fortran routine pprep01( ) in SPMD fashion. Various macros and one datatype handle the details of the KeLP-to-Fortran interface, as described in the caption of Fig. 1, including the global-to-local mapping of the data array and name mangling.

The fillGhost( ) routine (not shown) handles ghost cell communication. The convection-diffusion phase employed a stencil width spanning 3 planes in both directions; thus, the ghost cell layers were three cells deep.

#define pprep01 FORTRAN_NAME(pprep01_, PPREP01, pprep01)
void pprep01(double *, FORTRAN_ARGS4, double *, FORTRAN_ARGS3);
void fillGhost(XArray4<Grid4>& X);

void f_prep(XArray4<Grid4>& A) {          // XArray is a distributed array
  fillGhost(A);                           // Fill ghost cells (not shown)
  Array3 awork(NNI+2,NNK+2,15);           // scratch array like an f90
                                          // allocatable array
  FortranRegion3 FW(awork.region());      // shape of the array for fortran

  for (nodeIterator ni(A); ni; ++ni) {    // Execute pprep01( ) in parallel
    int p = ni();                         // on each processor p
    FortranRegion4 FR(A(p).region());     // The region (bounding box) for
                                          // processor p's local subarray
    pprep01(FORTRAN_DATA(A(p)), FORTRAN_REGION4(FR),
            FORTRAN_DATA(awork), FORTRAN_REGION3(FW));
  }
}

Fig. 1. KeLP code for stencil-based computation. The KeLP-to-Fortran interface is managed by the macros FORTRAN_REGION, FORTRAN_DATA, and FORTRAN_NAME, along with the FortranRegion datatype. FORTRAN_NAME handles name mangling; the remaining constructs handle the global-to-local mapping of the data.
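The following sketch makes the [*,*,BLOCK,*] decomposition and the three-deep ghost layers concrete. It is illustrative C++ of our own; the Slab type and localPlanes helper are hypothetical and are not part of KeLP or DOCK.

#include <cstdio>

// Sketch of the [*,*,BLOCK,*] decomposition used for KDISTUF's 4D array:
// only the third (plane) index is split across processors; the helper names
// here are hypothetical, not part of KeLP or DOCK.
struct Slab { int lo, hi; };   // planes owned by one processor (inclusive)

static Slab localPlanes(int nplanes, int nprocs, int p) {
    int base = nplanes / nprocs, extra = nplanes % nprocs;   // near-even BLOCK split
    int lo = p * base + (p < extra ? p : extra);
    int hi = lo + base + (p < extra ? 1 : 0) - 1;
    return {lo, hi};
}

int main() {
    const int N = 128, P = 8, GHOST = 3;   // 3-deep ghost layers, as in the text
    for (int p = 0; p < P; ++p) {
        Slab s = localPlanes(N, P, p);
        // Ghost indices outside 0..N-1 wrap around, since boundary conditions
        // are periodic.
        std::printf("proc %d owns planes %3d..%3d, allocates %3d..%3d with ghosts\n",
                    p, s.lo, s.hi, s.lo - GHOST, s.hi + GHOST);
    }
    return 0;
}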

The pressure solver has a non-localized communication structure, as it must compute two-dimensional Fourier transforms in planes orthogonal to those employed in the first phase. However, unlike the convection-diffusion code, we handled some aspects of parallelization within the Fortran module. We originally installed KDISTUF on the Cray T3E and used SCILIB's FFT, which was able to interoperate with distributed KeLP arrays. However, the IBM PESSL library did not have a compatible FFT, and so we wrote Fortran code to rearrange data in preparation for calls to the manufacturer-supplied parallel 2D FFT routine, pscfft2D.

A major difficulty in modernizing this legacy application was contending with sequence association. This now-deprecated practice has historically been used to emulate dynamic memory and to improve vector lengths on vector architectures. This infamous problem is well known to the HPF and Fortran 90 community [9, 2, 1], and it seriously impedes the modernization of large legacy Fortran 77 codes in general. We therefore had to restructure the code significantly, de-linearizing 1-dimensional arrays and loops back to their multidimensional counterparts.
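The sketch below conveys the kind of restructuring involved, rendered in C++ rather than Fortran 77 for brevity: the same sweep written with a single linear index and with explicit (i,j,k) indices, the latter being the form that exposes the plane structure needed by the BLOCK decomposition. It is an illustration of the idea, not code from KDISTUF.

#include <cstdio>
#include <vector>

// The flavor of the sequence-association problem, illustrated in C++ rather
// than Fortran 77. Legacy code often sweeps a logically 3D field through one
// linear index (long vector lengths, emulated dynamic memory), which hides
// the array's true shape.
void scaleLinearized(std::vector<double>& a, double s) {
    for (std::size_t m = 0; m < a.size(); ++m)
        a[m] *= s;
}

// The de-linearized form: explicit (i,j,k) indices expose the plane
// structure, so a [*,*,BLOCK,*]-style split over the third index becomes
// straightforward.
void scale3D(std::vector<double>& a, int ni, int nj, int nk, double s) {
    for (int k = 0; k < nk; ++k)
        for (int j = 0; j < nj; ++j)
            for (int i = 0; i < ni; ++i)
                a[static_cast<std::size_t>((k * nj + j) * ni + i)] *= s;
}

int main() {
    std::vector<double> field(8 * 8 * 8, 1.0);   // a small 8x8x8 example field
    scaleLinearized(field, 2.0);
    scale3D(field, 8, 8, 8, 0.5);                // undoes the scaling above
    std::printf("field[0] = %g\n", field[0]);    // prints 1
    return 0;
}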

4 Results

We ran KDISTUF on an IBM SP2 with 160 MHz POWER Super Chip (P2SC) processors. Each node had 256 Megabytes of memory. We report timings for various problem sizes and numbers of processors. Table 1 reports fixed-size speedups with N=128. On 8 processors the running time is about 1200 sec. per unit of simulated time. Performance is comparable with that of the legacy DISTUF code running on the Cray C90, about 1100 sec.

Table 1. Timings in CPU seconds for KDISTUF running on the IBM SP2 with N=128. We ran for 256 timesteps, which corresponds to 1.0 units of simulated time.

                    1 CPU      2 CPU           4 CPU           8 CPU
                    Time       Time     %      Time     %      Time     %
  Computation        7856      3661    87      1832    83       922    77
  Communication       N/A       446    11       266    12       181    15
    FillGhost         N/A        56     1.3      56     2.5      56     4.7
    Transpose         N/A       390     9.2     210     9.5     125    10
  Miscellaneous       N/A       112     2.6     103     4.7     108     9.0
  Total              7856      4219   100      2201   100      1204   100
  Efficiency         1.00      0.93            0.89            0.82
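The Efficiency row of Table 1 is the usual fixed-size parallel efficiency E(P) = T(1) / (P * T(P)); the short check below (ours) reproduces it from the Total row.

#include <cstdio>

int main() {
    // Fixed-size parallel efficiency E(P) = T(1) / (P * T(P)), computed from
    // the "Total" row of Table 1 (times in CPU seconds).
    const double t1 = 7856.0;
    const int    procs[] = {2, 4, 8};
    const double total[] = {4219.0, 2201.0, 1204.0};
    for (int i = 0; i < 3; ++i)
        std::printf("P = %d: efficiency = %.2f\n", procs[i], t1 / (procs[i] * total[i]));
    // Prints 0.93, 0.89, 0.82, matching the Efficiency row.
    return 0;
}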

The timings in Table 1 are broken down into 3 parts: computation, communication, and miscellaneous. Communication is further broken down into the time to fill in ghost cells and the time to perform the transpose for the FFT. Miscellaneous work includes initialization and output. Computational work scales almost perfectly, initialization and output times are insignificant, and ghost cell communication is a modest constant. The bottleneck for this computation is the transpose, but if we scale the problem appropriately with the number of processors we may effectively run the larger problems that motivated this effort.

We also ran larger problem sizes: N=256 and N=384. Due to an as-yet unresolved memory allocation problem, we were unable to collect fixed-size speedups, and we report just one set of timing data for each problem size. For N=256, KDISTUF runs on 16 processors at a rate of 4.3 hours of CPU time per unit of simulated time (a total of 512 timesteps). The code spends 88% of its time in local computation. Communication accounts for just 13% of the total running time, with 70% of that time spent in the transpose. We also ran a problem with N=384 on 64 processors. However, data for this case are inconclusive due to the unresolved memory allocation problem; as a stopgap measure we were forced to allocate more processors than the problem size indicated (32). Not surprisingly, communication costs were significant, with the transpose accounting for 33% of the total running time and local computation accounting for just 53%. We are currently investigating this problem so that we can reduce communication costs and hence run at higher resolution.
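The following sketch (ours) indicates why the transpose is communication-intensive: re-blocking a field from a split over one axis to a split over an orthogonal axis, as the 2D FFTs of the pressure solve require, is an all-to-all exchange of sub-blocks. It assumes a single N^3 array of doubles with N divisible by P, and it is an illustration rather than the PESSL or KDISTUF code.

#include <cstdio>

int main() {
    // The field is distributed [*,*,BLOCK,*] (split over the third index), but
    // the 2D FFTs need complete planes in an orthogonal orientation, so every
    // processor must send a sub-block of its slab to every other processor.
    // Assumes N divisible by P and one N^3 array of doubles.
    const long N = 256, P = 16, BYTES = 8;
    const long slab = N / P;                            // planes per processor
    const long block_bytes = slab * slab * N * BYTES;   // piece sent to one peer
    std::printf("each of %ld processors sends %ld blocks of %.2f MB (%.1f MB in all)\n",
                P, P - 1, block_bytes / 1048576.0, (P - 1) * block_bytes / 1048576.0);
    return 0;
}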

5 Discussion and Conclusions

Converting legacy software is an economically important activity, but it is also a delicate process. Production users are reluctant to jeopardize an investment in working software. But they are keenly aware of the need to periodically upgrade application software in order to leverage the latest technological advances in their computing environment. Though our experience has been a positive one, our results were obtained for a relatively small application code consisting of thousands of lines of Fortran 77. By comparison, large industrial or research codes are much bulkier, hundreds of thousands to millions of lines long. We have also ignored management issues, since our programming team comprised just two individuals. Nevertheless, some important lessons have been learned here which generalize to larger-scale applications.

Our approach to modernizing a legacy code resulted in good parallel speedups. We used the KeLP run-time library, which facilitated the process. The modernization went more quickly than anticipated, though we were slowed down by difficulties with sequence association, memory consumption (as yet unresolved), and the lack of an appropriate FFT routine. We cannot overemphasize the need for standardized FFTs that deliver portable performance across diverse computing platforms.

In "KeLPifying," or parallelizing, the DISTUF Fortran 77 code, we employed an important underlying principle of "minimal disturbance." We attempted to work around difficulties in the Fortran 77 code at the KeLP (C++) level rather than recoding the Fortran. This strategy paid off: we had to change only about 20% to 25% of the original Fortran code and added just under 500 lines of C++ wrapper code. However, we had to settle for less than optimal performance in the interest of conserving programming effort. For example, we did not experiment with 2-D partitionings, which may be more efficient than the 1-D partitionings on larger numbers of processors. The changes to the Fortran 77 code would have been extensive, though the KeLP wrapper supports alternative partitionings. Code re-use is extremely important and has led to success in other conversion efforts, for example using PCN [6].

A major consideration in modernizing DISTUF is to increase the amount of available memory in order to improve resolution. The resultant improved code, called KDISTUF, will be employed by Grimmett and Nomura on high-performance parallel computers. Using this parallel code, they will be able to compute at higher resolutions than possible with the legacy code running on the Cray T94. Because KeLP applications are portable, KDISTUF may also run on various parallel computers such as the Cray T3E and on clusters of workstations.


Acknowledgments

This work was supported in part by U.S. Office of Naval Research Contract N00014-94-1-0657, the National Partnership for Advanced Computational Infrastructure (NPACI) under U.S. National Science Foundation contract ACI-9619020, and a University of California, San Diego faculty startup award. The IBM SP2 used in this study is located at the San Diego Supercomputer Center (SDSC); computer time was supported both by SDSC and by a UCSD School of Engineering block grant. The authors wish to acknowledge the assistance of Stephen J. Fink, who designed and implemented KeLP.

References

1. T. Brandes and K. Krause. Porting to HPF: Experiences with DBETSY3D within PHAROS. In Proc. 2nd Annual HPF User Group Meeting, June 1998.
2. M. Delves and H. Luzet. SEMC3D code port to HPF. In Proc. 2nd Annual HPF User Group Meeting, June 1998.
3. S. J. Fink and S. B. Baden. The KeLP user's guide, v1.0. Technical report, Dept. of Computer Science and Engineering, Univ. of California, San Diego, March 1996.
4. S. J. Fink, S. B. Baden, and S. R. Kohn. Run-time support for irregular block-structured applications. Journal of Parallel and Distributed Computing, 1998. In press.
5. S. J. Fink, S. R. Kohn, and S. B. Baden. Flexible communication mechanisms for dynamic structured applications. In IRREGULAR '96, Santa Barbara, California, August 1996.
6. I. Foster, R. Olson, and S. Tuecke. Productive parallel programming: The PCN approach. J. of Sci. Prog., 1:51-66, 1992.
7. T. Gerz, U. Schumann, and S. Elghobashi. Direct simulation of stably stratified homogeneous turbulent shear flows. J. Fluid Mech., 200:563-594, 1989.
8. High Performance Fortran Forum, Rice University, Houston, Texas. High Performance Fortran Language Specification, November 1994.
9. C. Koelbel. Making HPF work: Past success and future challenges. In Workshop on HPF for Real Applications, July 1996.
10. K. K. Nomura and S. E. Elghobashi. Mixing characteristics of an inhomogeneous scalar in isotropic and homogeneous sheared turbulence. Phys. Fluids, 4:606-625, 1992.
11. K. K. Nomura and G. K. Post. The structure and dynamics of vorticity and rate-of-strain in incompressible homogeneous turbulence. (Submitted to J. Fluid Mech.), 1997.
12. K. K. Nomura, G. K. Post, and P. Diamessis. The interaction of vorticity and rate-of-strain in turbulent homogeneous shear flow. (In preparation), 1998.
13. U. Schumann. Dynamische Datenblock-Verwaltung in FORTRAN. Technical report, Institut für Reaktorentwicklung, Gesellschaft für Kernforschung mbH, Karlsruhe, Germany, August 1974.
14. U. Schumann. Algorithms for direct numerical simulations of shear-periodic turbulence. In Soubbaramayer and J. P. Boujot, editors, Lecture Notes in Physics, volume 218, pages 492-496. Springer, 1985.
