A General Programming Model for Developing Scalable Ocean Circulation Applications

Aaron C. Sawdey and Matthew T. O’Keefe
Laboratory for Computational Science and Engineering
4-174 EE/CS Building, University of Minnesota
Minneapolis, MN 55455
(612) 625-6306
[email protected]

Wesley B. Jones
Silicon Graphics Computer Systems
2011 N. Shoreline Blvd. MS 580
Mountain View, CA 94043

January 6, 1997
Abstract

In this paper we describe our efforts in porting a global ocean model—the Miami Isopycnic Coordinate Ocean Model, or MICOM—to clusters of symmetric multiprocessors (SMPs). This work extends our previous efforts in porting this same application to the massively parallel Cray T3D machine. Our programming environment provides a “defensive” yet general programming model that works efficiently across PVP, MPP, SMP cluster, and DSM (Distributed Shared Memory) hardware. SC-MICOM, the SMP cluster version of our ocean code, achieves good scalability both within and across the SMP architecture: it has achieved up to 120 Mflops (C90 equivalent) on a single 195 MHz R10000 processor of the new SGI Origin2000 machine and almost 90 Mflops per processor on a 30-processor Origin machine. This code simultaneously exploits locality and parallelism to improve single node performance while tolerating potentially large network latencies and slow main memory systems.

This research was supported by the Office of Naval Research under grant no. N00014-941-0846, by the National Science Foundation under grant no. CDA-9414015 and grant no. ASC-9523480, and by Silicon Graphics Inc. through an equipment grant and the “Distributed Virtual Memory Research Project,” a joint research project between Silicon Graphics and the U.S. Army Research Laboratory supported through the DoD High Performance Computing Modernization Program.
1 Introduction

With the recent introduction of the Cray T3E, SGI Origin2000, HP Exemplar and the Fujitsu VPP500, the performance of highly parallel computers is beginning to rapidly outstrip that of traditional shared-memory vector machines. For some users in the scientific computing community, such as the European Centre for Medium-Range Weather Forecasts (ECMWF) [5] and NRL’s Layered Ocean Model (NLOM) team [9, 10], parallel processing has become routine. These groups and other pioneering users in the community have been willing to parallelize their applications and endure rapid changes in the underlying parallel technology. The lessons learned in these developments must now be distilled into better tools (including languages, compilers, source-to-source translators, libraries, debuggers, operating systems, and pre- and post-processing software), better codes (through a consistent set of code design principles), and mechanisms to distribute this technology throughout the high performance computing user community. Much work remains before the larger climate, weather, and oceanographic applications community can benefit from parallelism.
1.1 The Fortran-P Programming Model

Our team has focused on developing a programming model and supporting tools that make parallel programming easier for climate, weather, and oceanographic models and other regular grid applications. This focus arose out of the extremely demanding requirements of this class of applications:
- Atmospheric codes, especially operational weather models, typically must run within a fixed time limit or the numerical results are uninteresting.

- Oceanographic codes, though less likely to have operational time constraints, are notorious for requiring huge amounts of both memory and processor time.
As Bleck [4] has pointed out:

    Though basin-scale ocean circulation simulations can resolve wind-driven horizontal circulation cells (gyres) and thermally-driven vertical-meridional overturning cells with coarse-mesh models at least qualitatively, obtaining a quantitatively correct picture of oceanic current systems requires models capable of resolving not only the gyre-scale motion but also flow instabilities which manifest themselves on scales of only a few tens of kilometers.

Thus, computing requirements in ocean modeling have always been and will continue to be high. In addition to weather and oceanographic applications, we have collaborated with Paul Woodward and his team to develop parallel versions of PPM, a compressible gas dynamics code used in the study of turbulence, convection in stars, and a variety of high Mach number flows [1]. This effort focused on source-to-source translation of a subset of Fortran 77—known as Fortran-P—that allows programmers to cleanly express numerical algorithms based on regular grids. The goal of the Fortran-P model is to allow programmers to write serial Fortran code which can then be translated to parallel code (footnote 1): a domain decomposition is applied so that the regular grid is partitioned across processors in the parallel machine. Aggressive use of halo regions allows many loop nests to execute before communication or synchronization is required [1]. We were successful in developing a Fortran-P translator and applied various versions of it to help parallelize several codes, including PPM [1, 13] and ARPS [6, 7], a non-hydrostatic mesoscale weather prediction model developed by K. Droegemeier at the University of Oklahoma. The Fortran-P source-to-source translator was targeted to generate CM Fortran, a Fortran 90 dialect with directives to indicate data layout in distributed memory machines. An early precursor of HPF (High Performance Fortran), CM Fortran was successful in allowing programmers to quickly write codes in data parallel style and execute them on the CM-5. However, users who demanded high efficiency often found that the high-level expression of parallelism inherent in CM Fortran made it difficult to express (to the compiler) optimizations that would significantly increase performance, often by factors of 2 to 10 [1, 3, 9]. In addition, very low performance sometimes resulted from using certain source code expressions (loop structures versus array syntax patterns) that were equivalent to others that were highly efficient; it was generally hard to develop any intuition about what would perform well, and one depended on experimentation and discussions with other users to determine the programming idioms necessary to attain good performance.
Footnote 1: This parallel code could be data parallel (such as HPF), message-passing, threads, or perhaps even F-- [14].
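To make the fake-zone idea concrete, the following minimal sketch (ours, not taken from the Fortran-P translator; all sizes are illustrative, and we use C rather than Fortran 77 purely for brevity) shows how padding a 1-D subdomain with ghost cells lets several stencil sweeps run before any halo exchange is needed:

    /* Illustrative sketch of the fake-zone (halo) idea: a 1-D subdomain of
     * N interior cells is padded with NG ghost cells on each side, so NG
     * relaxation sweeps can run before any communication or
     * synchronization is required. */
    #include <stdio.h>
    #include <string.h>

    #define N   16     /* interior cells owned by this subdomain            */
    #define NG  4      /* ghost (fake-zone) width: sweeps between exchanges */

    int main(void)
    {
        double u[N + 2 * NG], tmp[N + 2 * NG];

        for (int i = 0; i < N + 2 * NG; i++)          /* initial condition  */
            u[i] = (i >= NG && i < NG + N) ? 1.0 : 0.0;

        /* NG sweeps of a 3-point average; each sweep shrinks the valid
         * region by one cell on each side, so no halo refill is needed
         * until all NG sweeps are done. */
        for (int sweep = 1; sweep <= NG; sweep++) {
            for (int i = sweep; i < N + 2 * NG - sweep; i++)
                tmp[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0;
            memcpy(u + sweep, tmp + sweep,
                   (N + 2 * NG - 2 * sweep) * sizeof(double));
        }

        /* ...here a parallel code would refill the NG ghost cells from the
         * neighboring subdomains (one exchange per neighbor per NG sweeps). */
        printf("u[NG] = %f\n", u[NG]);
        return 0;
    }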
2 The Massively Parallel MICOM: MP-MICOM

Driven by a desire to gain more portability and control over the performance achieved, and to explore how Fortran-P would map to other parallel machines, we began to explore message-passing implementations for Fortran-P-based applications late in 1993. At around that time we began a collaboration with two groups: one was the University of Miami ocean modeling team, led by Rainer Bleck, and the other was Cray Research, through analyst Robert Numrich. The goal of the Miami collaboration was to modify (initially manually, later semi-automatically) the source code of the Miami Isopycnic Coordinate Ocean Model (MICOM) so that it would fit the constraints of the Fortran-P model. Cray was introducing the Cray T3D in the Fall of 1993 and we targeted this machine using Numrich’s SHMEM (get/put) one-sided message passing. Numrich, the developer of the SHMEM interface, was looking for users to try out his model and we were willing: we found it to be both elegant and efficient relative to other message-passing implementations, such as PVM, available at that time. However, we had a problem: we had no direct access to the Cray T3D during the Fall of 1993, which meant that we could not get started on the MICOM port to the T3D until early 1994 unless we could find a way to simulate Numrich’s SHMEM API on some other hardware. Conveniently, we had a 20-processor R4400-based SGI Challenge machine available to us for pre- and post-processing simulation datasets. We developed a SHMEM library for the SGI Challenge using IRIX mechanisms for mapping together portions of share group processes’ address spaces so that they could transparently share data with other processes in the group across the SHMEM interface [2, 8]. In the case of MICOM, halo regions around each subdomain mapped to a processor were shared using this mechanism. Using the SHMEM library and software routines for parallel input and output and dataset preparation developed at Minnesota, we successfully parallelized MICOM and performed some very high-resolution North Atlantic circulation calculations with the Cray T3D [4, 24]. We call this first parallel version of the ocean code MP-MICOM (short for MultiProcessor or Massively Parallel, depending on your architectural bias). On these machines, both with relatively small secondary caches, we achieved about 10 Mflops per processor (Alpha 21064, MIPS R4400) [24].
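The flavor of the get/put style can be seen in the following sketch of a one-point halo exchange. It is written against the modern OpenSHMEM C interface rather than the Cray SHMEM Fortran calls actually used in MP-MICOM, and all names and sizes are illustrative:

    /* Sketch of one-sided (put) halo exchange in the SHMEM style. */
    #include <shmem.h>

    #define NLOC 128                 /* interior points per PE (illustrative) */

    int main(void)
    {
        shmem_init();
        int me   = shmem_my_pe();
        int npes = shmem_n_pes();

        /* Symmetric array: NLOC interior points plus one halo point on
         * each side, allocated at the same address on every PE. */
        double *u = shmem_malloc((NLOC + 2) * sizeof(double));
        for (int i = 0; i <= NLOC + 1; i++)
            u[i] = (double)me;

        shmem_barrier_all();

        /* Push my last interior value into the left halo cell of my right
         * neighbor, and my first interior value into the right halo cell
         * of my left neighbor (periodic wrap for simplicity). */
        int right = (me + 1) % npes;
        int left  = (me - 1 + npes) % npes;
        shmem_double_put(&u[0],        &u[NLOC], 1, right);
        shmem_double_put(&u[NLOC + 1], &u[1],    1, left);

        shmem_barrier_all();         /* halos now valid; compute may proceed */

        shmem_free(u);
        shmem_finalize();
        return 0;
    }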
3 The SMP Cluster MICOM: SC-MICOM

At this point, we could have gone back directly to our original mission of developing parallelizing translators for regular grid applications, but we felt that, although our MICOM ports to the Cray T3D and SGI Challenge machine were successful, these architectures were not fully general in that they had only a single level of cache locality and parallelism. On the horizon we could see future
parallel machines that would consist of clusters of symmetric multiprocessors—some vendors, such as HP/Convex (Exemplar) and Silicon Graphics (Power Challenge Array), were already beginning to launch products (footnote 2)—and that our original approach would not provide the efficiency and scalability that applications demanded on these architectures (footnote 3). We wanted to determine the parallel source code constructs we should generate to be most efficient for this new, general class of parallel machines. In addition, with processor secondary cache capacities growing rapidly relative to efficient grid subdomain sizes (they are now at 4 MBytes, and next-generation microprocessors will likely have 8- to 16-MByte secondary caches), we saw that the day would soon come when we could easily fit all of a single subdomain from a domain-decomposed grid into secondary cache, gaining a tremendous performance boost and putting microprocessors within striking distance of the fastest vector processors. So we began an effort to introduce another level of domain decomposition within the “outer” level of domain decomposition across SMPs. For our SMP cluster version of MICOM—which we have dubbed SC-MICOM—this “inner” decomposition, shown in Figure 1, would provide parallelism that multiple processors within the SMP could efficiently load-balance and execute, would also provide a tuning mechanism to precisely control the subdomain size and aspect ratio to optimize cache performance, and would open up significant opportunities for overlapping communication with computation between SMP boxes [11]. We were, however, very careful to isolate the parallel control code from the physics and dynamics routines so that the API for the modeler and physics package developers was kept simple and understandable.

In SC-MICOM, each SMP is assigned a rectangular subsection of the global problem domain; in Figure 1 four subsections (SMP 1-4) have been assigned to the four SMP machines in this configuration. Within each SMP, each domain subsection is further decomposed into small subdomains (sometimes referred to as patches); Figure 1 shows the 16 subdomains into which the subsection for SMP 1 has been partitioned. The grid arrays associated with each subdomain have storage for fake zones (also known as halo or ghost cells) which exploit redundant computation and memory storage to reduce communication and synchronization frequency [7, 27, 28].
Footnote 2: Our collaborator Paul Woodward (Director of the LCSE) and his team, including David Porter, B. Kevin Edgar and Steve Anderson, performed the largest fluid dynamics calculation (1024x1024x1024 grid points) on an array of SGI Challenge machines made available to them at SGI’s Mountain View manufacturing facility. SGI Chief Executive Officer Ed McCracken later stated that SGI’s decision to introduce the Power Challenge array was given a strong boost by the success of this calculation.

Footnote 3: It should be pointed out that MPP machines such as the Cray T3E and Fujitsu VPP500 are a subset of the SMP cluster class of machines for which each "SMP" has 1 processor. Uniprocessors are a subset with only 1 processor and 1 “SMP”. Hence, we argue that this class of machines is fully general and therefore the most appropriate as a “defensive” programming target for portable and efficient code.
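The two-level mapping can be summarized by a short index calculation. The following sketch is purely illustrative: the grid and patch sizes, and the row-major ordering of SMPs and patches, are assumptions rather than the actual SC-MICOM layout.

    /* Illustrative sketch of the two-level decomposition of Figure 1: the
     * global grid is first cut into rectangular subsections, one per SMP,
     * and each subsection is further cut into small patches that any work
     * thread inside that SMP may compute. */
    #include <stdio.h>

    #define GX 800          /* global grid points in x                   */
    #define GY 800          /* global grid points in y                   */
    #define SMPX 2          /* SMP subsections in x and y (4 SMPs total) */
    #define SMPY 2
    #define PX 4            /* patches per SMP subsection in x and y     */
    #define PY 4

    int main(void)
    {
        int gi = 637, gj = 121;                       /* some global cell */

        int sub_x = GX / SMPX, sub_y = GY / SMPY;     /* subsection size  */
        int pat_x = sub_x / PX, pat_y = sub_y / PY;   /* patch size       */

        int smp   = (gj / sub_y) * SMPX + (gi / sub_x);           /* outer */
        int patch = ((gj % sub_y) / pat_y) * PX
                  + ((gi % sub_x) / pat_x);                       /* inner */

        printf("cell (%d,%d) -> SMP %d, patch %d\n", gi, gj, smp, patch);
        return 0;
    }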
Figure 1: The two-level domain decomposition used in SC-MICOM.

Any work thread in the SMP can compute any subdomain, while a single I/O thread handles MPI communication between SMP boxes and serial code within a box. For the model arrays, the first indices give the physical location within the subdomain and the last index is the subdomain number. The physics and dynamics routines never see the subdomain number; array variables are passed in as subroutine arguments which contain all the data needed to compute on that subdomain, as shown in Figure 2. Note that no communication happens directly in these routines. Instead, a separate communications function for each physics routine is passed to the thread generator and is used to indicate which array variables need their fake zones updated. This update function is in fact overloaded to perform three separate tasks (a sketch follows below):
- copy data for updating fake zones in a specified subdomain;

- copy data from a subdomain to an outgoing message buffer;

- copy data from an incoming message buffer to subdomain fake zones.

The API for the main timestep control routine is characterized by single-threaded code regions for performing I/O and initialization tasks and calls to the threadit routine to accomplish parallel work. The barotropic portions of the code are somewhat more involved than the other physics routines that are part of the baroclinic phase. Three barotropic timesteps are performed between communication events; this number is limited by the number of fake zones being used.
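The three-way overloading of the update function can be pictured as a simple dispatch. The names, argument list, and C phrasing below are illustrative only and do not reproduce the actual SC-MICOM routine:

    /* Illustrative sketch of the overloaded fake-zone update described
     * above: one routine, three copy directions. */
    #include <stddef.h>
    #include <string.h>

    enum update_mode {
        UPDATE_INTERNAL,   /* copy edge data between neighboring subdomains  */
        UPDATE_PACK,       /* copy edge data into an outgoing message buffer */
        UPDATE_UNPACK      /* copy incoming message buffer into fake zones   */
    };

    /* One call per array variable that needs its fake zones refreshed. */
    void update_fake_zones(enum update_mode mode,
                           double *fake_zone,      /* destination halo strip   */
                           const double *edge,     /* source edge of a patch   */
                           double *msg_buf,        /* outgoing/incoming buffer */
                           size_t n)
    {
        switch (mode) {
        case UPDATE_INTERNAL:
            memcpy(fake_zone, edge, n * sizeof(double));
            break;
        case UPDATE_PACK:
            memcpy(msg_buf, edge, n * sizeof(double));
            break;
        case UPDATE_UNPACK:
            memcpy(fake_zone, msg_buf, n * sizeof(double));
            break;
        }
    }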
Figure 2: SC-MICOM data layout within the SMP.

The threadit routine takes a physics function and a physics update function as arguments. Within threadit, an atomic fetch-and-add is used in a loop to assign subdomains to work threads (a sketch of this scheduling loop follows Figure 3). After a work thread finishes a subdomain, it notifies the I/O thread and starts computing the next available subdomain (footnote 4). When all subdomains for that threadit call have been computed, the subdomain fake zones are updated. Note that within an SMP box, subdomains that must be computed before sending messages to other SMPs (i.e., the patches on the perimeter of the subsection assigned to the SMP) are computed first, as shown in Figure 3. When all subdomains for a message are finished, a single large message is packed up and sent. Once all messages have been sent, the SMP control code shifts to receiving the messages that neighboring SMPs are sending. These are single, large message transfers containing all the arrays needed for updating neighboring fake zones along the entire subsection edge of the SMP. This approach maximizes the potential for overlapping computation and communication in the parallel code. In one experiment across 2 SGI Power Challenges connected by a 10-megabit Ethernet connection, we found this overlapping improved performance by 25% compared to no overlap.
Footnote 4: This is often referred to as self-scheduling or dynamic scheduling.
1. Compute the cells along the edge that need to be sent to neighboring SMPs.
2. Begin computing cells along the interior. Meanwhile, messages from neighboring SMPs begin to arrive.
3. All of the messages arrive and are filled into ghost cells before the computation finishes so the next computation step can begin without waiting for messages.
Figure 3: Outer subdomains are computed first to allow communication to overlap with computation.
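A minimal sketch of the self-scheduling loop inside threadit follows. The actual SC-MICOM control code is Fortran running under IRIX threads, so this C/pthreads version, with its hypothetical names, is illustrative only:

    /* Illustrative sketch of self-scheduling via atomic fetch-and-add. */
    #include <pthread.h>
    #include <stdio.h>

    #define NSUB     16               /* patches assigned to this SMP */
    #define NTHREADS  4               /* work threads                 */

    typedef void (*physics_fn)(int subdomain);

    static int next_sub = 0;          /* shared counter for fetch-and-add */

    static void dummy_physics(int s)  /* stand-in for a physics routine   */
    {
        printf("computed patch %d\n", s);
    }

    struct work { physics_fn fn; };

    static void *worker(void *arg)
    {
        struct work *w = arg;
        for (;;) {
            /* The atomic fetch-and-add hands out the next available patch;
             * in the real code perimeter patches are ordered first so that
             * outgoing messages can be packed and sent early. */
            int s = __atomic_fetch_add(&next_sub, 1, __ATOMIC_RELAXED);
            if (s >= NSUB)
                break;
            w->fn(s);
            /* ...here the worker would notify the I/O thread that patch s
             * is finished before grabbing the next one. */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        struct work w = { dummy_physics };

        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&tid[i], NULL, worker, &w);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(tid[i], NULL);

        /* after all patches are done, the fake zones would be updated here */
        return 0;
    }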
3.1 SC-MICOM on the SGI Power Challenge Array

We measured SC-MICOM performance on both the SGI Power Challenge and the new SGI Origin2000 machines. All performance measurements are stated in C90-relative Mflops: we measured floating-point operations and computation time on the C90 and then used the relative time on the other architectures to scale the Mflops number appropriately. Both architectures use the MIPS R10000 microprocessor.

SC-MICOM has performed well on current SMP clusters, yielding good speedups within and across SMPs even on slow networks and achieving nearly 9 times the single node performance (about 87 Mflops sustained) on the latest MIPS microprocessor, the 195-MHz R10000, compared to our earlier parallel MICOM runs on the Alpha 21064 and MIPS R4400 chips. We expect further improvements in processor clock frequency (from 195 MHz to 275 MHz), reductions in networking overhead and latency, and improvements in the MIPS compilers and our own parallel software libraries to increase this performance in the future.

As noted previously, the “inner” decomposition of an SMP subsection into subdomains allows the size and aspect ratio of a patch to be chosen for best performance. For SC-MICOM using a perfect 800x800x16 “bathtub” grid on an R10000-based SGI Power Challenge machine, requiring 1.86 GBytes of memory, the best patch configuration turned out to be roughly 130x40 (with 16 layers), which yielded 87 Mflops per processor. In contrast, when a single 500x500x16 SC-MICOM grid was computed on the same machine without an inner decomposition, it achieved only 42 Mflops, less than half the per-processor performance obtained with a well-chosen inner decomposition. Though further study is necessary, we found that patches that are long in the x-direction—the stride-1 direction—achieved the best performance, presumably due to better cache re-use. Performance increased by about 23% when larger 2-MByte secondary caches were used instead of 1-MByte caches.
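Stated as a formula, the C90-relative rating used throughout this section is (in our notation):

\[
\mathrm{Mflops}_{\mathrm{arch}} \;=\; \frac{F_{\mathrm{C90}}}{t_{\mathrm{arch}}}
\;=\; \mathrm{Mflops}_{\mathrm{C90}} \times \frac{t_{\mathrm{C90}}}{t_{\mathrm{arch}}},
\]

where \(F_{\mathrm{C90}}\) is the floating-point operation count measured on the C90 and \(t_{\mathrm{C90}}\), \(t_{\mathrm{arch}}\) are the run times of the same computation on the C90 and on the architecture being rated.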
Figure 4: Performance per thread as the number of processors (threads) is increased. (Axes: Mflops per thread vs. total Mflops, from 1 to 19 threads; processors are 195 MHz R10000 with 2 MB L2 caches.)
Figure 5: The parallel efficiency of the internal fake zone updates is poor compared to that of the numerical computations. (Axes: total CPU time / single-thread CPU time vs. number of threads, for a 320x320 problem divided into a 4x6 patch grid on 90 MHz R8000 processors with 4 MB L2 caches; timings normalized to a single thread.)
Figure 6: Illustration of false sharing within L2 cache lines in MICOM: along the stride-1 direction, processor 1 reads and processor 2 writes data that fall in the same L2 cache line.

About 325 Mflops were achieved on 4 processors, while 16 processors yielded over 800 Mflops within a single SMP box. Scaling with increasing numbers of processors inside the SMP was limited primarily, we believe, by cache coherence, bus bandwidth, and synchronization overhead. Figure 4 dramatically illustrates the limited scaling past about 12 threads. Through the use of the R10000 hardware counters and some analysis of the code, we have traced the scalability problem to false sharing [15] of cache lines. As described earlier, data is copied between subpatches within each SMP to update the fake zones of all the subpatches. Figure 5 shows how the internal fake zone update part of SC-MICOM scales poorly compared to the numerical computation part. Figure 6 illustrates how multiple processors can be reading and writing the same L2 cache line at the same time, causing excessive numbers of cache invalidations during this phase of the computation. This problem could be eliminated through the use of an additional buffer to store data destined for fake zones as it is computed, so that a processor would only be writing to its own buffer rather than to the shared cache line illustrated in Figure 6.

Scaling across multiple SMP boxes was measured on a 4-SMP SGI Challenge Array at the University of Minnesota. Each Challenge machine had four R10000 processors (with 2-MByte L2 caches) active; the 4 machines were interconnected via a fast HiPPI network. On a 750x750x16 bathtub grid decomposed across the 4 SMPs (375x375x16 grid points per box) we achieved 990 Mflops across the 16 processors in these 4 machines, nearly 200 Mflops more than was achieved on a similar grid within a single SMP with 16 processors.
Figure 7: Internal network for a 32-processor SGI Origin2000 system. Each node contains one or two processors and a memory; nodes are connected by routers and links, with optional express links.

Further experiments showed that the per-SMP performance decreased by 10% when going from 1 to 2 SMPs (doubling the problem size) and by a further 14% when going from 2 to 4 SMPs (doubling the problem size again). These tests were performed without the latest SGI software for fast transfers over HiPPI, known as "HiPPI bypass," which avoids buffering the data in kernel space and TCP/IP processing, both of which slow transfers and consume significant processor resources and bandwidth. These scaling results show both the efficiency and the potential scalability of our programming model for SMP clusters.
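Returning to the false-sharing problem of Figure 6, the buffered fake-zone update suggested above can be pictured as follows. The structure names, sizes, and padding scheme are illustrative assumptions, not the SC-MICOM implementation:

    /* Sketch of a buffered fake-zone update as a cure for false sharing:
     * instead of writing edge values directly into a neighbor's halo strip
     * (whose L2 cache lines may be shared with data another thread is still
     * writing), each thread stages its edge values in a padded buffer of
     * its own; halos are then filled from these buffers in a later pass. */
    #include <string.h>

    #define NEDGE 130                 /* edge length of a patch (illustrative) */
    #define CACHE_LINE 128            /* assumed L2 line size, bytes           */

    struct edge_buf {
        double vals[NEDGE];
        /* pad the structure to a whole number of cache lines so that two
         * threads' buffers never share a line (alignment of the array of
         * buffers to a line boundary is assumed). */
        char pad[CACHE_LINE - (NEDGE * sizeof(double)) % CACHE_LINE];
    };

    /* Phase 1: the thread that owns a patch writes only its own buffer. */
    void stage_edge(struct edge_buf *mine, const double *patch_edge)
    {
        memcpy(mine->vals, patch_edge, NEDGE * sizeof(double));
    }

    /* Phase 2: halo strips are filled from the staged copies, so writes
     * that touch potentially shared lines happen in one bulk pass. */
    void fill_halo(double *halo_strip, const struct edge_buf *neighbor)
    {
        memcpy(halo_strip, neighbor->vals, NEDGE * sizeof(double));
    }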
3.2 SC-MICOM on the SGI Origin2000 Machine

We also performed several tests on the latest SGI machine, known as the Origin2000. This architecture differs from the Power Challenge primarily in that it does not use a single bus to interconnect processors but instead uses a hypercube-like network augmented with additional links to increase bisection bandwidth, as shown in Figure 7. In addition, the Origin machines support distributed shared memory via a scalable cache-coherence mechanism implemented with distributed directories.

As shown in Figure 8, a 32-processor Origin machine (unlike the 16-processor Power Challenge) scales nearly linearly. We believe this improved scaling is due to the fact that both the memory bandwidth and the cache coherency directories of the Origin2000 grow as the number of processors is increased. The 2.5 gigaflops performance achieved equals our previous MP-MICOM results on 256 processors of the Cray T3D [24]. This was made possible primarily by the increased single processor performance of the MIPS R10000, which is about a factor of 10 faster than the T3D processor when running MICOM.
Figure 8: A comparison of the performance and scalability of SC-MICOM on the Power Challenge and Origin2000 systems (total gigaflops vs. number of processors, with a perfect linear-scaling reference).

Though we do not expect another factor of 10 increase in the next two years, it is clear that the performance increases in each new generation of microprocessors mean that they will be competitive with other supercomputer architectures of the future.
4 Related work

Aggressive optimization of general ocean circulation codes has a long history due to the extraordinary computational demands in fully eddy-resolving regimes. The original Bryan-Cox GFDL models were structured so that computations were performed on 2D slabs (either horizontal or vertical) in main memory while the full 3D state of the models was kept out on disk or on some other secondary store (such as Cray’s SSD). Semtner and Chervin used this kind of approach to achieve the first eddy-resolving global calculation [16, 17]. Semtner provides a colorful and fascinating history of ocean modeling from its beginnings in the 1960s up to today [18].

Related work in SMP cluster programming includes a team at CSRD [20, 21] who developed a cluster version of the GFDL Bryan-Cox ocean model for the CEDAR architecture, an early SMP cluster architecture built at the University of Illinois. Unlike recent SMP cluster designs, the CEDAR multiprocessors (constructed from Alliant FX/8 machines) had access to a shared global memory across a
multistage network through which messages and data could be passed. The CEDAR version of the GFDL model executed 2D vertical slabs within a single Alliant machine, executing one dimension concurrently. The 4 Alliant machines in CEDAR would then execute multiple slabs concurrently; alternative partitionings in the east-west, north-south, or both dimensions were tried, with the latter approach (a 2D decomposition across the 4 Alliants) yielding the best results. In their approach the CEDAR global memory acts like secondary storage, holding the 3D state of the model. The CEDAR version of the GFDL model uses two levels of parallelism, but unlike SC-MICOM its data decomposition is not self-similar (i.e., the same both within and across SMPs). We believe that a self-similar decomposition exploits more locality and parallelism, especially within an SMP, and that this will be important in the most common supercomputing SMP cluster configurations: a small number of SMP nodes, each with many processors. We were fortunate to work closely with Woodward’s team as they pioneered self-similar data decompositions for SMP clusters with the PPM gas dynamics applications on the SGI Power Challenge array, so that we could learn from both their successes and failures [13]. It should be noted that Woodward’s early successes using the SMP cluster architecture for large hydrodynamics calculations led the U.S. Department of Energy to use PPM as the benchmark for the ASCI (Advanced Strategic Computing Initiative) Blue program procurement.

Many other groups are developing parallel ocean models. A good survey by Ashworth of the state of the art as of 1992 is found in [22]. Ashworth also developed a simple parallel shallow-sea model to benchmark several parallel machines. Wait and Harding [23] also parallelized a shallow-sea model on a Transputer array, using overlapped fake zones to simplify programming (but not to reduce communication frequency) and overlapping communication with computation. Smith, Dukowicz, and Malone [19], with help from C. Kerr at GFDL, developed a parallel version of the Semtner-Chervin ocean model called POP (Parallel Ocean Program) for the CM-2 and CM-5. POP provided a mechanism for executing on both data parallel and message-passing architectures. Smith, Dukowicz, and Malone also worked to improve the barotropic solver in POP, which had become a significant bottleneck on the CM-2 and CM-5 MPP architectures. Though data parallel versions of POP and MICOM were developed for the Connection Machine architecture [19, 4], performance problems with the compilers, combined with a lack of flexibility and of machines supporting the data parallel programming paradigm, have resulted in the demise of that paradigm within the computational oceanography community. Recent work in parallel ocean modeling reported at the 1994 ECMWF conference and elsewhere [9, 10, 24, 25, 26] shows that most investigators are using message-passing and two-dimensional decompositions, resulting in parallel columns of seawater being mapped directly to processors or threads. Oberhuber [25] reports impressive single-node performance and very good speedups in the
parallel version of his isopycnal model.
5 Conclusions and Future Work

Though sophisticated and therefore complex, SC-MICOM has been structured so that the APIs for the physics and dynamics routines and the model timestep are simple and consistent. These routines can be written with little knowledge of the underlying parallel design of the code. We believe this code design paradigm and programming model can be applied to other parallel models and that it represents a good compromise among code complexity, efficiency, and portability.

We believe that SMP clusters and DSM microprocessor-based hardware will provide cost-effective, scalable, user-friendly parallel hardware platforms that will initially compete with but ultimately surpass parallel vector architectures. For scientific codes, single node microprocessor performance is increasing by factors of 4-10 every 3 years. Given appropriate code designs, the SMP advantages in load balancing and efficiency mean that these machines will overtake massively parallel message-based architectures.

Future work on SC-MICOM will include additional performance optimizations for the SGI Origin and other DSM architectures, such as reduction of false sharing. We are attempting to identify performance-critical code features as they relate to the architecture, including the relationship between subdomain size and secondary cache size. As with MP-MICOM, the SC-MICOM code will be transitioned to production mode with the help of the MICOM team at the University of Miami.
References

[1] Matthew O’Keefe, Terence Parr, B. Kevin Edgar, Steve Anderson, Paul Woodward and Henry Dietz, “The Fortran-P Translator: Towards Automatic Translation of Fortran 77 Programs for Massively Parallel Processors,” The Journal of Scientific Programming, vol. 4, no. 1, pp. 1-22, Spring 1995.

[2] Aaron Sawdey, “Using the Parallel MICOM on the Cray T3D and SGI Challenge Multiprocessor,” Technical Report, University of Minnesota, 1994. Available on the WWW at: http://www-mount.ee.umn.edu/~okeefe/micom

[3] P. Woodward, “Interactive Scientific Visualization of Fluid Flow,” IEEE Computer, vol. 26, no. 10, pp. 13-26, October 1993.

[4] Rainer Bleck, Sumner Dean, Matthew O’Keefe and Aaron Sawdey, “A Comparison of Data-Parallel and Message-Passing Versions of the Miami Isopycnic
Coordinate Ocean Model,” Parallel Computing, vol. 21, no. 10, pp. 1695-1720, 1995.

[5] David Dent, Lars Isaksen, George Mozdzynski, Matthew T. O’Keefe, Guy Robinson and Fritz Wollenweber, “Performance Measurements of the ECMWF Integrated Forecast System,” Proceedings of the Sixth ECMWF Workshop on the Use of Parallel Processors in Meteorology, Reading, England, November 1994. Proceedings published by World Scientific Publishers (Singapore) in Coming of Age, edited by G.-R. Hoffman and N. Kreitz, 1995.

[6] Kelvin K. Droegemeier, Ming Xue, Kenneth Johnson, Matthew T. O’Keefe, Aaron C. Sawdey, Gary W. Sabot, Skef Wholey, Kim Mills, and Neng-Tan Lin, “Weather Prediction: A Scalable Storm-Scale Model,” published in High Performance Computing, Addison-Wesley Publishers, ed. by G. Sabot, 1995.

[7] Matthew T. O’Keefe and Aaron C. Sawdey, “Translation Tools for High Performance Computing,” Proceedings of the Les Houches Workshop on High Performance Computing in the Geosciences, Les Houches, France, June 1993. Proceedings published by Kluwer Academic Publishers in High Performance Computing in the Geosciences, ed. by Francois X. Le Dimet, 1995.

[8] Aaron C. Sawdey and Matthew T. O’Keefe, “A Software-level Cray T3D Emulation Package for SGI Shared-memory Multiprocessor Systems.” This software and reference are available on the WWW at: http://www-mount.ee.umn.edu/mcerg/T3Dem.html

[9] Alan Wallcraft, “The Navy Layered Ocean Model,” presentation at the First International Workshop on Software Engineering and Code Design in Parallel Meteorological and Oceanographic Applications, Inn at Semi-ah-moo, Blaine, WA, September 16-19, 1996.

[10] A. Wallcraft and D. R. Moore, “A Scalable Implementation of the NRL Layered Ocean Model,” Tech. Report NRL CR 7323-96-0005, 1996.

[11] Aaron Sawdey and Matthew O’Keefe, “A Modular and Parallel Isopycnic Ocean Model,” presentation at the First International Workshop on Software Engineering and Code Design in Parallel Meteorological and Oceanographic Applications, Inn at Semi-ah-moo, Blaine, WA, September 16-19, 1996.

[12] Robert Numrich, “F--: A Parallel Fortran 90,” presentation at the First International Workshop on Software Engineering and Code Design in Parallel Meteorological and Oceanographic Applications, Inn at Semi-ah-moo, Blaine, WA, September 16-19, 1996. Also available as a technical report from Cray Research, Eagan, MN.
[13] P. R. Woodward, “Perspectives on Supercomputing: Three Decades of Change,” IEEE Computer, invited paper, October 1996 (special 50th anniversary issue), in press.

[14] Robert Numrich, “F--: A Parallel Fortran Language,” technical report, Cray Research, Eagan, MN, April 1994. To be published in The Journal of Scientific Programming, 1996.

[15] John Hennessy and David Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann Publishers, San Francisco, CA, 1996, p. 669.

[16] A. J. Semtner and R. M. Chervin, “A Simulation of the Global Ocean Circulation with Resolved Eddies,” J. Geophys. Res., vol. 93, pp. 15502-15522.

[17] A. J. Semtner and R. M. Chervin, “An Ocean Modeling System for Supercomputer Architectures of the 1990s,” in Climate-Ocean Interaction, M. E. Schlesinger, ed., Kluwer Academic Publishers, Dordrecht, The Netherlands, 1990, pp. 87-95.

[18] A. J. Semtner, “Modeling Ocean Circulation,” Science, vol. 269, no. 5231, 1995.

[19] R. Smith, J. Dukowicz, and R. Malone, “Massively Parallel Global Ocean Modeling,” Tech. Rep. LA-UR-91-2583, Los Alamos National Laboratory, Los Alamos, NM, 1991.

[20] L. DeRose, K. Gallivan, and E. Gallopolous, “Experiments with an Ocean Circulation Model on CEDAR,” CSRD Report No. 1200, Center for Supercomputing Research and Development, University of Illinois at Urbana-Champaign, February 1992.

[21] L. DeRose, K. Gallivan, and E. Gallopolous, “Status Report: Parallel Ocean Circulation on CEDAR,” Parallel Supercomputing in Atmospheric Science, G.-R. Hoffman and T. Kauranne, eds., World Scientific, Singapore, 1993, pp. 157-172. Paper appeared in the Fifth ECMWF Workshop on the Use of Multiprocessors in Meteorology, European Centre for Medium Range Weather Forecasts, Nov. 1992.

[22] M. Ashworth, “Parallel Processing in Environmental Modeling,” ibid., pp. 1-25.

[23] R. Wait and T. J. Harding, “Numerical Software for 3D Hydrodynamic Modeling using Transputer Arrays,” ibid., pp. 453-464.
[24] Aaron C. Sawdey, Matthew T. O’Keefe, Rainer Bleck, and Robert W. Numrich, “The Design, Implementation, and Performance of a Parallel Ocean Circulation Model,” Proceedings of the Sixth ECMWF Workshop on the Use of Parallel Processors in Meteorology, Reading, England, November 1994. Proceedings published by World Scientific Publishers (Singapore) in Coming of Age, edited by G.-R. Hoffman and N. Kreitz, 1995, pp. 523-550.

[25] J. Oberhuber and K. Ketelsen, “Parallelization of an OGCM on the Cray T3D,” ibid., pp. 494-504.

[26] C. Gwilliam, “The OCCAM Global Ocean Model,” ibid., pp. 447-454.

[27] B. Rodriguez, L. Hart, and T. Henderson, “A Library for Portable Parallelization of Operational Weather Forecast Models,” ibid., pp. 148-161.

[28] T. Henderson, C. Baillie, et al., “Progress Toward Demonstrating the Operational Capability of Massively Parallel Processors at the Forecast Systems Laboratory,” ibid., pp. 162-176.