A Performance Evaluation of the Convex SPP-1000 Scalable Shared Memory Parallel Computer

Thomas Sterling, Kevin Olson, Daniel Savarese, Clark Mobarry, Phillip Merkey, Peter MacNeice, Bruce Fryxell

Center of Excellence in Space Data and Information Sciences, NASA Goddard Space Flight Center; Department of Computer Science, University of Maryland; Hughes STX; Institute for Computational Science and Informatics, George Mason University; NASA/GSFC Space Data Computing Division, Greenbelt, Maryland

Abstract

The Convex SPP-1000 is the first commercial implementation of a new generation of scalable shared memory parallel computers with full cache coherence. It employs a hierarchical structure of processing, communication, and memory name-space management resources to provide a scalable NUMA environment. Ensembles of 8 HP PA-RISC 7100 microprocessors employ an internal cross-bar switch and a directory based cache coherence scheme to provide a tightly coupled SMP. Up to 16 processing ensembles are interconnected by four ring networks incorporating a full hardware implementation of the SCI protocol, for a full system configuration of 128 processors. This paper presents the findings of a set of empirical studies using both synthetic test codes and full applications for the Earth and space sciences to characterize the performance properties of this new architecture. It is shown that overhead and latencies of global primitive mechanisms, while low in absolute time, are significantly more costly than similar functions local to an individual processor ensemble.

1 Introduction

The Convex SPP-1000 is the first of a new generation of scalable shared memory multiprocessors incorporating full cache coherence. After the innovative but ill-fated KSR-1 [18], parallel computing system vendors briefly pursued more conservative approaches to harnessing the power of state-of-the-art microprocessor technology through parallel system configurations. But issues of ease-of-programming, portability, scalability, and performance have led to plans by more than one vendor to offer full global name-space


scalable architectures including the mechanisms for maintaining consistency across distributed caches. This paper presents the most in-depth performance study yet published of the Convex SPP-1000.

Extending beyond bus-based systems of limited scaling employing snooping protocols such as MESI, this emerging generation of multiprocessors exhibits hierarchical structures of processing, memory, and communications resources. Correctness of global variable values across the distributed shared memories is supported by means of levels of directory-based reference trees for disciplined data migration and copying. When implemented in hardware, such mechanisms greatly reduce the overhead and latencies of global data access. However, in spite of such sophisticated techniques, cache miss penalties in this class of parallel architecture can be appreciably greater than for their workstation counterparts, imposing a NUMA (Non-Uniform Memory Access) execution environment. The Convex SPP-1000 provides an opportunity to better understand the implications of these architectural properties as they relate to performance for real-world applications. This paper presents findings from studies designed to examine both the temporal costs of this new architecture's low level mechanisms and its scaling properties for a number of complete scientific parallel programs.

The Convex SPP-1000 architecture reflects three levels of structure to physically and logically integrate as many as 128 microprocessors into a single system. At the lowest level, pairs of HP PA-RISC processors are combined with up to 32 Mbytes of memory, two communications interfaces, and memory and cache management logic to form the basic functional units of the system. Sets of four of these functional units are combined by a 5-port cross-bar switch into a tightly-coupled cluster, or hypernode. Within a hypernode, all memory blocks are equally accessible to all processors and a fully hardware-supported direct-mapped directory-based protocol provides cache coherence. Up to 16 hypernodes are integrated into a single system by means of four parallel ring interconnects. The SPP-1000 is the first commercial system to employ full hardware support for the SCI (Scalable Coherent Interface) [17] protocol to manage distributed reference trees for global cache coherence. Each functional unit of a hypernode connects to one of the four global ring interconnects and provides cache copy buffer space in its local memory. The use of a mix of CMOS and GaAs technology based components and strong hardware support for basic mechanisms push the state-of-the-art in what can be achieved with this class of parallel architecture.

The studies presented in this paper have yielded detailed measurements exposing the behavior properties of the underlying system elements and the global system as a whole. Synthetic experiments were performed to reveal the times required to perform such critical primitive operations as synchronization, data access, and message passing. Examples of science problems from the Earth and space science disciplines were used to characterize the scaling properties of the global system. These ranged from problems with

regular static data structures to applications with irregular dynamic data structures. Both shared memory parallel thread and message passing programming/execution styles were examined. The results show a range of behaviors that generally favor this class of architecture but which expose problems due to its NUMA nature that are not resolved with current techniques.

The next section of this paper provides a more detailed description of the Convex SPP-1000 architecture, sufficient for understanding the experimental results that follow. A brief discussion of the software environment used for programming and resource management is presented in Section 3. It must be noted that the Convex software environment is in a state of flux, with improvements incorporated on almost a weekly basis. Section 4 presents the findings of a suite of experiments performed to expose the characteristics of the primitive mechanisms used to conduct and coordinate parallel processing on the Convex multiprocessor. These findings are followed in Section 5 with the results of scaling studies performed on four Earth and space science application codes. A detailed discussion of the findings and their implications for future architecture development is given in Section 6, followed by a summary of conclusions drawn from this evaluation in Section 7.

2 A Scalable Shared Memory Architecture

The Convex SPP-1000 is a scalable parallel computer employing HP PA-RISC 7100 microprocessors in a hierarchical NUMA structure. The SPP-1000 architecture incorporates full hardware support for global cache coherence using a multi-level organization that includes the first commercial implementation of the emerging SCI protocol. The SPP-1000 mixes CMOS and GaAs technologies to provide short critical path times for custom logic and to leverage cost-effective mass market memory, processor, and interface components. This section describes the Convex SPP-1000 architecture in sufficient detail to understand the empirical results presented later in the paper.

2.1 Organization

The Convex SPP-1000 architecture is a three-level structure of processors, memory, communication interconnects, and control logic as shown in Figure 1. The lowest level is the functional unit comprising two processors with their caches, main memory, communications interfaces, and logic for address translation and cache coherence. The second level combines four functional units in a tightly coupled cluster called a hypernode using a five port cross-bar switch. The fifth path is dedicated to I/O. The third and top level of the SPP-1000 organization employs four parallel ring networks to integrate up to sixteen hypernodes into a single system.

Figure 1: Convex SPP-1000 System Organization

Cache coherence is implemented within each hypernode and across hypernodes. Within a hypernode, a direct mapped directory based scheme is used. Between hypernodes, a distributed linked list directory based scheme based on the SCI (Scalable Coherent Interface) protocol is fully implemented in high speed hardware logic.

2.2 HP Processor

The SPP-1000 system architecture derives its compute capability from HP PA-RISC 7100 [15] microprocessors. The processor architecture is a RISC based design emphasizing simple fixed sized instructions for effective pipelining of execution. This 100 MHz clocked chip includes 32 general purpose registers of 32 bit length and 32 floating point registers which may be used singly in 32 bit format or in pairs for 64 bit precision, both in IEEE floating-point format. A number of additional registers for control and interrupt handling are included. The HP 7100 supports virtual memory with on-chip translation provided by a Translation Lookaside Buffer. Caches are external with separate 1 Mbyte data and instruction caches and a cache line size of 32 bytes.

2.3 Functional Unit

The functional unit of the SPP-1000 integrates all the basic elements required to perform computation and from which scalable systems can be assembled. Each functional unit includes two HP PA-RISC processors with their respective instruction and data caches. Two physical memory blocks, each of up to 16 Mbytes, are located on every functional unit. This provides local, hypernode, and global storage as well as global cache buffer space. The CCMC element of the functional unit provides all hardware logic required to support the multilevel cache consistency protocols at the hypernode and global levels. An additional agent element of the functional unit manages


the communications and address translation, again through hardware logic. The functional unit interfaces to the crossbar switch within its host hypernode and to one of four ring networks connecting it to all other hypernodes within the system.

2.4 Hypernode

Local parallelism is achieved through hypernodes of four functional units. They are tightly coupled by means of a full 5 port crossbar switch. Four of the ports are connected, each to a single interface of one of the functional units. The fifth port is used to interface to I/O channels external to the system. Access to any of the memory banks within a hypernode by any of the functional units of the hypernode takes approximately the same time. The hypernode supports its own cache coherence mechanism. It is a directory based scheme using direct mapped tags similar to the experimental DASH system [19]. Each cache line is 32 bytes in length and has sufficient tag bits to indicate the local processors that maintain active copies in their dedicated caches. The tags also indicate which other hypernodes in the system share copies of the cache line. Access to global resources from the hypernode is through the second communications interface on each functional unit, as described in the next subsection.

2.5 SCI Global Structure

The principal architectural features providing scalability are the global interconnect and cache coherency mechanisms. Four ring networks are used to connect all (up to 16) hypernodes. Each hypernode is connected to all four ring networks. Within a hypernode, one ring network is interfaced to one of the four functional units by that unit's second high speed interface logic. That ring network connects one quarter of the system's total memory together as the aggregate of the memory on the functional units it joins in the different hypernodes. A cache buffer is partitioned out of the functional unit memory to support cache line copies from the other hypernode memories on the same global ring. Thus an access to remote memory from a processor first goes through the hypernode crossbar switch to the functional unit within that hypernode associated with the ring on which the requested cache line resides in the external hypernode. The request then proceeds from that functional unit out onto the ring interconnect and into the remote hypernode and functional unit holding the sought after data. Cache coherency is provided by a distributed reference tree. Each remote cache line that is shared by one or more processors in another hypernode is copied to the cache buffer in the local hypernode on the functional unit associated with the appropriate ring interconnect. Thus, future accesses to that cache line from within the hypernode need only go to the correct part of the global cache buffer. Writes result in


the propagation of a cache line invalidation to remote hypernodes where necessary. All management of the globally distributed cache is conducted according to practice established by the SCI protocol.

2.6 Memory Organization

System memory is organized to provide ease-of-use and adequate control for optimization. The physical memory can be configured to provide several levels of latency that may map to the application programs' memory access request profile. Where possible, local memory with low latency and no sharing management overhead can provide the fastest response time on a cache miss. Memory partitioned to serve processors anywhere in a given hypernode but precluding access from external hypernodes gives fast response through the cross-bar switch and can be shared among all processors within the hypernode. Finally, memory partitioned such that it can be shared by all hypernodes may be organized in two ways. Either the memory may be defined in blocks which are hosted by a single hypernode, or the memory is uniformly distributed across the collection of hypernodes. In the former case, the memory block is interleaved across the functional units in the host hypernode. In the latter case, the memory is interleaved across hypernodes as well as functional units within each participating hypernode. Cache throughput permits on average one data access and one instruction fetch per cycle (10 nsec). A cache miss that can be serviced by memory local to the requesting functional unit, by memory shared among processors within a given hypernode, or by the global cache buffer within the hypernode exhibits a latency of approximately 50 to 60 cycles depending on cross-bar switch and memory bank conflicts. The Convex SPP-1000 provides full translation from virtual to physical memory addresses.
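These latency levels can be observed directly with a simple microbenchmark. The following sketch is a generic pointer-chasing probe, not a Convex-supplied tool; the buffer size, stride, and iteration count are illustrative assumptions. Because each load depends on the previous one, the measured time per iteration approximates the miss latency of whatever memory class the buffer was placed in.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

int main(void)
{
    const size_t n = 1 << 22;              /* 4M pointers: far larger than the 1 Mbyte cache */
    const long iters = 10 * 1000 * 1000;
    void **chain = malloc(n * sizeof *chain);
    if (chain == NULL) return 1;

    /* Build a single cycle with a large, odd stride to defeat spatial locality. */
    for (size_t i = 0; i < n; i++)
        chain[i] = &chain[(i + 4099) % n];

    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    void **p = &chain[0];
    for (long k = 0; k < iters; k++)
        p = (void **)*p;                    /* dependent load: latency cannot be overlapped */
    gettimeofday(&t1, NULL);

    double usec = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
    printf("average load latency: %.1f ns (%p)\n", 1e3 * usec / iters, (void *)p);
    free(chain);
    return 0;
}
```

Placing the buffer in functional-unit local, hypernode shared, or globally shared memory would expose the different latency levels described above.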

3 System Software and Programming Model

The underlying system software support and programming interface is a determining factor in the efficient use of any architecture. If a system is difficult to use, its potential performance gains become outweighed by the increased development time required to exploit the system. The Convex SPP-1000 takes a novel approach towards its user interface. Unlike the Cray T3D [11], there is no host system between the user and the machine. Users have direct access to the system, which appears as a single entity running one operating system. In actuality, each hypernode runs its own operating system kernel that manages interactions with other hypernodes, such as job scheduling and semaphore accesses. The operating system handles both process and thread resource management. System utilities allow users to override operating system defaults that control where processes will execute and the type and maximum number of threads to use in parallel

programs. The Convex SPP-1000 supports both the message passing and shared memory programming models to achieve parallelism in programs.

3.1 Message Passing

Message passing on the SPP-1000 is performed using the Convex implementation of Parallel Virtual Machine (PVM) [26, 9], which is perhaps the most widely used message passing library today. Distributed systems running PVM, such as a cluster of workstations, run one PVM daemon on each node (workstation or processor) to coordinate message buffer management, task synchronization, and the sending/receiving of messages. Because the SPP-1000 is a shared memory system, ConvexPVM does not allocate one daemon per processor. Instead one PVM daemon runs on the entire machine, coordinating intrahypernode and interhypernode communication. This model minimizes the interference present in purely distributed systems caused by daemon processes running on the same processors as PVM tasks. To reduce the cost of message passing, ConvexPVM allows tasks to use a shared message buffer space, instead of forcing each task to allocate buffers in private memory. A sending process packs data into a shared memory buffer that the receiving process accesses after the send is complete. This process does not require any interaction with the PVM daemon [9] and reduces the number of copies necessary to transfer data. Because the cost of passing messages with PVM on the SPP-1000 is significantly less than that of transmitting them across a local area network (see figure 4), PVM tasks function much like coarse-grained threads. In fact, as will be seen in section 5, a PVM implementation of an application can achieve almost one half the performance of a shared memory implementation.
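As a concrete illustration, the fragment below shows the kind of master/worker exchange this model supports, using only standard PVM calls; on the SPP-1000 the pack/send path is serviced out of ConvexPVM's shared message buffers rather than through per-processor daemons. The executable name, message tag, and buffer size are illustrative assumptions, not taken from the paper.

```c
#include <stdio.h>
#include "pvm3.h"

#define TAG_WORK 1                          /* arbitrary message tag */

int main(void)
{
    int ptid = pvm_parent();
    static double chunk[4096];

    if (ptid == PvmNoParent) {              /* master task */
        int worker;
        /* Spawn one copy of this same executable (assumed to be installed
         * under the name "sendrecv") as the worker task. */
        pvm_spawn("sendrecv", NULL, PvmTaskDefault, "", 1, &worker);
        for (int i = 0; i < 4096; i++) chunk[i] = (double)i;

        pvm_initsend(PvmDataDefault);       /* message buffer lives in shared memory */
        pvm_pkdouble(chunk, 4096, 1);
        pvm_send(worker, TAG_WORK);
    } else {                                /* spawned worker task */
        pvm_recv(ptid, TAG_WORK);
        pvm_upkdouble(chunk, 4096, 1);
        printf("worker received 4096 doubles, last = %f\n", chunk[4095]);
    }
    pvm_exit();
    return 0;
}
```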

3.2 Shared Memory

Although the SPP-1000 provides relatively efficient message passing support, adopting a shared memory programming style more efficiently exploits the system's parallelism. Convex currently provides an ANSI compliant C compiler with parallel extensions and a Fortran compiler supporting Fortran 77, as well as a subset of Fortran 90, with future plans to support C++. Programs can run in parallel on the SPP-1000 by spawning execution threads managed by the operating system and sharing the same virtual memory space. Threads can be created either by using the vendor's low level Compiler Parallel Support Library (CPSlib), which provides primitives for thread creation and synchronization, or by using a high level parallel directive interface used by the compilers to generate parallel code. The compilers can automatically parallelize code they determine can run in parallel, but we have found that


they do not do a good job of weighing the overheads of parallelism in making their decisions. It is unreasonable to expect a compiler to make perfect decisions in choosing what code to parallelize, which is why it is important for there to be extensive support for user specification of parallelism. The vendor provided compilers support an extensive range of parallel directives that allow them to efficiently parallelize code. They also allow some flexibility in the allocation of memory and placement of data structures in the memory space of executing threads. Five classes of virtual memory are available to programs:

Thread Private - Data structures in memory of this type are private to each thread of a process. Each thread possesses its own copy of the data structure that maps to a unique physical address. Threads cannot access other threads' thread private data.

Node Private - Data structures in memory of this type are private to all the threads running on a given hypernode. Each hypernode possesses its own copy of the data structure, which is shared by all the threads running on it.

Near Shared - Data structures in memory of this type are assigned one virtual address mapping to a unique physical address located in the physical memory of a single hypernode. All threads can access the same unique copy of the data structure.

Far Shared - Data structures in memory of this type are assigned one virtual address. However, the memory pages of the data are distributed round robin across the physical memories of all the hypernodes.

Block Shared - This is identical to Far Shared memory except that the programmer can specify a block size that is used as the unit for distributing data instead of the page size.

Classes of parallelism supported include both synchronous and asynchronous threads. Synchronous threads are spawned together and join in a barrier when they finish; the parent thread cannot continue execution until all children have terminated. Asynchronous threads continue execution independent of one another; the parent thread continues to execute without waiting for its children to terminate. Ordering of events and mutual exclusion can be managed with high level compiler directives called critical sections, gates, and barriers, which allow the compiler to generate appropriate object code that allocates and properly coordinates the use of semaphores. Although the abstraction of low level parallel mechanisms through the use of compiler directives facilitates the programming process, the programmer still has to have a more than cursory understanding of the memory hierarchy to write efficient programs. Parallel loops can achieve marked performance gains just by making scalar variables thread private to eliminate cache thrashing.
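The point about thread-private scalars can be made concrete with a generic example. The sketch below uses POSIX threads rather than the Convex directives or CPSlib, whose exact interfaces are not reproduced here, so the threading calls are an assumption; the pattern it illustrates is simply that each thread accumulates into a private scalar and touches the shared cache line only once, instead of updating a shared array element on every iteration and thrashing the line between caches.

```c
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 8
#define N        (1 << 20)

static double data[N];

/* One accumulator slot per thread.  Adjacent doubles share a cache line
 * (32 bytes on the SPP-1000), so updating these slots inside the inner loop
 * would bounce the line between processor caches on every iteration. */
static double shared_sum[NTHREADS];

static void *partial_sum(void *arg)
{
    long id = (long)arg;
    double local = 0.0;                 /* thread-private scalar */
    for (long i = id; i < N; i += NTHREADS)
        local += data[i];
    shared_sum[id] = local;             /* the shared line is written only once */
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    double total = 0.0;

    for (long i = 0; i < N; i++) data[i] = 1.0;
    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&tid[t], NULL, partial_sum, (void *)t);
    for (long t = 0; t < NTHREADS; t++) {
        pthread_join(tid[t], NULL);
        total += shared_sum[t];
    }
    printf("total = %f\n", total);
    return 0;
}
```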

Figure 2: Cost of Fork-Join

4 Global Control Mechanisms

A set of synthetic codes was written to measure the temporal cost of specific parallel control constructs. These measurements were conducted across both hypernodes, first scaling with high locality (the first 8 threads are on one hypernode) and next with uniform distribution (each hypernode has an equal number of threads running on it). The accuracy of the measurements was limited by the resolution of the timing mechanisms available and the intrusion resulting from their use. The multitasking nature of resource scheduling also proved to be a source of error, prompting the execution of many experimental runs to expose the true system behavior. Depending on the measurement of interest, either averages of the combined measurements or the minimum values observed were used.
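To make the measurement procedure concrete, the following sketch times an empty fork-join over many trials and keeps the minimum, which is how scheduler noise was filtered in these experiments. It uses POSIX threads and gettimeofday purely for illustration; the actual synthetic codes used the Convex thread primitives and timers, whose interfaces are not reproduced here.

```c
#include <pthread.h>
#include <stdio.h>
#include <sys/time.h>

/* Empty worker: only thread creation and join are being timed. */
static void *noop(void *arg) { (void)arg; return NULL; }

static double now_usec(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1e6 + tv.tv_usec;
}

int main(void)
{
    enum { MAXTHREADS = 16, TRIALS = 1000 };
    pthread_t tid[MAXTHREADS];

    for (int nthr = 1; nthr <= MAXTHREADS; nthr++) {
        double best = 1e30;
        for (int trial = 0; trial < TRIALS; trial++) {
            double t0 = now_usec();
            for (int t = 0; t < nthr; t++)
                pthread_create(&tid[t], NULL, noop, NULL);
            for (int t = 0; t < nthr; t++)
                pthread_join(tid[t], NULL);
            double dt = now_usec() - t0;
            if (dt < best) best = dt;       /* minimum filters multitasking noise */
        }
        printf("%2d threads: %8.1f usec fork-join\n", nthr, best);
    }
    return 0;
}
```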

4.1 Fork-Join Mechanism

Figure 2 shows the fork-join time in microseconds as a function of the number of threads spawned. The graph shows two plots that highlight the increased cost of a fork-join across two hypernodes. The high locality plot demonstrates the cost of the fork-join where the first 8 threads are spawned on the same hypernode and subsequent threads are spawned on the remaining hypernode. The uniform distribution plot shows the cost of the fork-join where an equal number of threads are spawned on each hypernode (except in the 1 thread case). The principal observations to be garnered from Figure 2 are:



- The fork-join time is proportional to the number of threads spawned with high locality across a single hypernode. Moving from 2 to 8 processors, each additional pair of threads costs approximately 10 microseconds.

- The fork-join time is roughly proportional to the number of threads spawned with uniform distribution between hypernodes. Moving from 2 to 16 processors, each additional pair of threads costs approximately 20 microseconds.

- A significant overhead, on the order of 50 microseconds, is incurred once threads start to be spawned on two hypernodes.

Figure 3: Cost of Barrier Synchronization

4.2 Barrier Synchronization

Figure 3 reports two metrics for both the high locality and uniform distribution cases:

Last In - First Out: the minimum time measured from when the last thread enters the barrier to when the first thread afterward continues.

Last In - Last Out: the minimum time measured from when the last thread enters the barrier to when the last thread continues.

The results from our earlier study of only one hypernode of the SPP-1000 [24] are also shown. Both the previous and current study used the same experimental method. A time-stamp was taken before each thread entered the barrier and after each thread exited the barrier. From this data an approximation of the barrier costs could be derived. All timing data have been corrected for the overhead involved in performing the measurements. Figure 3 shows that the minimum time for a barrier (last in - first out) involving more than one thread is approximately 3.5 microseconds on a single

hypernode, incurring an additional cost of 1 microsecond once threads on a second hypernode become involved. The release time of the barrier, the total time to continue all suspended threads, possesses a more complex behavior. In the high locality case on just one hypernode, the barrier appears to cost roughly 2 microseconds per thread beyond the second thread involved. Once threads on a second hypernode become involved, there is an additional penalty, as evidenced by both the high locality and uniform distribution cases. This behavior may be caused by the implementation of the barrier primitive, which has each thread decrement an uncached counting semaphore [4] and then enter a while loop, waiting for a shared variable to be set to a particular value. The last thread to enter the barrier sets the shared variable to the expected value, thus releasing the other threads from their spin waiting. Because this shared variable is cached by all of the threads, coherency mechanisms are invoked when the final thread alters its value. This incurs a variable temporal cost depending on the status of the system reference tree. Figure 3 would seem to reflect the increased cost of maintaining coherency and updating the reference tree as a greater number of processors become involved. The behavior of the uniformly distributed case is accounted for by the parallel updates of internal system data structures of the two hypernodes.
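The barrier structure just described can be sketched in portable C. The version below is a generic sense-reversing spin barrier, not the Convex primitive itself: the atomic decrement uses a GCC-style builtin in place of the uncached counting semaphore, and the thread count is an assumption. It shows why the final write to the shared flag is the point at which the coherence machinery, and hence the extra inter-hypernode cost, comes into play.

```c
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 8

/* Shared state: a count of threads still expected and a release flag that
 * every spinning thread keeps in its cache.  The last arrival's write to the
 * flag invalidates all of those cached copies. */
static volatile int count = NTHREADS;
static volatile int flag  = 0;

static void barrier_wait(void)
{
    int my_generation = flag;
    /* Atomic decrement standing in for the uncached counting semaphore. */
    if (__sync_sub_and_fetch(&count, 1) == 0) {
        count = NTHREADS;                   /* reset for the next barrier */
        flag  = 1 - my_generation;          /* release the spinners */
    } else {
        while (flag == my_generation)
            ;                               /* spin on the locally cached flag */
    }
}

static void *worker(void *arg)
{
    long id = (long)arg;
    barrier_wait();
    printf("thread %ld released\n", id);
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&tid[t], NULL, worker, (void *)t);
    for (long t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);
    return 0;
}
```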

4.3 Message Passing

The impact of the global interconnect is examined from the point of view of message passing. For reasons of measurement accuracy, the experiments performed measure the time it takes to send a message out to a receiving processor and to get the message back. The time measured does not include the cost of building the message in the first place. The round trip times were measured between a pair of processors on a single hypernode and then again on a pair of processors on separate hypernodes. The experimental results of measurements taken for different message sizes are shown in Figure 4. For messages under 8K bytes in size, the round trip message passing time is approximately constant for both local and global messages. Local messages take about 30 microseconds round trip while messages between hypernodes over the SCI interconnect require approximately 70 microseconds, for a ratio of 2.3 between global and local message passing. This is an excellent behavior on the part of the global mechanisms. From the standpoint of message passing, the SPP-1000 can be considered as truly scalable. This of course does not include possible compounding factors such as contention, which would arise in a more heavily burdened system. However, earlier experiments on a single hypernode showed little degradation as message traffic was increased appreciably [24]. Degradation is observed as the message size exceeds 8K bytes. As

the message size, measured in pages, doubles, we find a substantial increase in message transfer times for both the local and global cases. Some complex behaviors are seen and an exact explanation for the rate of increase in transfer time is not immediately clear.

Figure 4: Cost of Round Trip Message Passing
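A round-trip measurement of this kind can be structured as a simple PVM ping-pong, sketched below with standard PVM calls; message sizes, tags, the executable name, and the single-trial timing are illustrative assumptions (the reported figures were derived from many repetitions, with message-building cost excluded as described above).

```c
#include <stdio.h>
#include <sys/time.h>
#include "pvm3.h"

#define TAG_PING 2
#define MAXBYTES (1 << 20)

static double now_usec(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1e6 + tv.tv_usec;
}

int main(void)
{
    int ptid = pvm_parent();
    static char buf[MAXBYTES];

    if (ptid == PvmNoParent) {                  /* timing master */
        int peer;
        pvm_spawn("pingpong", NULL, PvmTaskDefault, "", 1, &peer);
        for (int size = 64; size <= MAXBYTES; size *= 2) {
            pvm_initsend(PvmDataDefault);
            pvm_pkbyte(buf, size, 1);           /* build the message first ... */
            double t0 = now_usec();             /* ... then start the clock */
            pvm_send(peer, TAG_PING);
            pvm_recv(peer, TAG_PING);           /* wait for the echo */
            double dt = now_usec() - t0;
            pvm_upkbyte(buf, size, 1);
            printf("%8d bytes: %10.1f usec round trip\n", size, dt);
        }
    } else {                                    /* echo task: same size schedule */
        for (int size = 64; size <= MAXBYTES; size *= 2) {
            pvm_recv(ptid, TAG_PING);
            pvm_upkbyte(buf, size, 1);
            pvm_initsend(PvmDataDefault);
            pvm_pkbyte(buf, size, 1);
            pvm_send(ptid, TAG_PING);
        }
    }
    pvm_exit();
    return 0;
}
```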

5 Applications

In this section we investigate the performance of the SPP-1000 on four different applications.

5.1 PIC Introduction

Particle-in-cell (PIC) codes are extensively used in numerical modelling of plasmas. Their strength is studying kinetic behaviour of collisionless plasmas with spatial dimensions in the range from tens to thousands of Debye lengths. They have also on occasion been used in N-body astrophysical simulations and are very closely related to vortex methods used in CFD. PIC codes attempt to mimic the kinetic behaviour of the charged particles which constitute the plasma. They do this by tracking the motion of a large number of representative particles as they move under the influence of the local electromagnetic field, while consistently updating the field to take account of the motion of the charges. Also known as particle-mesh (PM) codes, they have a hybrid Eulerian/Lagrangian character. The particle data is distributed in proportion to the mass density and in that sense the codes are Lagrangian. However, for efficiency reasons the forces acting on the particles are calculated by solving field equations using finite difference approximations on a fixed mesh, which adds an Eulerian character. For statistical accuracy reasons we would like to use as many particles as possible. Machine limitations have dictated particle numbers in the range of 10^5 to 10^7. The particles are finite sized charge clouds, not point particles. The use of point particles would introduce too much short wavelength noise in the electric field. These charge clouds are comparable in size to a single cell of the mesh used to solve the field equations. In this paper we report on a 3D electrostatic PIC code running on the Convex SPP-1000 at Goddard Space Flight Center, using both shared memory and message passing programming styles, and for reference purposes quote the performance of the same application on one processor of a Cray YMP-C90.

5.1.1 Plasma PIC Code Definition

For each of the N particles in the model we must solve the equations of motion,

$$\frac{d}{dt}\,\mathbf{x}_i = \mathbf{v}_i \qquad (1)$$

and

$$\frac{d}{dt}\,\mathbf{v}_i = \frac{1}{m_i}\,\mathbf{F}(\mathbf{x}_i) \qquad (2)$$

where subscript $i$ denotes the $i$th particle, and $\mathbf{x}$, $\mathbf{v}$, $\mathbf{F}$ and $m$ are position, velocity, force and mass respectively. In an electrostatic code the force on particle $i$ is

$$\mathbf{F}_i = q_i\,\mathbf{E}(\mathbf{x}_i) \qquad (3)$$

where $q_i$ is the particle's charge and $\mathbf{E}$ the electric field. The electric field $\mathbf{E}$ is obtained by solving Poisson's equation,

$$\nabla^2 \phi = -\rho(\mathbf{x}) \qquad (4)$$

for the electric potential $\phi$, where $\rho$ is the charge density, and then evaluating its gradient,

$$\mathbf{E}(\mathbf{x}) = -\nabla\phi. \qquad (5)$$

At each timestep the code has to perform the following basic steps:

1. Compute the charge density. This is a scatter with add. The model particles are small but finite sized charge clouds which contribute to the charge density of any grid cells with which they overlap.

2. Solve for $\phi$ and then $\mathbf{E}$ at the grid points.

3. Interpolate $\mathbf{E}$ to the particle locations in order to estimate the force acting on each particle. This is a gather step.

4. Push the particles, i.e., integrate equations (1) and (2) over $\Delta t$ for each particle. (A minimal sketch of the deposit, gather, and push steps is given below.)
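The sketch below illustrates the deposit (step 1), gather (step 3), and push (step 4) operations in one dimension with linear cloud-in-cell weighting and a leapfrog update. It is not the production code, which is 3D and uses finite-sized clouds spanning a full cell; the grid size, particle count, and weighting scheme here are illustrative assumptions.

```c
#include <math.h>

#define NG 64                  /* grid cells (1D, periodic domain of length L) */
#define NP 4096                /* particles */

/* Step 1: deposit particle charge onto the mesh -- a scatter with add.
 * Linear (cloud-in-cell) weighting splits each particle's charge between
 * the two nearest grid points. */
void deposit_charge(const double x[NP], double q, double rho[NG], double dx)
{
    for (int j = 0; j < NG; j++) rho[j] = 0.0;
    for (int i = 0; i < NP; i++) {
        double s = x[i] / dx;
        int    j = (int)floor(s);
        double w = s - j;                       /* fraction toward cell j+1 */
        rho[j % NG]       += q * (1.0 - w) / dx;
        rho[(j + 1) % NG] += q * w / dx;
    }
}

/* Steps 3 and 4: gather E back to each particle and advance it with the
 * second order leapfrog scheme (velocities half a step out of phase). */
void push(double x[NP], double v[NP], const double E[NG],
          double qm, double dt, double dx, double L)
{
    for (int i = 0; i < NP; i++) {
        double s  = x[i] / dx;
        int    j  = (int)floor(s);
        double w  = s - j;
        double Ei = (1.0 - w) * E[j % NG] + w * E[(j + 1) % NG];  /* gather */
        v[i] += qm * Ei * dt;                   /* kick */
        x[i] += v[i] * dt;                      /* drift */
        x[i]  = fmod(fmod(x[i], L) + L, L);     /* periodic boundary */
    }
}
```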

Figure 5: Flowchart for a Particle-Mesh code. [Boxes: load initial particle positions and velocities; then, each timestep ∆t: deposit particle charge on mesh (a scatter operation), solve for E on the mesh, interpolate E to particle positions and compute F (a gather operation), and solve the particle equations of motion.]

The flow chart for this scheme is shown in figure 5. In combination these four steps of the PIC algorithm involve computation and communication between two different data structures. The field data has the qualities of an ordered array in the sense that each element has specific neighbors. The particle data has the qualities of a randomly ordered vector, in which element i refers to particle i, and no element has any special relationship to its neighbors in the vector. Steps 2 and 4 are parallelizable in rather obvious ways, since they involve only simple and completely predictable data dependencies, and do not couple the two data structures. Steps 1 and 3, however, do couple the two data structures, with complicated and unpredictable data dependencies which evolve during the simulation. It is these steps which invariably dominate the execution times of parallel PIC codes. The particle equations of motion are advanced in time using an efficient second order accurate leapfrog integration scheme. Periodic boundary conditions were assumed in all 3 directions. To obtain field solutions, equation (4) was solved by transforming it into Fourier wavenumber space using FFT routines called from the system VECLIB, solving the resulting algebraic equation in wavenumber space, and then reversing the transforms. The test problem run was of a monoenergetic electron beam propagating through a population of plasma electrons with a Maxwellian velocity distribution. The beam was distributed throughout the physical domain and had a

number density roughly 1/10th the density of the background electron population. We ran calculations of two different sizes. Each calculation began with 8 plasma electrons and 1 beam electron in each mesh cell, and we varied the problem memory requirements by varying the mesh size. These sizes are listed in the table below. The calculations were run through 500 timesteps. For reference purposes the performance results for these calculations on 1 processor of a Cray YMP-C90 are listed in table 1. Performance results for the PVM message passing and shared memory versions on the Convex SPP-1000 are given in figure 6.

Figure 6: Time to solution and speed-up for the two PIC calculations described in the text. The curved solid lines are for the shared memory version and the dashed lines for the PVM version. For reference the CPU time recorded by the hardware performance monitor on 1 processor of the C90 is shown by the flat solid line.

Table 1: Performance on 1 C90 processor.

Mesh           No. of particles   Mflop/s   Total CPU Time
32 x 32 x 32   294912             355       112.9
64 x 64 x 32   1179648            369       436.4


Each particle requires 11 data words to specify its properties. This accounts for most of the memory required by the shared memory model. The size of the smallest problem was chosen so that it would barely fill the cache on the 16 processor machine. The shared memory version consistently outperforms the PVM version, as we would expect.

5.2 Finite Element Simulations of Fluids

The Finite Element Method (FEM) has been extensively used to model solid materials and fluid flow. The FEM simulates a physical system by partitioning the physical world into a collection of finite elements (triangles or tetrahedra). Because of its versatility, it is lately gaining popularity in the Earth and space science community. Some examples of its applications are numerical weather prediction, the simulation of solar wind interaction with planetary atmospheres, and the modeling of magnetohydrodynamic turbulence. Several features make the FEM attractive. Its unstructured nature allows the modeling of relatively intricate geometries. The simulation of realistic boundary conditions is also possible and relatively simple. In addition, the FEM is naturally suited for adaptive mesh refinement, a technique by which high spatial resolution is dynamically applied only in the regions where it is determined to be necessary, thereby enabling the efficient use of available resources.

5.2.1 FEM Algorithm Description

To many people, the distinguishing characteristic of the FEM is the fairly arbitrary and adaptable spatial meshes that it allows. The FEM partitions space into small non-overlapping regions called elements. An element is typically a triangle in 2D or a tetrahedron in 3D. The vertices of the elements are co-located with the points in space where the prognostic quantities are stored. The meshes of elements and points can be quite arbitrary in principle. However, meshes with elements of fairly uniform size and compact shape produce the best numerical results. The computational phases are divided into computation on elements and computation on points, with communication phases separating them. The computation on elements involves the evaluation of spatial derivatives. The computation on points involves the aggregation of vertex contributions and the evaluation of transport fluxes for the prognostic quantities.

A FEM application needs to address issues of irregular global communications among processors. There are three classes of global communications used in the discrete evolution equations. First, global maxima (or minima) are found for quantities such as the largest permissible time step. Second, there are global gatherings of data from points of the mesh to the vertices of the elements. Third, there are global aggregations of data (add, maximum, or minimum) from vertices of the elements to the points of the mesh. The second and third classes of global communication are critical to the performance of any parallel implementation; in particular, the third class is called the "scatter-add" problem (sketched below).

The prototype FEM application currently solves two-dimensional gas dynamics problems with general boundary conditions. A simple first-order in space (lumped mass matrix) and time, unstructured, 2D, FEM, gas dynamics code written in Fortran-77 with compiler directives was chosen as the test application to allow quick turn around time in source text modifications. Morton ordering [27] was performed on the points and elements to enhance cache locality for the gathers and scatters.
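A minimal serial form of the scatter-add is shown below for triangular elements. The connectivity and array names are illustrative assumptions; the point is that different elements touch the same mesh points, so a naive parallel loop over elements would race on the point array, and a parallel implementation must resort to per-thread partial arrays, mesh coloring, or atomic updates.

```c
#define NELEM  524288             /* elements (large data set in the text) */
#define NPOINT 263169             /* mesh points */

/* Connectivity: the three vertex (point) indices of each triangle, and the
 * per-vertex contributions produced by the element-phase computation. */
extern int    tri[NELEM][3];
extern double elem_contrib[NELEM][3];

/* The "scatter-add": aggregate element-vertex contributions onto mesh points.
 * Morton ordering of points and elements, as used in the paper, improves the
 * cache locality of the indirect accesses tri[e][v]. */
void scatter_add(double point_sum[NPOINT])
{
    for (int p = 0; p < NPOINT; p++)
        point_sum[p] = 0.0;
    for (int e = 0; e < NELEM; e++)
        for (int v = 0; v < 3; v++)
            point_sum[tri[e][v]] += elem_contrib[e][v];
}
```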

5.2.2 FEM Application Performance

The prototype FEM application was run on the Exemplar for two different data set sizes and two different codings of the same algorithm. The small data set was chosen to be about the size of the aggregate cache of the Exemplar with 16 processors. The small data set has a mesh with 46545 points and 92160 elements, and the large data set has a mesh with 263169 points and 524288 elements. Note that there are about two elements to every point on the mesh, and an average (maximum) of 6 (7) elements communicating with every point of the 2D meshes. Depending on how a mesh is generated, these critical parameters will vary and so will FEM code performance. The tightest serial coding of the prototype FEM algorithm is used to compute useful Mflop/s. The minimal number of Cray Research Incorporated (CRI) "hpm" floating point operations per point update measured on the small data set is 437 floating point operations/point update (220 floating point operations/element update). These numbers will be used as a conversion factor to useful Mflop/s for our data sets. The algorithm optimized for the CRI C90 runs at 0.57 point updates/microsecond (1.14 element updates/microsecond) on a single head of a C90. Thus we claim 250 Mflop/s versus the 293 Mflop/s measured by "hpm" for the C90 optimized code, because we made it run with a smaller wall clock time by introducing redundant transport flux calculations at the vertices. The algorithm with vector style coding compiled with the serial compiler (-O2) ran at 0.072 point updates/microsecond (0.14 element updates/microsecond) on the Morton ordered data sets (31 Mflop/s). However, the same code compiled with the current (April 1995) parallelizing compiler (-O3) ran at 0.042 point updates/microsecond (0.083 element updates/microsecond). A graph of the scaling of the performance for the two problem sizes is given in Figure 7. The non-monotonic scaling between 8 and 9 processors is being investigated.


Figure 7: The performance of the FEM codes on the small and large data sets. The horizontal line is the performance of a single head of a C90. Curves small1 and large were computed using the same code. Curve small2 was computed using a second coding of the same numerics.

5.3 The Gravitational N-body Problem

The solution of the gravitational N-body problem in astrophysics is of general interest for a large number of problems ranging from the breakup of comet Shoemaker/Levy 9 to galaxy dynamics to the large scale structure of the universe. The problem is defined by the following relation, where the gravitational force on particle i in a system of N gravitationally interacting particles is given by

$$\vec{F}_i = \sum_{j=1}^{N} \frac{G\, m_i m_j\, \vec{r}_{ij}}{\left(r_{ij}^2 + \epsilon^2\right)^{3/2}} \qquad (6)$$

where $G$ is the universal gravitational constant, $m_i$ and $m_j$ are the masses of particles $i$ and $j$, $\vec{r}_{ij}$ is the vector separating them, and $\epsilon$ is a smoothing length which can be nonzero and serves to eliminate diverging values in $\vec{F}_i$ when $\vec{r}_{ij}$ is small. This parameter also serves to define a resolution limit for the problem. This equation also shows that the problem scales as $N^2$, and modeling systems with particle numbers larger than several thousand is infeasible.
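Written out directly, equation (6) is a double loop over particles; the sketch below (with an assumed softening value and unit system) makes the N-squared cost explicit and is the operation the tree codes of the next subsection are designed to avoid.

```c
#include <math.h>

/* Direct O(N^2) evaluation of equation (6): force on every particle i from
 * every other particle j, with softening length eps to avoid divergences. */
void direct_forces(int n, const double x[][3], const double m[],
                   double G, double eps, double f[][3])
{
    for (int i = 0; i < n; i++)
        f[i][0] = f[i][1] = f[i][2] = 0.0;

    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            if (j == i) continue;
            double r[3] = { x[j][0] - x[i][0],
                            x[j][1] - x[i][1],
                            x[j][2] - x[i][2] };          /* r_ij */
            double r2   = r[0]*r[0] + r[1]*r[1] + r[2]*r[2] + eps*eps;
            double inv3 = 1.0 / (r2 * sqrt(r2));          /* (r^2 + eps^2)^(-3/2) */
            for (int k = 0; k < 3; k++)
                f[i][k] += G * m[i] * m[j] * r[k] * inv3;
        }
    }
}
```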


5.3.1 Tree Code Definition

Tree codes are a collection of algorithms which approximate the solution to equation (6) [2, 14, 21]. In these algorithms the particles are sorted into a spatial hierarchy which forms a tree data structure. Each node in the tree then represents a grouping of particles, and data representing average quantities of these particles (e.g. total mass, center of mass, and high order moments of the mass distribution) are computed and stored at the nodes of the tree. The forces are then computed by having each particle search the tree, pruning subtrees from the search when the average data stored at a node can be used to compute the force on the searching particle to within a user supplied accuracy limit. For a fixed level of accuracy this algorithm scales as N log(N), although O(N) algorithms are also possible. Since the tree search for any one particle is not known a priori and the tree is unstructured, frequent use is made of indirect addressing. Further, the tree data is updated during a simulation as the particles move through space. Therefore, this algorithm is not only of interest for its scientific application, but is also of computational interest due to its unstructured and dynamic nature.
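The pruning step can be sketched as a recursive walk with an opening-angle test, in the spirit of Barnes and Hut [2]. The node layout, the acceptance criterion, and the use of recursion below are assumptions for illustration; the code evaluated in this paper follows the Olson and Dorband formulation [21] rather than this exact structure.

```c
#include <math.h>

/* A node of the spatial hierarchy: an octree cell storing the aggregate mass
 * and centre of mass of its subtree; leaves have no children. */
typedef struct node {
    double mass;
    double com[3];              /* centre of mass */
    double size;                /* cell edge length */
    struct node *child[8];      /* NULL for empty octants */
} node_t;

/* Accumulate the acceleration at position p by walking the tree.  A cell is
 * opened only when it subtends too large an angle (size/d > theta); otherwise
 * its averaged data is used and the whole subtree is pruned. */
void tree_force(const node_t *n, const double p[3],
                double theta, double eps, double G, double acc[3])
{
    if (n == NULL || n->mass == 0.0) return;

    double d[3] = { n->com[0] - p[0], n->com[1] - p[1], n->com[2] - p[2] };
    double r2   = d[0]*d[0] + d[1]*d[1] + d[2]*d[2];

    int is_leaf = 1;
    for (int c = 0; c < 8; c++)
        if (n->child[c]) is_leaf = 0;

    if (is_leaf || n->size * n->size < theta * theta * r2) {
        /* Far enough away (or a single particle): use the aggregate data. */
        double inv3 = G * n->mass / pow(r2 + eps * eps, 1.5);
        for (int k = 0; k < 3; k++)
            acc[k] += d[k] * inv3;
    } else {
        for (int c = 0; c < 8; c++)         /* open the cell and recurse */
            tree_force(n->child[c], p, theta, eps, G, acc);
    }
}
```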

5.3.2 Performance Results for Tree Code

We have ported a FORTRAN 90 version of a tree code to the Convex SPP which was initially developed for the MasPar MP-2 and implemented using the algorithm described in Olson and Dorband [21]. The changes to the original code were straightforward, and the compiler directives and the shared memory programming model allowed a very simple-minded approach to be taken. The main alterations to the MasPar code were to distribute all the particle calculations evenly among the processors and make all intermediate variables in the force calculation thread-private. Each processor then calculates the forces of its subset of particles in a serial manner. All indirect accesses are made by each thread of execution into the tree data stored in global shared memory. Further, these indirect addresses are made in the innermost loop of the tree search algorithm, thus relying on the ability to utilize rapid, fine grained memory accesses allowed by the shared memory programming model. This scheme also allows for more efficient use of the data cache on subsequent floating point operations. The program was run on three problem sizes (32K, 256K and 2M particles), applying from 1 to 16 processors in two configurations. The first configuration ran 1, 2, 4 and 8 processors on a single hypernode and the second ran 2, 4, 8 and 16 processors across two hypernodes. Figure 8 shows the parallel speedup for each of the cases measured relative to the single processor rate of 27.5 Mflop/s.

Figure 8: N-Body Performance Scaling

We see that the performance degradation incurred across multiple hypernodes is small; it is between 2 and 7 percent. It is also clear that the performance at 16 processors is affected by the problem size. The task granularity changes linearly with the problem size as


do the overall memory requirements. However, the balance between local and global memory accesses varies non-linearly; it is determined by the proportion of information at each level of the tree and by the proportion of the depths searched by the algorithm. To determine the effect of multiple hypernodes on the scaling of this application, tests should be run on a system with more than two hypernodes. From this initial data it is not possible to predict how speedup will change as additional hypernodes are added. Finally, the 16 processor 384 Mflop/s result compares favorably to a highly vectorized, public domain tree code [14] which achieves 120 Mflop/s on one head of a C90. A message passing version of this code has also been developed using the PVM library (Olson and Packer 1995). This code has been ported to various architectures such as a network of distributed workstations, the Intel Paragon, the Cray T3D, and the Beowulf clustered workstation [23]. Since the Convex SPP also supports the PVM libraries, the code was easily ported to this architecture. The single processor performance of the code was quite good in this case; it is somewhat faster than that quoted above for the code written using the shared memory programming model and is also faster than for any of the architectures mentioned above. The overheads of packing


and sending messages, however, are prohibitive and overall performance is degraded relative to the shared memory version of the code. Optimizations of this code continue.

5.4 Piece-wise Parabolic Method

This code solves Euler's equations for compressible gas dynamics on a structured, logically rectangular grid. The code, named PROMETHEUS [13], has been used primarily for computational astrophysics simulations, such as supernova explosions [3, 20], non-spherical accretion flows [12], and nova outbursts [25]. The equations are solved in the form

$$\frac{\partial \rho}{\partial t} + \nabla \cdot \rho\vec{v} = 0 \qquad (7)$$

$$\frac{\partial \rho\vec{v}}{\partial t} + \nabla \cdot \rho\vec{v}\vec{v} + \nabla P = \rho\vec{g} \qquad (8)$$

$$\frac{\partial \rho E}{\partial t} + \nabla \cdot (\rho E + P)\,\vec{v} = \rho\vec{v} \cdot \vec{g} \qquad (9)$$

where $\rho$ is the gas density, $\vec{v}$ is a vector specifying the gas velocity, $P$ is the gas pressure, $E$ is the internal plus kinetic energy per unit mass, $\vec{g}$ is the acceleration due to gravity, and $t$ is the time coordinate. In order to complete the equations, one must also include an equation of state for computing the pressure from the energy and density. These equations are solved using the Piecewise-Parabolic Method (PPM) for hydrodynamics [6]. This is a very high-resolution finite volume technique, which is particularly well-suited to calculation of flows which contain discontinuities such as shock fronts and material interfaces. This is especially important for astrophysics calculations, where hypersonic flows are frequently encountered and viscosities tend to be extremely small. The original formulation of the algorithm has been extended to allow the use of a general equation of state [7]. In addition, the capability of following an arbitrary number of different fluids has been incorporated. Systems can be studied in either one, two, or three spatial dimensions using rectangular, cylindrical, or spherical coordinates. The results discussed below are for calculations performed on a two-dimensional rectangular grid.

This code was parallelized using a domain decomposition technique, in which the grid was divided into a number of rectangular tiles. This approach has been used successfully on a large number of parallel computers, including the Cray C-90, MasPar MP-1 and MP-2, IBM SP-1 and SP-2, Cray T3D, and Intel Paragon. For the case of the SPP-1000, each processor is assigned one or more tiles to calculate. Each tile is surrounded by a frame of "ghost" points which are used for specifying boundary conditions. Since the formulation of the PPM algorithm used in PROMETHEUS is a nine-point scheme, the "ghost" points need to be updated only once per time step if the

Table 2: PPM Performance

Grid Size   No. of Tiles   No. of Procs   Mflop/s
120 x 480   4 x 16         1              29.9
120 x 480   4 x 16         2              58.2
120 x 480   4 x 16         4              118.8
120 x 480   4 x 16         8              228.5
120 x 480   12 x 48        1              23.8
120 x 480   12 x 48        2              47.8
120 x 480   12 x 48        4              95.9
120 x 480   12 x 48        8              186.2
120 x 480   4 x 16         1              29.9
240 x 960   4 x 16         4              118.5

frame is four grid points wide. The only communication required using this approach is that four rows of values must be exchanged between adjacent tiles once per time step. Since a few thousand floating point operations are needed to update each zone for a single time step, the ratio between communication costs and computational costs is quite low. The results of these runs are given in table 2.
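In a shared memory implementation the ghost exchange amounts to copying rows out of the neighbouring tiles' interiors; in the message passing versions the same rows are packed and sent instead. The sketch below shows the vertical exchange for one tile; the tile dimensions, field layout, and function name are illustrative assumptions rather than the PROMETHEUS data structures.

```c
#include <string.h>

#define NGHOST 4                 /* frame width needed by the nine-point PPM scheme */
#define TNX    120               /* interior zones per tile, x */
#define TNY    30                /* interior zones per tile, y */

/* One tile: interior zones plus an NGHOST-wide frame on every side. */
typedef struct {
    double rho[TNY + 2 * NGHOST][TNX + 2 * NGHOST];
} tile_t;

/* Fill this tile's bottom and top ghost rows from its vertical neighbours,
 * once per time step.  Left/right ghost columns are handled analogously. */
void exchange_ghost_rows(tile_t *t, const tile_t *below, const tile_t *above)
{
    size_t rowbytes = (TNX + 2 * NGHOST) * sizeof(double);

    for (int g = 0; g < NGHOST; g++) {
        /* bottom ghost rows <- topmost interior rows of the tile below */
        memcpy(t->rho[g], below->rho[TNY + g], rowbytes);
        /* top ghost rows <- bottommost interior rows of the tile above */
        memcpy(t->rho[TNY + NGHOST + g], above->rho[NGHOST + g], rowbytes);
    }
}
```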

6 Discussion

The studies described in this paper have been a rich source of experience in the use of the SPP-1000. Those experiences have ranged from frustration to delight. Elation over successes has been tempered by difficulties in achieving them. Frustration with system flaws has been mitigated by unpleasant experiences with a number of competing machines. Under favorable conditions, it has been observed that the HP PA-RISC processor has proven to be an effective, often superior, computing element compared to those used on other multiprocessors in terms of sustained performance on real problems. As an example, efficient implementation of floating point divide provided significantly improved performance over other processors for at least one of our problems. Programming a single hypernode proved remarkably easy and returned excellent scaling across eight processors in all cases. When starting with a good sequential code, parallelizing was often easy for a single hypernode. Under favorable but achievable circumstances, a single hypernode's sustained performance approached that of a single head of a CRI C90.


The primitive mechanisms provided by hardware support yielded very good performance compared to what would be expected from software support. But global operation of these mechanisms incurred measurable cost. In some cases, this was in the range of a factor of 2, which is entirely acceptable. In other cases, the difference in intra-hypernode versus inter-hypernode behavior could be between a factor of 4 and 10. Under such circumstances, domain decomposition and problem partitioning became crucial to minimize performance degradation. Scaling of full applications ranged widely from excellent (better than 80%) efficiency to poor, where performance was seen to degrade between 8 and 16 processors.

In spite of the original intent of cache based systems, this class of memory hierarchy is not transparent to the computational scientist and is a significant factor in optimizing the parallel code. It was observed that problems that largely resided in cache, versus those that were big enough to consume large portions of main memory, could easily show a performance difference of a factor of three for the same application, and this just on a single hypernode. Cache miss penalties to global data versus hypernode local data were measured at about a factor of eight on average. While this is excellent compared to software message passing methods, that difference could still produce serious degradation for the naive programmer. A valued aid in achieving such optimized codes was the availability of hardware supported instrumentation including counters for cache miss enumeration and timing. Also, an excellent tool, CXpa, provided good average behavior profiling that exposes at least coarse grained imbalances in execution across the parallel resources. With these means of observing system behaviour, code modifications were made rapidly and to good effect. If vendors are going to insist on gambling system performance on latency avoidance through caches, then they should make available the means to observe the consequences of cache operation.

An unanticipated problem results from the general nature of the computer operating system. It can run on various numbers of processors and is fully multitasking. The problem is that most scientific applications are written with data structures and control processes based on powers of 2. Most of the test codes required 16 processors and could not easily be recast to run on 15 processors. As a result, operating system functions shared execution resources with the applications. Because these applications maintained their heritage of being statically allocated to processors, there was no way to dynamically schedule threads and adaptively respond to varying processor workloads. Thus critical path length depended on exigencies of operating system demands as well as those of the applications of interest.

Finally, the Convex SPP-1000 system, particularly the system support software, is still maturing. This restricted access to some of the optimizations that will shortly be available. It is anticipated that performance will continue to improve in the near future as improvements migrate into user environments. One of these involved memory partitioning. Neither

node-private nor block-shared modes were operational, limiting control of memory locality. Another involved the parallelizing compiler that did not yet incorporate all of the optimizations found in the serial processor compiler. A last requirement yet to be fully satisfied is the need for fine-tuned libraries for certain critical subroutines such as parallel FFT, sorting, and scatter-add.

7 Conclusion

The Convex SPP-1000 shared memory multiprocessor is a new scalable parallel architecture with unique architectural attributes. Its hierarchical structure combined with multilevel cache coherence mechanisms makes it one of the most interesting parallel architectures commercially available today. This paper has reported the findings of a set of studies conducted to evaluate the performance characteristics of this architecture. The system behavior has been examined from the perspectives of fine grain mechanisms and application level scalability. Synthetic test programs were developed to exercise basic mechanisms for synchronization, scheduling, and communications. Full application codes were drawn from the Earth and space sciences community and selected to invoke varying degrees of global system interaction. It was determined that hardware support for critical mechanisms yielded excellent operation compared to software alternatives, but that the costs of global interactions greatly exceeded equivalent operation at the local level. Scaling was impacted by the additional costs of global communication and address space sharing. While some applications showed good, if somewhat depressed, scaling globally, others demonstrated poor or even negative scaling at the high end of the system configurations tested. Of course, this was influenced by programming style and mapping, and alternative implementations of the same applications might yield improved behavior. As part of the overall experience, it was found that programming within the shared memory environment was easier than in the distributed memory contexts of some other machines and that the availability of performance instrumentation and visualization tools greatly assisted in optimizing application performance. Many of the experiments performed exhibited sensitivity to cache behavior, and it was concluded that even in this globally cache coherent system, the cache structure was too restrictive to permit a generalized programming methodology with an acceptable level of effort approaching that of vector based hardware systems.

In spite of the set of tests conducted and reported here, significant work has yet to be done before the performance behavior is fully understood and can be predicted. Among the near term activities to be undertaken is running on larger configuration platforms. Also, more detailed characteristics of the range of cache behaviors need to be revealed. Finally, more dynamic load balancing and lightweight threads need to be developed and implemented

on this system to ease the programming burden. But as it is, this system stands among the most advanced currently available.

8 Acknowledgments

This research has been supported by the NASA High Performance Computing and Communication Initiative.

References

[1] A. Agarwal, D. Chaiken, K. Johnson, et al., "The MIT Alewife Machine: A Large-Scale Distributed-Memory Multiprocessor," in M. Dubois and S. S. Thakkar, editors, Scalable Shared Memory Multiprocessors, Kluwer Academic Publishers, 1992, pp. 239-261.

[2] J. E. Barnes and P. Hut, "A Hierarchical O(n log n) Force Calculation Algorithm," Nature, vol. 342, 1986.

[3] A. Burrows and B. Fryxell, Science, 258, 1992, p. 430.

[4] CONVEX Computer Corporation, "Camelot MACH Microkernel Interface Specification: Architecture Interface Library," Richardson, TX, May 1993.

[5] D. Chaiken, J. Kubiatowicz, and A. Agarwal, "LimitLESS Directories: A Scalable Cache Coherence Scheme," Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS IV), 1991, pp. 224-234.

[6] P. Colella and P. R. Woodward, "The Piecewise-Parabolic Method for Hydrodynamics," Journal of Computational Physics, 54, 1984, p. 174.

[7] P. Colella and H. M. Glaz, J. Comput. Phys., 59, 1985, p. 264.

[8] CONVEX Computer Corporation, "Exemplar Architecture Manual," Richardson, TX, 1993.

[9] CONVEX Computer Corporation, "ConvexPVM User's Guide for Exemplar Systems," Richardson, TX, 1994.

[10] CONVEX Computer Corporation, "Exemplar Programming Guide," Richardson, TX, 1993.

[11] Cray Research, Inc., "CRAY T3D System Architecture Overview," Eagan, Minnesota.


[12] B. Fryxell and R. E. Taam, "Numerical Simulations of Non-Axisymmetric Accretion Flow," Astrophysical Journal, 335, 1988, pp. 862-880.

[13] B. Fryxell, E. Müller, and D. Arnett, "Hydrodynamics and Nuclear Burning," Max-Planck-Institut für Astrophysik, Preprint 449.

[14] L. Hernquist, "Vectorization of Tree Traversals," Journal of Computational Physics, vol. 87, 1990.

[15] Hewlett Packard Company, "PA-RISC 1.1 Architecture and Instruction Set Reference Manual," Hewlett Packard Company, 1992.

[16] Intel Corporation, "Paragon User's Guide," Beaverton, Oregon, 1993.

[17] IEEE Standard for Scalable Coherent Interface, IEEE, 1993.

[18] Kendall Square Research Corporation, "KSR Technical Summary," Waltham, MA, 1992.

[19] D. Lenoski, J. Laudon, K. Gharachorloo, A. Gupta, and J. Hennessy, "The Directory-Based Cache Coherence Protocol for the DASH Multiprocessor," Proceedings of the 17th Annual International Symposium on Computer Architecture, pp. 49-58, June 1990.

[20] E. Müller, B. Fryxell, and D. Arnett, Astron. and Astrophys., 251, 1991, p. 505.

[21] K. Olson and J. Dorband, "An Implementation of a Tree Code on a SIMD Parallel Computer," Astrophysical Journal Supplement Series, September 1994.

[22] Thinking Machines Corporation, "Connection Machine CM-5 Technical Summary," Cambridge, MA, 1992.

[23] T. Sterling, D. Becker, D. Savarese, et al., "BEOWULF: A Parallel Workstation for Scientific Computation," to appear in Proceedings of the International Conference on Parallel Processing, 1995.

[24] T. Sterling, D. Savarese, P. Merkey, and J. Gardner, "An Initial Evaluation of the Convex SPP-1000 for Earth and Space Science Applications," Proceedings of the First International Symposium on High Performance Computing Architecture, January 1995.

[25] A. Shankar, D. Arnett, and B. Fryxell, Ap. J. (Letters), 394, 1992, p. L13.

[26] V. Sunderam, "PVM: A Framework for Parallel Distributed Computing," Concurrency: Practice and Experience, December 1990, pp. 315-339.

[27] M.S. Warren and J.K. Salmon, “A Parallel Hashed Oct-tree N-Body Algorithm,” Proceedings of Supercomputing ’93, Washington: IEEE Computer Society Press, 1993.

Authors

Thomas Sterling

Thomas Sterling received his Ph.D. in electrical engineering from MIT, supported through a Hertz Fellowship. Today he is Director of the Scalable Systems Technology Branch at the Universities Space Research Association (USRA) Center of Excellence in Space Data and Information Sciences. He is also currently an Associate Adjunct Professor at the University of Maryland Department of Computer Science. Prior to joining CESDIS in 1991, Dr. Sterling was a Research Scientist at the IDA SRC investigating multithreading architectures and performance modeling and evaluation. Dr. Sterling is a member of IEEE, ACM, and Sigma Xi. His current research includes the development of the ESS Parallel Benchmarks, the design of the Beowulf Parallel Workstation, and participation in the PetaFLOPS computing initiative. He can be contacted via e-mail at [email protected].

Daniel Savarese

Daniel Savarese graduated with a B.S. in astronomy from the University of Maryland in 1993. He is currently a Ph.D. student in the Department of Computer Science at the University of Maryland. His current research involves the performance analysis of caches in shared-memory parallel architectures. His research interests include parallel computer systems architecture, parallel algorithms, and real-time sound and image processing. He can be contacted via e-mail at [email protected].

Phillip Merkey

Phillip Merkey obtained his Ph.D. in mathematics from the University of Illinois in 1986. Dr. Merkey joined the Center of Excellence in Space Data and Information Sciences in 1994 after working for the Supercomputing Research Center as a Research Staff Member. He is a member of the Scalable Systems Technology Branch involved in the development of the ESS Parallel Benchmarks and the performance evaluation, modeling and analysis of parallel computers. Other interests include discrete mathematics, information theory, coding theory and algorithms. He can be contacted via e-mail at [email protected].

