Hierarchical Tiling: A Methodology for High Performance

2 downloads 308 Views 417KB Size Report
If a pile of register tiles is to be executed in sequence by one processor, it is desirable to choose di erent input and output surface names for each tile, and then to ...
Hierarchical Tiling: A Methodology for High Performance Larry Cartery

Jeanne Ferrantey Susan Flynn Hummelz Kang-Su Gatlin{

Bowen Alpernx

Abstract

Good parallel algorithms are not enough; computer features such as the memory hierarchy and processor architecture need to be exploited to achieve high performance on parallel machines. Hierarchical tiling is a methodology for exploiting parallelism and locality at all levels of the memory/processor hierarchy: functional units, registers, caches, multiple processors, and disks. Hierarchical tiling concentrates on the interaction between multiple levels of tilings. One novel idea of hierarchical tiling is the naming of the values on the surface of a tile. Names determine where values are stored in the memory/processor hierarchy. Storage for the surface of a tile is materialized at that level, while interior elements of the tile only require storage as temporaries at a lower level of hierarchical memory. A second distinctive feature is that hierarchical tiling provides explicit control of all data movement, both within and between the levels memory/processor hierarchy. This is accomplished by using a three-stage tiling discipline to choreograph data movement. It allows for inter- and intra-level overlapping of computations and data movement. The size and the shape of the tiles impact the parallelism and communication to computation ratio at each level. We illustrate guidelines and modeling techniques for selecting tile sizes and shapes, as well as the choreography of data movement, using two examples. In the rst, uniprocessor performance was sped up by a factor of three, achieving perfect use of the superscalar processor pipeline, and the cache miss penalty was reduced to an insigni cant level. A high level of multiple processor eciency was also achieved. In the second example, taken from a production code, performance gains up to a factor of two (depending on the machine and the program variant) were seen. These improvements ware largely due to hierarchical tiling's ability to improve instruction-level parallelism.

1 Introduction Obtaining high performance on parallel machines involves not just nding a good parallelization but also optimizing data movement within a processing node and exploiting each processor's capabilities. Hierarchical tiling builds on a large body of literature on tiling, notably that of Wolf and Lam [WL91b]. It applies both optimizations of parallelism and locality at all levels of the processing-element/memory hierarchy. Thus, it encompasses many program optimization techniques that are often considered separately. Intuitively, a tile is a set of computations that are selected to be executed closely to each other, both in time and on the same processing element. A surface is a set of values that are computed in one tile and used in a di erent tile. Formal de nitions are given in section 3.2. One novel idea of our work is the naming of the values on the surface of a tile. Names determine where values are stored in the memory/processor hierarchy, and what data must be communicated between a tile and its adjacent tiles and subtiles. Interior elements of a tile only require storage as temporaries at a lower level of hierarchical memory. For example, this technique applied at the register level (as described in section 4.1) allows the results of some computations to be stored in registers and reused without further data movement. For parallelism (as illustrated in section 4.4), data to be communicated can be stored directly in bu ers, possibly eliminating copy instructions. In the nal  Preliminary versions of portions of this paper appeared in three conference papers[CFH95, CFH95, ACG95]. y UCSD, CS&E Dept., 9500 Gilman Dr., La Jolla, CA 92093-0114, carter, ferrante @cs.ucsd.edu. This work was

f g supported in part by NSF Grant CCR-9504150. z Polytechnic University, Brooklyn, NY 11201 and IBM T.J. Watson Research Center, [email protected]. Voice: (718) 260 3205 Fax: (212) 533 1706 This work was supported in part by NSF Grant CCR-9321424. x IBM's T. J. Watson Research Center, Yorktown Heights, N.Y. 10598 { UCSD, CS&E Dept. La Jolla, CA 92093-0114

1

example (section 5), tiling allows one to convert an initial program that is simple but wasteful of storage into one that is storage-ecient. A second distinctive feature is that hierarchical tiling provides explicit control of all data movement, both within the memory hierarchy and between processing elements. By specifying both the input and output names and the scheduling of tiles, the choreography of all data movement is fully speci ed. This allows us to overlap computation with data movement at various levels, when possible. The size and shape of the tiles a ects the amount of parallelism available and the amount of communication required at each level. Tiling's goals include providing enough parallelism to keep all processing elements busy and to reduce the communication to computation ratio. These goals are partially con icting: small tiles increase available parallelism, while large tiles reduce the communication to computation ratio. The tiling choices at di erent levels interact. For levels where the interactions between parallelism and locality is complex, e.g., nodes of a distributed-memory supercomputer, performance models can aid in the selection of tile sizes and shapes. The usefulness of hierarchical tiling is demonstrated with experiments in two very di erent problem domains. Section 4 explores a simpli ed PDE code. Hierarchical tiling improves the performance of the IBM SP1's superscalar processors by factor of three, reduces the e ect of cache misses to an insigni cant level, and results in ecient parallel performance. The second experiment, protein matching, is an important dynamic programming application. Section 5 shows that hierarchical tiling can be used to get good performance on this application on a variety of platforms: an IBM SP2, an Intel Paragon, a Cray T3D, and a farm of DEC Alpha machines. These two applications share a similar computational structure; they require little interprocessor communication and they have mostly sequential memory accesses. As a result, they do not particularly highlight the well-known fact that tiling can produce dramatic performance gains at the interprocessor parallelism and cache levels. Instead, they illustrate that tiling can attain exceptionally good performance from multiple functional units and superscalar pipelines. Also, the second example shows an interesting interaction between tiling at the highest and lowest level of granularity. Hierarchical tiling is presented here as a method of hand-tuning performance; while outside the scope of this paper, many of the techniques can be incorporated into an automatic preprocessor or optimizing compiler. This is the current aim of the Hierarchical Tiling Project at UCSD.

2 Background Iteration-space tiling, also known as partitioning or blocking, is an important and well-known technique [JM86, GJG88, IT88, W87, RAP87, W89, HA90, L90, LRW91, RS91, WL91, WL91b, B92, KM92] that has been used both to achieve parallelism and improve locality. Both tiling and hierarchical tiling focus on very structured computations where the bulk of the execution time is spent in nested loops. Given such a loop, the iteration space is N , where represents the set of integers, and is the depth of nesting. If the loop bounds are known, then a nite subset of N can be used. The Iteration Space Graph (ISG) [RS91] is a directed acyclic graph whose nodes represent the initial values and computations in the loop body, and the edges represent data dependences [PW86]1 between nodes. An example ISG is given in gure 1. Tiling partitions the iteration space into uniform tiles of a given size and shape (except at boundaries of the iteration space) that tessellate the iteration space. A tiling breaks up the problem of scheduling an ISG for execution into a \macro-problem" of scheduling the tiles, and a \micro-problem" of scheduling the iterations within the tile. Scheduling comprises two aspects: assigning iterations to processing elements, and choosing an execution order on each processing element. The schedule that results from tiling is typically embodied in the optimized program by a new, more deeply nested set of loops. Hierarchical tiling uses an abstract machine model that captures the degree of parallelism and the memory capacity of each level of the processor/memory hierarchy of a machine. Each level is tiled by dividing a tile of a higher level into subtiles. Thus this work (like that of [NJVLL93]) explores the interaction of tiling at the di erent levels. But whereas traditional tiling a ects data movement implicitly by altering the order Z

Z

N

Z

1 Since the ISG has a distinct node for each value produced, only true dependences need to be represented. Discarding output and anti-dependences enhances the opportunities for optimizations.

2

for i = 0 to T-1 for j = 0 to S-1 A[j+1] = u*A[j] + v*A[j+1]

output values

input value loop iteration i

data dependence

A[0]

j

A[S]

Figure 1: Example of an iteration space graph (ISG). of references to memory, hierarchical tiling exercises explicit control over all data movement in the abstract machine model. This can exposes opportunities for overlapping data movement with computations.

3 Overview of Hierarchical Tiling Hierarchical tiling determines which operations are to be executed on which processor in which order. Furthermore it attempts to choreograph the program, that is, to direct the movement of data between processing nodes and up and down the memory hierarchy of the target computer. Normally, a program relies on other mechanisms for its choreography: the compiler chooses register loads and stores, the hardware chooses what lines of data to move into and out of cache, the operating system directs the movement of data between main memory and disk storage, and, depending on the system, some combination of program, system software and hardware choose the interprocessor data movement. Hierarchical tiling takes responsibility for the entire process, though ultimately the other mechanisms may overrule its choices. Hierarchical tiling proceeds as follows: 1. The processing elements, memory hierarchy, and communication capabilities of the target computer are modeled. We use the Parallel Memory Hierarchy (PMH) model [ACFS94], which is a tree of memory modules with processors at the leaves of the tree. Each module has a separate address space, and all movement of data is explicit. 2. The Iteration Space Graph is recursively partitioned into tiles and the tiles are scheduled for execution on memory modules in the PMH model. The entire ISG is scheduled on the root module of the PMH. A leaf module executes a tile by performing its operations. A non-leaf module executes a tile by partitioning it into subtiles and scheduling their execution on its children. Conceptually, the hierarchical tiling is a PMH program. An important part of creating a hierarchical tiling is assigning storage. Storage is needed for the \surface" of the tile. Values that are on the surface of a subtiles but in the \interior" of higher-level tiles need not be assigned storage in (nor be moved into or out of) the higher-level modules. 3. The PMH program of step 2 is translated back into a \real" computer program for execution by the target computer. Because real computers don't have the simplicity of the PMH model, this involves using a variety of techniques, each tailored for a particular architectural feature. These steps are described below and illustrated in more detail in sections 4 and 5. 3

3.1 The computation model

The Parallel Memory Hierarchy model captures architectural features that are crucial to achieving high performance: the multilevel memory hierarchy, the fact that data are moved in blocks, the possibility of overlapping communication with computation, and the multilevel hierarchy of parallelism.2 Figure 2 shows the PMH tree for the IBM SP1 computer. The modules are concurrent processors that can communicate with their parents and children along the channels. Modules near the root of the PMH tree typically have more memory and wider channels than their descendents, but they are slower. Thus, they can compute loop bounds, increment index variables, and pass blocks of data to their children, but time-consuming tasks like rearranging data within blocks or performing computations on the data should be delegated to lower-level modules. The model has parameters that give the communication costs of each channel, and the memory capacity and number of children of each module. A procedure for deriving a model for given computer is given in [ACF93]. In the PMH model, all of the channels can be active at the same time, although two channels cannot simultaneously touch the same block. Disks and "Global Communication" Space

Main Memory

Main Memory

Main Memory

Cache

Cache

Cache

Reg.

Reg.

Reg.

E O

E O

E O

Figure 2: PMH Model of the IBM SP1. Boxes labeled E (for even) and O (odd) are functional units that model the two-stage oating-point pipeline. Each module has a separate name space. Data movement along a channel is controlled by the parent module. The child can be working with other data in its memory while blocks are being transferred to or from its parent, but must not access the data in transit. As we will describe in the next section, we \program" the PMH using a discipline that allows us to sidestep issues such as synchronization and deadlock.

3.2 Recursive partitioning

Hierarchical tiling recursively partitions the nodes of the Iteration Space Graph into tiles and schedules the tiles on modules of the PMH tree. The entire ISG is the tile for the root of the PMH tree. A module executes a tile by partitioning it into subtiles which become the tiles that are passed down to, and executed by, modules at the next level down the PMH tree. Leaf modules (i.e., the processors) actually perform the operations of the tiles they receive. The input nodes of a tile are those nodes not in whose values are used by nodes in . If a tile produces a value in a node that is used by some other tile, then that node is called an output node of . The set of input (or output) nodes of are called its input (output) surface. Only the surface values of a tile need to be stored. As is re ned into subtiles further down the PMH tree, temporary storage is allocated (in lower memory modules) for the new surfaces exposed. During execution, each tile is processed in three distinct phases: 1. 's input surface is moved from the parent module into the module that will execute it. (Any portion of the surface that is already in need not be moved.) T

T

T

T

T

T

T

T

T

P

C

C

2 The PMH model may not be ideal. Further work such as [SSM89] is needed to understand which machine characteristics are important for performance.

4

2.

executes the tile. During this time, no data of may be moved between and . Thus, must be suciently small that its entire surface can t in the memory of . ( may need to be smaller still, since may need additional memory for the surfaces of past and future tiles, as well as for new surfaces of the subtiles it creates while executing .) executes only one tile at a time. 3. When the execution of is complete, the output surface of can be moved back up to . can cache some of the output surface for future use. Although each module executes only one tile at a time, there are many opportunities for concurrency. First, can execute by partitioning it into subtiles that it feeds to its children in parallel. Second, since is fed a sequence of tiles n from , the three phases can be pipelined. More speci cally, while is executing tile i , the results of the previous tile i?1 and the input values for i+1 can be moved between and . Thus, hierarchical tiling allows for ecient data prefetching and pipelined execution | sometimes more than can be supported by the underlying hardware. Data already in a child module can be cached there for later reuse. This can eliminate recopying input surfaces from parent to child and, in the case of an output surface that is not used in other modules, movement from child to parent. One particularly bene cial form of data caching is called surface sharing. This occurs when (part of) the input or output surface of one tile is (part of) the input surface of the next tile executed on the same processor. Surface sharing reduces data movement and requires no extra storage space, since both tiles have allocated space for the shared surface already. The optimal size and shape of tiles depend upon the dependences of the ISG and the parameters of the modules, and con icting requirements must be balanced. Giving a recursive partitioning of the ISG into tiles and a module , we de ne a -tile (e.g. a register tile, cache tile, etc.) to be a tile at the coarsest level of partitioning that allows the tile's input and output surfaces to t in simultaneously. Typically, a pile of -tiles with surface sharing between adjacent tiles in the pile will be executed sequentially in . A measure of interest is the surface to volume ratio of a pile of -tiles. This ratio is the same as the communication to computation ratio for module . The surface area of a pile excludes the shared surfaces; thus it is often advantageous to use a tall pile of wide, at -tiles, rather than using more symmetrically-shaped -tiles. Making the -tiles larger improves the communication to computation ratio, but often smaller tiles get parallel execution started earlier. Some guidelines for creating ecient hierarchical tilings are summarized below. These will be illustrated in the examples. C

T

P

C

C

T

T

C

T

T

C

C

T

P

C

T

C

T

P

C

T

T

T

P

C

C

C

C

C

C

C

C

C

C

C

 Use a tile shape that \contains" the dependence vectors from an initial node. As discussed in [RS91],     

this insures a legal tiling with simple loops. If parallel execution is possible, generate independent subtiles. If load balancing may be a problem, use substantially more subtiles than the number of processors. When executing a pile of tiles on a given module, the shared surfaces should be large. Use squat rectangles or slabs that are stacked along their large surface to maximize the amount of shared surface. Keep the input surface of a tile separate from its output surface. This prevents storage-related dependences which would limit parallelism. Swap input and output names to save storage at shared surfaces, if appropriate. This often leads to a \ping-pong" style of programming with two tiles per loop iteration. Strive for relaxed synchronization. The abstract model may not be accurate. Leave a gap between the time when a result is available and when it is required somewhere else.

When a tile is partitioned, there may be odd-shaped partial subtiles at its border. It can yield better performance to further partition each partial tile, but this requires additional programming e ort. A straightforward approach (used in this paper) is to generate simple, untiled code for the partial tiles.

5

3.3 Producing a program

The nal step in hierarchical tiling is converting the PMH program to an executable program. Some aspects of this are easy, for instance, message-passing primitives give explicit control over interprocessor communication. Certain language features may help, e.g. CICO directives [LCW94] guide cache choices. However, the programming languages may not provide a way to express all the desired data movement. Programs can only indirectly specify register usage or schedule instructions. One must sometimes \trick" the compiler into producing the desired code, or, as a last resort, write assembly-language code. Other aspects of data movement, such as cache and page replacement, may not even be under be compiler control.

3.3.1 Registers

Each register tile is composed of nodes which represent the computation of the body of the innermost loop; thus it is straightforward to translate such a node into executable code. Given a register tile of given shape and size, variable names need to be chosen for each value produced by nodes in the tile. Choosing scalar variables for values that are only used locally is desirable, since these in practice are more likely to be allocated to registers. The number of names should re ect the expected number of available registers (perhaps using fewer, in order to accommodate small inaccuracies in the model.) The names should also be chosen so that no data dependences are introduced by reuse, if possible. One strategy is to choose totally distinct names to hold the values, insuring no data dependences are introduced. Once names are chosen, we merely repeat the generated code for each node in the tile, using the selected names, in an order so that dependences between nodes are respected. If a pile of register tiles is to be executed in sequence by one processor, it is desirable to choose di erent input and output surface names for each tile, and then to swap the use of these names in the next tile to be executed, i.e., to \ping-pong" the names. The loop representing the pile of tiles then has as its body code corresponding to two successive tiles.3 Figure 6 is an example of such code.

3.3.2 Cache

For problems too large to t in cache, an additional level of tiling is needed. A cache tile (which itself consists of multiple piles of register tiles) is selected so that each cache tile's input and output surface ts in cache. Again, to give some leeway, a somewhat smaller tile than the maximum should be used. It may be convenient to program a cache tile as a subroutine with appropriate input and output surfaces as parameters. In particular, this facilitates ping-ponging, since a pile of cache tiles can be programmed using a loop with two calls to the cache tile, swapping the names of the input and output surfaces. In the common case that there is no pre-fetching mechanism available to the cache, the program cannot control any data movement into and out of cache. If available, pre-fetch instructions can be inserted at an appropriate point before the cache tile so that the input and output surfaces of the tile are likely available before its execution. If a computer has multiple levels of cache, a level of tiling can be used for each.

3.3.3 Main memory and disk

For problems that are too large to t in main memory, performance can su er dramatically due to the cost of paging data between main memory and disk. When this happens, the most expedient solution often is to nd a computer with more memory. Although not illustrated in our examples, the techniques described in this paper (both ours and the previous tiling references) can be applied to some problems with great success [TAC94]. The standard techniques cited in section 2 can greatly reduce the amount of data that needs to be paged out to disk and brought back into main memory during the course of a computation. However, if the transformed program still relies on the demand-paging mechanisms provided by the computer system, the program will stop each time a page miss occurs. There are a number of ways the problem has been addressed, each with its own peculiarities. For instance, multiprogramming allows other jobs to run during this waiting

3 If there is data reuse from an output surface of one tile to the input surface of two tiles ahead, then a \ping-pang-pong" strategy employing three sets of surface names could be used. This technique can obviously be generalized beyond three, with diminishing bene t.

6

time. But multiprogramming may not make much sense on massively parallel computers, because each concurrent job has less real memory, and because it introduces dicult timing uncertainties into parallel programs. Hierarchical tiling integrates the paging problem with all other levels, including parallelism. It makes it natural to generate code that uses prefetching for disk. This can be accomplished by having a separate thread that anticipates when the program will need data from disk, references the memory locations needed thereby prefetching the page. It also allows the possibility that explicit disk read and writes be used. Depending on the target computer system, explicit I/O may have signi cantly better bandwidth than paging.

3.3.4 Parallelism

Hierarchical tiling supports coarse-grained parallelism by the scheduling of tiles on multiple processors, and the generation of the necessary communication between them. Tile scheduling can be done statically or dynamically, and for either shared or distributed address space programming models. Since tiles have distinct input and output names, the generation of communication is easier (for the compiler) with hierarchical tiling than with languages like HPF. Such communication can be generated with bu ered block moves corresponding to the size of the surface communicated. Furthermore, nonblocking sends can be used to relax interprocessor synchronization.

4 Example 1: PDE-like Code This section will use the SP1 and a simple doubly-nested loop as a running example: for I = 1 to T for J = 1 to S A(J) = C*(A(J-1) + (A(J) + A(J+1)));

This example comes from a paper by Wolf and Lam [WL91]. The value computed in a typical loop iteration depends on the old value of the current array element, the value just computed, and the old value of the next element in the array. The explicit parentheses in the code imposes a particular sequence of oatingpoint operations; we do not allow algebraic reformulations for our experiments. Even with this restriction, there are nevertheless many opportunities for resequencing and overlapping the computations, leading to parallelized code with improved utilization of the processors. Following the convention of Wolf and Lam, the ISG can be pictured as a parallelogram as in gure 3. Each node of the parallelogram represents the three oating-point operations described above. Each node has data dependence edges from its neighbors to the left, below, and below-left, as illustrated by the arrows (the arrows going into the bottom and left-hand nodes are for the initial data values). Notice that if two

Figure 3: Iteration space graph for PDE example. nodes are connected by a sequence of dependence edges, then the sink node is above and to the right of the source. Therefore if two nodes lie on a line with negative slope, there is no dependence between them, and they can be executed in either order. There are many possible computation orders that respect the data dependences in the ISG. For instance, Wolf and Lam [WL91] improve cache usage on a uniprocessor by partitioning the parallelogram into vertical piles of rectangles (and partial rectangles), and evaluating the piles left to right (using a bottom to top order within each strip). Similarly, two rectangular tiles can be executed in parallel, provided that one tile lies entirely below and to the right of the other. 7

4.1 Functional Units and Registers

Although standard tiling chooses what operations will be executed in the innermost loop, it leaves the jobs of instruction scheduling and register allocation to the compiler. Our approach attempts to control these choices by the ne-grained tiling of the ISG. The advantages of the approach include:  It is possible to get more ecient use of pipelined instruction units and registers.  Nodes that are not on a tile's surface do not need to be stored in memory or reloaded into registers. In the example PDE code, these improvements more than double the speed over the naive code, as described in this section. The RISC System/6000 processor used by the SP1 can initiate a oating point instruction each cycle, but each instruction requires two cycles to complete. To achieve maximal usage of the processor, the result of one

oating-point instruction should not be used as an operand of the next instruction, though it can be used by the one after that. To model this pipeline as a PMH, two separate oating-point units are associated with each register module, as in gure 2. One of these leaf modules, even, executes the oating-point instructions that come in even-numbered cycles, while odd executes the others. We strive for a relaxed synchronization, allowing several cycles between when a result is produced by one leaf module before being used by the other, as well as between loads, uses and stores. A second feature of the processor is that loads to the oating-point unit are executed by the xed-point processing unit. This is modeled conveniently in the PMH by saying the bus between cache and registers can be active simultaneously with the buses between registers and even or odd. A nal feature of the SP1's processor is that oating-point stores actually do take a cycle of the oatingpoint unit. Thus, it is desirable to reduce the number of stores. We create the hierarchical tiling starting from the bottom level of the PMH model (corresponding to the nest-granularity tiling) and work upwards. For simplicity4 we assume that even and odd each execute a single node of the ISG as an atomic (6-cycle) action. Following this view, the \memory" of the oating-point unit holds four values (three input and one output), each nest-granularity tile is a single node. Translating this to a conventional programming language is easy: the statement R2 = C*(R17 + (R5 + R3));

says that the registers module moves R17, R5 and R3 down to even (or odd), which executes the tile, and then the result is returned to register R2. Now things become less trivial. We must choose a size and shape for the tiles that are executed in registers. To enhance surface sharing, we want to use a pile of short, wide tiles as in gure 4(a).5 Suppose we use tiles that are 1 node high and wide, for some well-chosen6 value . Unfortunately, each node in such a tile is dependent on its left-hand neighbor, so no parallelism is possible. To keep both even and odd busy, we choose instead a tile that has two horizontal line segments as in gure 4(b). K

T3 T2 T1

K

O O O O O O O O O O O O O O O O O O O O O O O O

E E E E E E E E E E E E

a)

b)

O O O O O O O O O O O O

Figure 4: Pile of register tiles. Nodes labeled E can be executed in parallel with those labeled O. Figure 5 shows the names in registers' address space. Notice that (with one exception) the register names are all distinct. This is partly necessary and partly convenience. For instance, the result labeled EB2 could not be stored in register EA2, since the value in EA2 still needs to be used in computing EB3. 4 A more aggressive approach would be to break each node into its three component oating-point operations. Doing this is unnecessary in the example. 5 Alternatively, we could use tall, narrow tiles and execute a horizontal sequence of tiles. Both work. 6 Making K large is desirable to amortize the store operation over 3

oating-point operations. However, the surface of a 1  tile is 2 + 2, which limits to 15 since there are 32 registers. K

K

K

K

8

LB

EB1

EB2

EB3

EB4

LA

EA1

EA2

EA3

EA4

OB1

OB2

OB3

OB4

EB4

OA1

OA2

OA3

OA4

Figure 5: Register tile and names for PDE example. The nodes labeled EB are executed by even, those labeled OB are executed by odd. The other labeled nodes are used as inputs in computing these nodes (see gure 2 for dependences). However, it is possible to store EB2's result in register EA1. Reusing register names would reduce the number of registers needed (allowing for larger tiles) but it creates (not insurmountable) diculties at the next level up the PMH tree. The one exception to having distinct register names is that EB4 is used twice. This creates an output dependence between two nodes in the tile | OB1 must be computed before the upper EB4. Since OB1 is the rst node to be executed by odd and EB4 is the last for even, this satis es our desire for relaxed choreography. Other than that one dependence, the operation of even and odd are completely independent. Finally, we must arrange to pass a vertical pile of these tiles down from cache to registers. This is where our choice of register names is convenient | we can \ping-pong" the input and output surface names by swapping all the A's for B's in the register names. After two tiles are passed, we end up in the same state as when we started. This allows us to translate the PMH choreography into the conventional program shown in gure 6.7 Figure 7 shows the number of cycles per node for the original code,8 the same code compiled using the -Pk preprocessor,9 and code of gure 6 produced by hierarchical tiling. The relatively slow performance of the original code is caused in part by the fact that each oating-point operation is dependent on the previous one. If this were the only problem, however, the code would take only 6 cycles. There appears to be an additional 3-cycle delay caused by a hardware interlock between the store in one iteration and the loads of the next. The unrolled code produce by the -Pk preprocessor allows for better use of the oating-point pipe. If the usage were perfect, the code would take four cycles per node (one for each of the three operations and one for the store). However, there appears to be three additional cycles for every four nodes due to various pipeline delays. The main advantage of the code produced by hierarchical tiling is that instead of storing the value of each node of the ISG, only one in eight (the rightmost node of each tile) requires storing in cache. The code of gure 6 requires a minimum of 50 cycles (48 operations plus two stores) to compute 16 nodes. The fact that it actually runs in 50 cycles shows the second advantage of hierarchical tiling { it allowed perfect use of the oating-point pipeline and of the overlapped loads provided by the hardware.

4.2 Cache

The inner loop code produced by hierarchical tiling in the previous section computes an 8-node wide vertical strip of the Iteration Space Graph. We incorporated it into a program (called \Reg+FMA") that computes the entire ISG. The program is straightforward but messy due to the ragged-edged, sometimes-triangular border pieces. The resulting code has a substantial overhead for small problem sizes. Nevertheless, one might hope that as the problem size increased, the performance would approach the 3.125 cycle/node speed of the inner loop. The actual timings, shown in gure 8, show that performance of Reg+FMA gets worse when the problem size exceeds 4000. The same e ect can be seen more clearly in the original code in gure 8, where cycles/node increases from about 9 to above 10 when the problem size exceeds 4000. The problem is cache misses. The cache on a processor of the SP1 holds 4096 eight-byte oating-point numbers. 7 In this code, the unnecessary output dependence (which arises because EB4 is both used and de ned in tile I) could have been eliminated by using a \ping-pang-pong" strategy of executing three register tiles in the inner loop. 8 All compilations in this example used the AIX XL Fortran Compiler, Version 3.2.0, with the -O3 optimization level. 9 -Pk -Wp,-optimize=5,-scalaropt=3,-roundoff=3 was speci ed. This mainly unrolled the loop four times.

9

/* initialization */ LA=LEFT[1]; EB4=BOTTOM[J+4]; EA1=C*(LA +(BOTTOM[J] +BOTTOM[J+1])); OA1=BOTTOM[J+5]; EA2=C*(OA1+(BOTTOM[J+1]+BOTTOM[J+2])); OA2=BOTTOM[J+6]; EA3=C*(OA2+(BOTTOM[J+2]+BOTTOM[J+3])); OA3=BOTTOM[J+7]: EA4=C*(OA3+(BOTTOM[J+3]+BOTTOM[J+4])); OA4=BOTTOM[J+8]; for (I=1; I < PileHeight-2; I+=2) { /* tile I LB =LEFT[I+1]; EB1=C*(LB +(LA +EA1)); EB2=C*(EB1+(EA1+EA2)); EB3=C*(EB2+(EA2+EA3)); EB4=C*(EB3+(EA3+EA4));

*/ OB1=C*(EA4+(EB4+OA1)); OB2=C*(OB1+(OA1+OA2)); OB3=C*(OB2+(OA2+OA3)); OB4=C*(OB3+(OA3+OA4)); RIGHT[I]=OB4;

/* tile I+1 */ LA =LEFT[I+2]; EA1=C*(LA +(LB +EB1)); EA2=C*(EA1+(EB1+EB2)); EA3=C*(EA2+(EB2+EB3)); EA4=C*(EA3+(EB3+EB4));

OA1=C*(EB4+(EA4+OB1)); OA2=C*(OA1+(OB1+OB2)); OA3=C*(OA2+(OB2+OB3)); OA4=C*(OA3+(OB3+OB4)); RIGHT[I+1]=OA4;

} /* complete pile and save top surface */ TOP[J+5]=C*(EA4 +(EB4+OA1)); TOP[J+6]=C*(TOP[J+5]+(OA1+OA2)); TOP[J+7]=C*(TOP[J+6]+(OA2+OA3)); TOP[J+8]=C*(TOP[J+7]+(OA3+OA4)); RIGHT[I]=TOP[J+8]; if (I+1 == PileHeight) { ... code for one remaining tile ... }

TOP[J+1]=EA1; TOP[J+2]=EA2; TOP[J+3]=EA3; TOP[J+4]=EA4;

Figure 6: Inner loop code produced by hierarchical tiling of registers and processors for PDE example. Standard tiling at the cache level only, which does not make particularly good use of the processor, was applied in the program labeled \CacheOnly" in gure 8. The at performance curve indicates that cache misses have been successfully amortized against a large block of computation, but the overall performance is lacking. Following the hierarchical tiling discipline (section 3.2), we made a subroutine TILE(H, W, LEFT, BOTTOM, RIGHT, TOP) that computes an H node high by W node wide tile. TILE takes inputs in LEFT(1:H) and BOTTOM(0:W), producing outputs in RIGHT(1:H) and TOP(0:W). It does the bulk of its work by using the code of the inner loop, but lls in the ragged top and bottom nodes to handle a complete rectangle. To insure the tile ts in the CACHE module, we need 2(H + W) + 2  4096, and to keep the choreography \relaxed", use of a somewhat smaller tile is encouraged. The program labeled \All" in gure 8 is blocked for cache, registers and the oating-point unit by choosing Original Code 9.0 Using -Pk preprocessor 4.75 With Reg+FMA Tiling 3.125

Figure 7: Cycles per node for inner loop.

10

S=T= Original (with -Pk) CacheOnly (with -Pk) Reg+FMA All

2000 4000 6000 8000 10000 12000 9.04 9.09 10.22 10.24 10.24 10.25 4.65 4.67 5.49 5.49 5.49 5.49 9.20 9.20 9.20 9.20 9.20 9.21 5.31 5.31 5.31 5.31 5.31 5.31 3.33 3.30 3.37 3.41 3.42 3.43 3.37 3.29 3.26 3.25 3.24 3.23

Figure 8: Performance (in cycles/node) on 1 SP1 processor. tiles at the cache level of size 800 high and 64 wide10 , each consisting of columns of register tiles. The gure shows that at large problem sizes, the \All" code produced by hierarchical tiling runs in 32% of the time required by the original code; 61% of the time of best code developed using competitive techniques.

4.3 Main memory and disk

Since the full iteration space graph t in main memory, tiling for disk would have no advantage. Of course, if the problem size had been larger, we could have illustrated the use of hierarchical tiling at the disk level as we did for registers and cache.

4.4 Multiple Processors

To tile our example ISG on multiple processors, we chose a simple (block cyclic) decomposition of the horizontal axis of the parallelogram, where each processor executes a vertical pile of tiles by calling the subprogram TILE of section 4.2. Thus, a processor executes the tiles in its pile by passing them down to its CACHE module. When a processor nishes one pile, it then starts its next cyclically-assigned pile. Assigning a vertical pile of tiles to a processor allows surface sharing along the width of the tile. We used a ping-pong strategy for the TOP and BOTTOM surfaces of tiles in a pile. An advantage of hierarchical tiling is that the LEFT and RIGHT surfaces of tiles on di erent processors can be communicated with block moves of size H (the height of the tile). Furthermore, nonblocking sends can straightforwardly be used to relax interprocessor synchronization. Our program uses per-processor arrays of LEFT and RIGHT bu ers for communication. When selecting optimal sizes for the width and height of the tiles at the processor level, two con icting concerns must be addressed: larger tiles improve the communication to computation ratio; small tiles get everything started earlier when there are inter-tile dependences. To aid in the selection of and , we developed a performance model of the parallel execution of the ISG for our example from which we derived a performance prediction function based on the values of and . For a given execution time of TILE, it is possible to solve the equation analytically for the optimal and values. However, the execution time of TILE depends on the values of and , so these \optimal" values will only be approximate. We therefore used a plot of the equation for a representative TILE execution time (of 3.4 cycles per node) to locate the and parameter ranges in the area of optimal performance (see gure 9). To obtain optimal performance and to verify the accuracy of our predictions, we experimented with widths and heights in this area on 16 processors. The model and experimental results are discussed below. W

H

W

H

W

W

W

W

H

H

H

H

4.4.1 Analysis and Experiments We derived a formula to relate the performance of the hierarchically tiled program to the height and width of the highest-level tiles. The program was run on the SP1 to verify the accuracy of our analysis. For the 62.5 MHz/sec SP1, we use the message communication cost estimate given in [TC94] of 50 microseconds (3125 cycles) plus .12 microsecond (7.5 cycles) per byte transferred. We assign half of the 3125 cycle latency 10

The results would be very similar for other parameters.

11

w, h, (3.4+120/w+3215/(w*h))*(1+17*h*(16*h+14*w)/(2*16000*16000))

Performance 25 20 15 10 5 500 400 100 300

500 700 900 1100 Width 1300

200

300 Height

100 1500

Figure 9: Plot of performance prediction function. to the sending processor and half to the receiver, but charge both processors the per-byte cost since neither processor is able to overlap computation with the actual sending or receiving of data. The choice of tile size and shape is based on three concerns:  Each tile involves two messages (one received and one sent). The communication latency (3125 cycles) should be much smaller than the tile's volume (which represents the time required to execute the tile). Each node in the iteration space in our example takes only a few cycles, suggesting tiles with   3125.  The communication throughput or inverse-bandwidth (60 cycles per doubleword) should be amortized against the volume/surface ratio. Since each tile involves communicating 2  doublewords, this suggests   60  2  or  120.  At the beginning and end of the computation, full parallelism cannot be achieved due to the data dependences between the tiles. To keep the number of idle cycles small, the tiles should be relatively small.11 The partially-con icting concerns are re ected in the following formula for the predicted parallel execution time on processors: W

H

H

W

H

H

W

P

Perfpar (

W; H

) = (Perfseq (

W; H

3125 ) + 120 W + WH )  (1 + ( + 1)( H P

HP

+

WP

? 2 ) (2 )) W =

ST

The rst factor of this product is the computation plus communication time per node. It shows that the larger and are, the less parallel performance is degraded by the amortized communication time. The second factor is an expression that takes into account the fact that maximum parallelism is only achieved on a 2 portion of the computation. Assuming that  and  2, this factor is approximately 1+ 2 suggesting that the total number of tiles (i.e. ) should be large compared to . To derive this formula, we assumed that   , and that the time to execute each tile and communicate the results is exactly tile = Perfseq ( ) + 120 + 3125. This assumption ignores the W

H

H

W

P

W HP

S T =W H

S=P

t

W

WH

=S T

P

H

W; H

H

11 It may be that a mixed strategy | using small tiles at the ends of the computation and larger tiles in the middle | would be advantageous. This is closely related to the technique of fractiling [H93].

12

fact that some tiles are not full  rectangles, and that there will certainly be irregularities in the communication times. However, the formula does take into account that full -way parallelism doesn't begin immediately, using the following reasoning. Let = . The -th pile of tiles is the rst one to contain at least tiles. Once execution reaches the -th pile of tiles, all of the processors fall into a staggered lockstep behind processor .12 Full -way parallelism will be established by the time processor begins its -th tile. This will occur at time ( + ? 2) tile , and at this time, about ( ? 1)( + ? 2) 2 tiles (those in the initial triangle) will have been already computed. As execution continues from this time forward, the parallel execution continues, but the amount of staggering increases as long as each pile of tiles is a little higher than the previous pile. When the computation reaches the progressively shorter piles on the right side of the parallelogram, the amount of stagger will get reduced again. Full parallelism can continue until the -th processor from the end completes all but its last ? 1 tiles.13 The parallelism from this point on decreases at the same rate that it increased during the rst piles. The phases of staggered execution are illustrated in gures 10 and 11, which is a visualization of an execution of the code on 4 and 16 processors of the SP1 using the PV program visualization system. W

H

P

k

P

P H=W

k

k

k

P

P

k

k

P

t

P

k

P

=

k

P

k

Figure 10: Phases of staggered execution of the ISG on 4 processors. By this reasoning, the execution time is approximately 2( + ? 2) tile (to execute the initial and nal triangles), plus the time to execute the remaining ( ) ? ( ? 1)( + ? 2) tiles at the full-parallelism rate of tiles every tile cycles. The formula follows from multiplying the execution time by ( ) to get 12 While the -th processor is executing the -th tile in this pile, processor + (mod ) will be executing the ( ? ) tile in k

ST = W H

P

P

t

P

k

P

t

k

P = ST

j

k

i

P

its next pile (for ). 13 Full parallelism may end earlier if the staggering between adjacent processors is unequal. i < P

13

j

i

Figure 11: Phases of staggered execution of the ISG on 16 processors.

14

"measured_perf" "predicted_perf"

Performance 12 11 10 9 8 7 6 5 4 160 120 200

400

600 800 Width

Height

80 1000

40

1200

Figure 12: Graph of actual versus predicted performance in the area of optimality parallel cycles per node. One could nd the best tile width and height analytically; however the fact that the sequential performance is better for larger tiles shifts the optimum values a bit. Figure 12 shows a plot of the formula when = 16 and = = 16000 near the region of optimality. The plot assumes Perfseq ( ) is constantly 3 4. It also shows the actual parallel performance on the SP1. For each combination, multiple runs were made, and the best times are reported. The performance numbers plotted are cycles per node of the iteration space (elapsed time  cycles per second  (  )). A sample of the timing results is shown in gure 13. The best performance, 3 80 cycles/node (compare to the P

W; H

S

T

:

W; H

P= S

T

:

&

H

&

40

60

80

100

120

140

160

700

4.19

4.12

4.09

4.10

4.14

4.16

4.17

800

4.02

3.96

3.92

3.94

3.97

3.93

3.97

900

3.89

3.80

3.81

3.81

3.92

4.01

4.17

1000

3.95

3.95

4.08

4.12

4.21

4.29

4.40

W

Figure 13: Performance (in processor-cycles/node) on 16 processors of an SP1 for S = T = 16000. unoptimized serial performance of 10.25), occurred for = 900 and = 60, and gives reasonable eciency (85%). There is very good agreement between the predicted and measured performance for 400   1000, and asymptotically the curves behave similarly. The inaccuracy when 1000 is not surprising, since our assumption that  is violated. The inaccuracies for small values of are only partly explained by the extra overhead in the parallel program; there may also be contention (resulting from the higher communication to computation ratio) not captured by our linear communication costs model. This and other \bumps" in the plot illustrate the diculties of modeling multiprocessor performance. The parallel performance is quite close to the best serial performance we obtained (3 80 versus 3 22) on this problem size, which is somewhat surprising given the ne-grained synchronization of the problem and the coarse-grained parallelism of the SP1. These results demonstrate that hierarchical tiling can handle multiple processor parallelism, and good eciency can be achieved. W

H

W

W >

S=P

W

W

:

15

:

5 Example 2: Protein Matching Code This section describes how hierarchical tiling was applied to a production code. Although this is an entirely di erent algorithm from the previous example, it is similar since the iteration space graphs for the two problems are closely related. It illustrates how the ideas introduced in the simple example are realized in a program that was encountered in practice. This example also exhibits an interesting interaction between the highest and lowest levels of tiling that was not present in the earlier example. The Smith-Waterman-Gotoh algorithm is a computationally-intensive algorithm for evaluating how well two sequences of characters match each other. It is a dynamic-programming string-matching algorithm that allows the match to have gaps of unused characters from either or both strings. The gaps corresponding to insertions or deletions of genetic material in the strings. The algorithm is fundamental to the analysis of proteins and genes. The code runs at the San Diego Supercomputing Center, and allows molecular biologists to search a database of thousands of known reference sequences r[i] for ones that may be related to a query sequence q. Pseudocode for the program is shown in Figure 14. The program is \embarrassingly parallel", in that the computation for each reference string is independent. Evaluating each query-reference match involves (conceptually) lling in a 2-dimensional table, where each table entry has three components, r gap, q gap, and nogap. The computation of an entry requires about 11 operations, and uses three other table entries: the ones below, to the left, and diagonally below-left. The ISG is a 3-D solid. For each reference string, there is a 2-dimensional grid of nodes with one node for each table entry. These grids are stacked to form a solid. Since the reference strings are not all the same length (they vary from hundreds to thousands of characters), the solid has a very irregular face. If the reference strings are sorted according to length, this face will take on a smoother, but still sloping appearance. Each 2-dimensional grid has the same dependences between iterations as depicted in gure 3, though the grid itself is a rectangle rather than a parallelogram. The dimensions of the rectangle are the length of the query string, q length, and the length of the reference string, r length[i]. There are two variants of this algorithm of relevance to the tiling experiments, called the xed-point and the oating-point variants. They di er in the number representations and the implementation of the ADD and MAX operations. In the xed-point variant, ADD and MAX are ordinary integer addition and maximum operations. In the oating-point variant, ADD represents a oating-point multiply and MAX represents a oating-point addition.14 On this code, hierarchical tiling involves tiling at three levels: tiling for parallelism on a multicomputer, tiling for cache, and tiling for instruction-level parallelism. We experimented on an Intel Paragon, an IBM SP2, and two DEC Alpha-based systems, a T3D and a farm of Alpha workstations. The experiments are described in detail in the next sections. It turned out that for the xed-point variant, tiling produced modest improvements at the processor and cache levels. But for the oating-point variant, there was a large gain in processor performance achieved by executing two or three planes of the ISG concurrently on a processor. 
This illustrates an important point: it may be necessary to consider the entire multilevelled tiling problem as interrelated. The best performance used a non-optimal choice of tile size at the higher levels of tiling (the node and cache levels) in order to get a signi cant performance advantage at the functional-unit level.

5.1 Storage mapping under hierarchical tiling

The pseudo-code of Figure 14 uses triply-index variables, which made writing it particularly simple, but it is wasteful of memory.15 Hierarchical tiling can eliminate this waste by discarding the program's storage mapping and introducing storage only at the levels of memory where it is needed. The input surface to the entire ISG is the query and reference strings and the Value matrix. The output surface is the score array. Since the computation for each i is independent of all other i's, there only needs to be one j{k table for The oating-point variant uses an exponential number representation, that is, a score of is represented by the oatingpoint number 2s . For  1 5, encoding the score by s produces nearly-identical scores to the xed-point variant, and both algorithmic variants select exactly the same set of strings as good matches [ACG95]. 15 To be historically accurate, we note that we began our experiments on this application with an implementation that used space-ecient one-dimensional matrices. We wrote the pseudo-code to help us understand the program and facilitate our experiments. We show, however, that hierarchical tiling can provide a sequence of program re nements that proceed from the easier code to the space-ecient version. 14

s

b

:

s

b

16

initialize q_gap, r_gap, nogap, score for i from 0 to number_of_r_strings-1 do for k from 0 to r_length[i]-1 do for j from 0 to q_length-1 do

/* can do in parallel */

r_gap[i,j,k] := ADD( ExtendGap, MAX( r_gap[i,j,k-1], ADD( nogap[i,j,k-1], StartGap ) ) ) q_gap[i,j,k] := ADD( ExtendGap, MAX( q_gap[i,j-1,k], ADD( nogap[i,j-1,k], StartGap ) ) ) nogap[i,j,k] := MAX( MAX( ADD( nogap[i,j-1,k-1], Value[q[j], r[i][k]] ), ZERO ), MAX( q_gap[i,j,k], r_gap[i,j,k] ) ) score[i]

:= MAX(score[i], nogap[i,j,k])

Figure 14: Pseudo-code for the Smith-Waterman-Gotoh protein string matching algorithm. The query string, q is compared to a set of reference strings, r[i]. The algorithm uses dynamic programming to evaluate the best match, using the Value matrix to score the paired characters, and assigning a penalty of StartGap + n * ExtendGap for skipping over n consecutive characters in either string. The three arrays, r gap, q gap, and nogap have the score of the match between the initial j characters of the query string and the initial k characters of the i-th reference string. The scores in r gap and q gap are the scores that arise when the match ends in a gap of unused characters from one or the other string. each value of i that is being computed concurrently. Secondly, it is not necessary to allocate storage for an entire j{k table. Instead, this computation for each table is partitioned into cache tiles, and storage in main memory is only needed for the cuts introduced by the tiling. A ping-pong strategy for executing piles of cache tiles only uses two vectors of nogap values and two of r gap values, plus a small amount of storage for the surface of a single cache tile. In fact, we used code that, instead of ping-ponging, used the same array for the input and output surface of each pile.16 The code of Figure 16 re ects these changes.

5.2 Tiling for multicomputer parallelism

The original code executed the ISG on a parallel computer by assigning one reference string at a time to each processing node, using a master-slave (or worker-manager) paradigm. This achieves load balancing, which is important since the reference strings are of di erent lengths. An alternate strategy is to give several (two or three in our experiments) planes to a processor at a time. Since these strings are evaluated concurrently, we sorted the reference strings according to length so the strings in a group have nearly identical lengths. There are still enough groups (far more than the number of processors) so that the master-slave paradigm achieves near-perfect load balancing.17 The potential advantage of assigning several strings at once to a processing node, and passing the strings in parallel to cache and register tiles, is that more parallelism is available at the lowest levels. This can enable a pipelined or superscalar processor to have fewer wasted cycles. The potential disadvantage is that the cache and registers have to hold data from several independent problems, and the amount of space Eliminating ping-ponging this way requires taking care to save nogap[i,j-1,k] before overwriting it with nogap[i,j,k], since that value is used (referred to as nogap[i,j-1,k-1]) in the next loop iteration. 17 In fact, the number of reference strings is growing rapidly, as new sequences are being produced by the human genome and other projects. Meanwhile the number of nodes per machine is decreasing; SDSC's Paragon had 400 nodes, their newer SP2 has 16 nodes and their T3D has 128. 16


As a result, to prevent cache thrashing or register spilling, each string is broken into more pieces, which results in more cache misses and register loads. The experiments described later demonstrate that for the fixed-point variant, using only one string at a time gave the best performance, but that two were better for the floating-point variant. As with other embarrassingly-parallel master-slave codes, the computation of a tile on a node can be overlapped with the communication of the data for the next tile to that node. However, we didn't perform these experiments, since interprocessor communication is an insignificant fraction of the total time.
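As a sketch of the work-unit construction described above, the following C fragment sorts the reference strings by length and groups a few consecutive (hence similarly-sized) strings into one unit, which a master can then hand to idle workers. The struct and function names are invented for illustration; the dispatch mechanism itself (message passing, queues) is not shown.

    /* Group length-sorted reference strings into master-slave work units.
     * Names are illustrative; only the grouping logic is shown. */
    #include <stdlib.h>

    typedef struct { int id; int length; } RefString;
    typedef struct { int first; int count; } WorkUnit;  /* range in the sorted array */

    static int by_length(const void *a, const void *b)
    {
        return ((const RefString *)a)->length - ((const RefString *)b)->length;
    }

    int build_work_units(RefString *refs, int n_refs, int per_unit, WorkUnit *units)
    {
        qsort(refs, (size_t)n_refs, sizeof refs[0], by_length);
        int n_units = 0;
        for (int i = 0; i < n_refs; i += per_unit) {
            units[n_units].first = i;
            units[n_units].count = (n_refs - i < per_unit) ? n_refs - i : per_unit;
            n_units++;
        }
        return n_units;   /* far more units than processors keeps the load balanced */
    }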

5.3 Tiling for cache

In the original code, when the strings are long, the computed values migrate out of cache before they are used.18 Tiling the ISG reduces the number of cache misses. In the first set of experiments, we made two improvements to the code. First, we tiled the program for cache, by partitioning the iteration space in the k dimension into swaths of width swathsize = 50. Second, we eliminated indirect addressing in the inner loop (namely the occurrence of Value[q[j],...]). This was done by precomputing, for each possible character c in the alphabet of q, the vector V[c][k] = Value[c, r[i][k]], and then replacing Value[q[j], r[i][k]] by V[c][k], where c = q[j]. This results in the pseudo-code of Figure 16.

                               Intel i860            Power2              Alpha 21064
    Query string length      500   2000  5000     500   2000  5000    500   2000  5000  10000
    Original code            38.2  39.5  43.1     18.8  17.8  17.7    49.0  51.7  51.2  52.3
    Cache-tiled, optimized   35.4  34.5  34.4     17.4  16.8  16.8    49.8  48.5  48.7  48.7

Figure 15: Cycles per iteration for the inner loop of the fixed-point variant of the protein matching code. (These figures were derived by dividing the CPU time of the program by the product of the string lengths; thus, they include the amortized overhead of the entire program.)

The experiments summarized in Figure 15 show the effect of these improvements. For all three machines, when the query string is of length 500, it remains in cache. The performance change for this length shows the effect of removing the indirect addressing. For the i860 and the Power2, this improvement was about 8%; the "improvement" to the Alpha actually slowed it down a little. For the original code on the Paragon and the Alpha, the number of cycles per iteration increased for the longer query strings, due to the processors' small (8KB) caches. In both cases, the cache tiling was successful, as the optimized code does not exhibit any performance degradation on the longer strings. On the other hand, the SP2 has a large enough cache to hold the longest strings of our experiments. Thus, both the original and optimized codes go slightly faster as the strings increase in length, since the overhead is amortized over more iterations of the inner loop. The improvement due to more effective cache utilization is modest for two reasons. First, as in the PDE example, the data stored in each cache line are used sequentially (i.e., there is spatial locality), so in the unoptimized version, there is only a cache miss on occasional loop iterations (depending on the size of a cache line). Second, each node of the protein-matching ISG involves more operations than for the PDE example, and thus the relative impact of cache misses is even smaller.
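The two cache-level changes can be sketched in C for a single reference string: the k dimension is strip-mined into swaths, and a per-swath table V[c][k] replaces the doubly-indirect Value[q[j]][r[k]] access in the inner loop. The alphabet size and array names are assumptions made for the sketch, and the dynamic-programming update itself is elided.

    #define SWATH    50
    #define ALPHABET 25                  /* assumed alphabet size */

    /* q and r hold characters already encoded as 0..ALPHABET-1. */
    void score_swaths(const int *q, int q_len, const int *r, int r_len,
                      const int Value[ALPHABET][ALPHABET])
    {
        int V[ALPHABET][SWATH];

        for (int kk = 0; kk < r_len; kk += SWATH) {
            int width = (r_len - kk < SWATH) ? r_len - kk : SWATH;

            /* precompute V[c][k] = Value[c][r[kk+k]] once per swath */
            for (int c = 0; c < ALPHABET; c++)
                for (int k = 0; k < width; k++)
                    V[c][k] = Value[c][r[kk + k]];

            for (int j = 0; j < q_len; j++) {
                const int *Vrow = V[q[j]];   /* direct access, no Value[q[j]][...] */
                for (int k = 0; k < width; k++) {
                    (void)Vrow[k];           /* ...DP update for node (j, kk+k)... */
                }
            }
        }
    }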

5.4 Tiling for instruction-level parallelism

There are several potential advantages of tiling at the level of registers. First, by executing several nodes of the ISG in one loop iteration, the loop overhead of the branch and counter increment is amortized over more operations. This advantage could also be achieved by simply unrolling the loop, which corresponds to choosing a tile that is several ISG nodes long in the direction of the innermost loop (the k direction).

18 Furthermore, the long strings will occasionally displace portions of the Value array. This array is about 4KB in size and is referenced frequently, so it normally resides in cache.


initialize score
for i from 0 to number_of_r_strings-1 do      /* can do in parallel */
    for j from 0 to q_length-1 do
        r_gap_vert[j], nogap_vert[j] := U, 0
    for kk from 0 to r_length[i]-1 by swathsize do
        this_swath := min( swathsize, r_length[i]-kk ) - 1
        for c in alphabet do
            for k from 0 to this_swath do
                V[c][k] := Value[c, r[i][k+kk]]
        for k from 0 to this_swath do
            q_gap_horz[k], nogap_horz[k] := U, 0
        temp := 0
        for j from 0 to q_length-1 do
            nogap := nogap_vert[j]
            r_gap := r_gap_vert[j]
            c := q[j]
            for k from 0 to this_swath do
                r_gap := ADD( ExtendGap, MAX( r_gap, ADD( nogap, StartGap ) ) )
                nogap := MAX( ADD( temp, V[c][k] ), ZERO )
                temp  := nogap_horz[k]
                q_gap := ADD( ExtendGap, MAX( q_gap_horz[k], ADD( temp, StartGap ) ) )
                nogap := MAX( nogap, MAX( q_gap, r_gap ) )
                score := MAX( score, nogap )
                nogap_horz[k], q_gap_horz[k] := nogap, q_gap
            temp := nogap_vert[j]
            nogap_vert[j] := nogap
            r_gap_vert[j] := r_gap

Figure 16: Pseudo-code for the optimized Smith-Waterman-Gotoh program.


A second advantage of tiling is that the number of loads and stores per node of the ISG can be reduced. This is accomplished by choosing a tile that is several nodes high in the j direction. The output values of the interior nodes of these higher tiles only exist in registers and are not moved out to cache. There is a limit to how high the tile can be, however, since the larger the tile, the more registers are needed. When the register limit is reached, the compiler produces spill code, resulting in dramatically degraded performance. A third advantage of tiling at the processor level is that it can provide independent operations to keep the functional units busy. Observe that the inner loop of the code has 11 MAX and ADD operations. There is a "critical path" of dependent operations of length five; specifically, the computation of r_gap in one iteration from its value in the previous iteration is given by r_gap := ADD( MAX( ADD( MAX( MAX( r_gap, q_gap ), nogap ), StartGap ), r_gap ), ExtendGap ). It can happen that it takes longer to execute this critical path than it would take to execute the 11 independent operations. For instance, suppose our processor can execute any three independent MAX and ADD operations per cycle. With enough independent work, the processor could average 11/3 ≈ 3.7 cycles per node of the ISG. However, if each node is dependent on the previous node, then the critical path reduces performance to 5 cycles per node. The experiments described later show that this consideration did not reduce performance for the fixed-point variant of the program, but was significant in the floating-point variant.
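To illustrate the third advantage, the C sketch below carries the dependent r_gap/nogap chains of two independent reference strings in one loop body; the two chains share no data, so a superscalar processor can overlap their MAX and ADD operations. The recurrence is abbreviated (the q_gap term and the j-1 row are left out) and the constants and names are illustrative, not the paper's code.

    #define START_GAP  (-12)   /* illustrative penalty values */
    #define EXTEND_GAP (-2)

    static int max2(int a, int b) { return a > b ? a : b; }

    /* state[] = {r_gap, nogap, score}; V0/V1 are precomputed score rows of
     * two different reference strings.  The two update chains below are
     * independent, so their operations can issue concurrently. */
    void inner_loop_two_strings(const int *V0, const int *V1, int width,
                                int *state0, int *state1)
    {
        int r_gap0 = state0[0], nogap0 = state0[1], score0 = state0[2];
        int r_gap1 = state1[0], nogap1 = state1[1], score1 = state1[2];

        for (int k = 0; k < width; k++) {
            r_gap0 = EXTEND_GAP + max2(r_gap0, nogap0 + START_GAP);   /* chain 0 */
            nogap0 = max2(max2(nogap0 + V0[k], 0), r_gap0);
            score0 = max2(score0, nogap0);

            r_gap1 = EXTEND_GAP + max2(r_gap1, nogap1 + START_GAP);   /* chain 1 */
            nogap1 = max2(max2(nogap1 + V1[k], 0), r_gap1);
            score1 = max2(score1, nogap1);
        }
        state0[0] = r_gap0; state0[1] = nogap0; state0[2] = score0;
        state1[0] = r_gap1; state1[1] = nogap1; state1[2] = score1;
    }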

                        SP2                  T3D                  Farm
    Tile height     1     2     3       1     2     3       1     2     3
    1 string      14.7  13.7* 14.2    31.6  30.0  28.7*   43.1  37.5  36.7*
    2 strings     17.9  20.1  20.9    38.6  38.4  38.3    43.1  55.9  49.0
    3 strings     17.6  23.8  25.0    37.2  37.1  36.9    62.0  62.0  54.8

Figure 17: Performance of the fixed-point variant, in cycles per node, for various register-tile choices. The best shape for each machine is marked with an asterisk. (These figures, unlike the earlier ones, represent times for only the operations of the ISG. They do not include the amortized cost of computing the V array or other overhead.)

We experimented with nine different register tiles. In Figures 17 and 18, the rows of the table represent computing 1, 2, or 3 reference strings at a time. The columns represent using tiles that computed 1, 2, or 3 values of j at a time. The tiles were iterated in the k direction, so the inner loop executed a three-dimensional pile of fifty tiles, with from one to nine table entries per tile. These piles were iterated to form the cache tiles of width fifty described earlier. The query and reference strings were all 10,000 characters long. The results for the fixed-point variant are shown in Figure 17.19 For this variant, the best performance occurred on all three architectures for register tiles that computed one string at a time, but that were 2 or 3 nodes high in the j direction. Other register tile choices resulted in performance no worse than 50% slower. It is interesting to observe the general trend: making the tiles one unit larger in one or the other dimension either resulted in a small change in performance (usually an improvement), or a large (several cycle) drop in performance. The latter happened when the tile became too large to fit in registers; the compiler introduced spill code requiring extra store and load instructions. The results for the floating-point variant are in Figure 18. Notice that on the SP2, this variant runs significantly faster than the fixed-point variant. This is primarily due to the increased instruction-level parallelism exhibited by the Power2 processor on floating-point code.20 Each of the two floating-point units can issue a two-cycle fused multiply-add instruction (ADD-MAX operations) each cycle, for a maximum of four floating-point operations per cycle, and these can be concurrent with the two integer instructions (loads and stores) initiated each cycle. Thus, the Power2 processor can initiate as many as six operations per cycle on floating-point code, as opposed to only two for fixed-point code.

19 The Alpha farm takes more cycles per node than the T3D, even though both use the same processor. The T3D's compiler appears to schedule instructions better, resulting in fewer data hazards. It also translates most of the MAX operations into compare and conditional branch instructions, while the Alpha farm's compiler makes more use of the Alpha's conditional move operation. It appears that the Alpha's dynamic branch prediction is quite accurate on this code, giving an advantage to the T3D's translation.

20 It is also the case that both the Power2 and the Alpha require more machine instructions to implement a fixed-point MAX than a floating-point MAX.


                        SP2                  T3D                  Farm
    Tile height     1     2     3       1     2     3       1     2     3
    1 string      10.4   8.0   8.0    50.8  44.3  42.3    45.8  37.8  31.9
    2 strings      5.3   5.1   4.8    47.1  33.8  33.1    40.2  36.1  26.3
    3 strings      5.1   4.9   6.9    63.4  49.8  51.2    42.8  34.3  36.4

Figure 18: Performance of the floating-point variant, in cycles per node, for various register-tile choices.

In fact, all of the loads and stores of the floating-point variant can be executed concurrently with ADD's or MAX's, and three of the five ADD instructions per node can get fused with three of the six MAX's. On all three machines, the best floating-point variant performance was obtained for tiles that executed two strings at a time. This is a consequence of the fact that the floating-point code is executed with a higher degree of concurrency, and using longer pipelines, than the fixed-point code. A node of the floating-point variant on the SP2 entails eight two-cycle instructions. The critical path for this code is only four (two-cycle) instructions (due to the fusing of an ADD-MAX pair of operations). For a single string, the SP2 achieves, but does not break, this eight-cycle barrier.21 On the other hand, if there is sufficient parallelism available within a tile, the theoretical peak performance on the SP2 is four cycles per node (the eight floating-point instructions initiated two per cycle). When the register tile included independent nodes (more than one reference string), the actual performance got close (within 20%) to this peak speed.

Another method of populating the register tile with independent nodes of the ISG is by using a staggered tile, as we had done with the PDE-like example. We performed only one experiment of this type, using a tile that computed three diagonally-adjacent nodes; that is, it computed the table entries with indices (j-2,k), (j-1,k-1), and (j,k-2). The SP2 floating-point performance was 6.4 cycles per node. Thus, it broke the 8-cycle barrier, but did not perform as well as, for instance, using three independent nodes from three separate reference strings (5.1 cycles per node). It is also more complicated to program the loop initialization and completion code of the diagonal tile.

It is possible to spend an unbounded amount of time attempting to understand and explain each performance figure. It is probably more productive to step back and observe the general principles.

- The amount of instruction-level parallelism exhibited by a processor depends on the nature of the code it is executing.
- A measure of the amount of instruction-level parallelism available in a code sequence is the number of instructions needed to execute the code, divided by the length of the longest sequence of dependent instructions (eight over four in the case of dependent nodes for the floating-point variant on the SP2).
- Register tiles should have enough parallelism available to exploit the parallelism of the processor. It is sometimes easiest to accomplish this by including independent operations in the high-level tiles, even though this may be a sub-optimal choice for node-level and cache tiles.
- Among tiles with sufficient instruction-level parallelism, larger tiles can have better performance, since the communication-to-computation ratio is better ("communication" at the register level refers to load and store operations). However, if the tile is so large that spill code is needed, the performance drops dramatically.

An example of these principles is the floating-point variant on the SP2.
The PMH model of the floating-point unit of the SP2 has four functional units;22 the code for dependent nodes has 2-way parallelism available (eight instructions per node over four dependent instructions), and the best performance occurred for tiles that used two independent reference strings.

21 Theoretically, it should be possible to do better than eight cycles per node on the tiles 2 or 3 nodes high by using software pipelining. However, the XLF compiler, which is generally considered to be extremely good, didn't manage to succeed.

22 There are two 2-stage floating-point pipes; thus the four FMA instructions issued in a two-cycle period cannot have any data dependences.


Similar calculations explain the need for only one reference string for the fixed-point variant, and they also explain the Alpha results.
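These "similar calculations" can be captured in a small back-of-envelope model, given here as a hedged C sketch: cycles per node are bounded below by instruction throughput and by the dependent chain, and supplying independent nodes only relaxes the latter. The function and its parameters are our own illustration of the reasoning, not a formula from the paper.

    /* Rough model: cycles per ISG node is the larger of the throughput bound
     * (instructions / issue width) and the latency bound (dependent chain,
     * amortized over the independent nodes in the register tile). */
    double cycles_per_node(double instrs, double issue_width,
                           double chain_len, double latency, double indep_nodes)
    {
        double throughput = instrs / issue_width;
        double chain = (chain_len * latency) / indep_nodes;
        return throughput > chain ? throughput : chain;
    }

    /* SP2 floating-point variant: cycles_per_node(8, 2, 4, 2, 1) = 8 cycles,
     * matching the single-string barrier; with two independent strings,
     * cycles_per_node(8, 2, 4, 2, 2) = 4 cycles, the peak that the measured
     * 4.8-5.3 cycles per node approaches. */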

6 Conclusion and Future Work

This paper has presented hierarchical tiling as a methodology for hand-tuning codes for high performance. In particular, hierarchical tiling can improve parallelism and locality at all levels of the memory/processor hierarchy: functional units, registers, caches, multiple processors, and disks. Hierarchical tiling achieves its goal of high performance by considering these multiple levels and their interaction simultaneously.

Hierarchical tiling uses an explicit model of the target architecture. The model includes the degree and levels of parallelism, and the capacities, transfer times, and block sizes of the various storage mechanisms, but excludes other details of the machine such as cache associativity. To get good performance on some problems, these details might be important.

Hierarchical tiling may be an effective approach to developing portable and truly high-performance programs. Ideally, if a program could be written in terms of the parameters of the architectural model, then tuning it for a new platform would simply be a matter of substituting in appropriate values. Unfortunately, the translation from a tiling to a conventional program requires using different ad hoc methods for each level of the memory/processor hierarchy. Increased programmer control over prefetching would make this task easier, as would an explicit programmer interface. Alternatively, hierarchical tiling could be an automatic transformation in an optimizing compiler; current work by the Hierarchical Tiling Group at UCSD has this aim.

A distinguishing feature of hierarchical tiling is its incorporation of storage optimization as part of the methodology. Memory for tile surfaces is only assigned at levels where it is needed, and can be reassigned after use. Traditional tiling methods, where names are allocated in a global or per-processor address space, don't as conveniently allow data to be copied and rearranged, and so may not achieve as good spatial locality.

Section 3.2 recommends keeping the movement of a tile's surface separate from the execution of the tile, as well as using distinct names for the input and output surfaces. The intent has been to keep the methodology and its effects simple. For instance, using separate input and output names prevents the introduction of extraneous output dependences. In the examples we have studied in detail, the discipline has not caused significant performance loss. However, more experience may lead to revised guidelines.

Except at the highest level of tiling (interprocessor parallelism), we used a static, compile-time assignment of tiles to processing elements. For many programs, there may be advantages to using dynamic assignment at the lower levels also.

The question arises whether optimizations such as storage assignment and scalar replacement shouldn't be solely the responsibility of independent, general-purpose compiler phases. The example of this paper shows that the goal-directed approach of hierarchical tiling performed significantly better than the available commercial compilers, including the state-of-the-art IBM XLF compiler.

Acknowledgments

We would like to thank Edith Schonberg and Peter Sweeney for helpful discussions about SP1 communication, and Bob Walkup for his assistance in using the SP1. We also thank the members of the hierarchical tiling group at UCSD (Karin Hogstedt, Tony Istvan, Nick Mitchell, Rich Purdy, Beth Simons, and Michelle Mills Strout) for their enthusiasm and ideas, and Val Donaldson for some timely feedback.

References

[ACFS94] Alpern, B., L. Carter, E. Feig, and T. Selker, "The Uniform Memory Hierarchy Model of Computation," Algorithmica, June 1994.

[ACF93] Alpern, B., L. Carter, and J. Ferrante, "Modeling Parallel Computers as Memory Hierarchies," Proceedings, Programming Models for Massively Parallel Computers, September 1993.

[ACG95] Alpern, B., L. Carter, and K. S. Gatlin, "Microparallelism and High-Performance Protein Matching," Supercomputing 95, December 1995.


[B92] Banerjee, U., "Unimodular Transformations of Double Loops," in Advances in Languages and Compilers for Parallel Processing, A. Nicolau and D. Padua, editors, MIT Press, 1992.

[CFH95] Carter, L., J. Ferrante, and S. Flynn Hummel, "Efficient Parallelism via Hierarchical Tiling," Proc. of SIAM Conference on Parallel Processing for Scientific Computing, Feb. 1995.

[CFH95] Carter, L., J. Ferrante, and S. Flynn Hummel, "Hierarchical Tiling for Improved Superscalar Performance," Proc. of International Parallel Processing Symposium, Apr. 1995.

[TC94] Cornell Theory Center Online Documentation System, Documentation for the IBM Scalable POWERparallel System SP1 (available via gopher), April 1994.

[GJG88] Gannon, D., W. Jalby, and K. Gallivan, "Strategies for Cache and Local Memory Management by Global Program Transformation," Journal of Parallel and Distributed Computing, Vol. 5, No. 5, October 1988, pp. 587-616.

[HA90] Hudak, D. E. and S. G. Abraham, "Compiler Techniques for Data Partitioning of Sequentially Iterated Parallel Loops," IEEE Transactions on Parallel and Distributed Systems, Vol. 2, No. 3, July 1991, pp. 318-328.

[H93] Flynn Hummel, S., I. Banicescu, C.-T. Wang, and J. Wein, "Load Balancing and Data Locality via Fractiling: An Experimental Study," Languages, Compilers and Run-Time Systems for Scalable Computers, B. K. Szymanski and B. Sinharoy, editors, Kluwer Academic Publishers, Boston, MA, 1995, pp. 85-89.

[IT88] Irigoin, F. and R. Triolet, "Supernode Partitioning," Proc. 15th ACM Symp. on Principles of Programming Languages, January 1988, pp. 319-328.

[JM86] Jalby, W. and U. Meier, "Optimizing Matrix Operations on a Parallel Multiprocessor with a Hierarchical Memory System," Proc. International Conference on Parallel Processing, August 1986, pp. 429-432.

[KM92] Kennedy, K. and K. S. McKinley, "Optimizing for Parallelism and Data Locality," Proc. International Conference on Supercomputing, July 1992, pp. 323-334.

[LRW91] Lam, M., E. Rothberg, and M. Wolf, "The Cache Performance and Optimization of Blocked Algorithms," Proc. ASPLOS, April 1991, pp. 63-74.

[LCW94] Larus, J. R., S. Chandra, and D. A. Wood, "CICO: A Practical Shared-Memory Programming Performance Model," in Portability and Performance for Parallel Processing, A. Hey and J. Ferrante, editors, John Wiley and Sons, 1994.

[L90] Lee, F. F., "Partitioning of Regular Computation on Multiprocessor Systems," Journal of Parallel and Distributed Computing, Vol. 9, 1990, pp. 312-317.

[NJVLL93] Navarro, J. J., A. Juan, M. Valero, J. M. Llaberia, and T. Lang, "Multilevel Orthogonal Blocking for Dense Linear Algebra Computations," IEEE Technical Committee on Computer Architecture Newsletter, Fall 1993, pp. 10-14.

[PW86] Padua, D. A. and M. J. Wolfe, "Advanced Compiler Optimizations for Supercomputers," CACM, December 1986, pp. 1184-1201.

[RS91] Ramanujam, J. and P. Sadayappan, "Tiling Multidimensional Iteration Spaces for Nonshared Memory Machines," Proceedings of Supercomputing, November 1991, pp. 111-120.

[RAP87] Reed, D. A., L. M. Adams, and M. L. Patrick, "Stencil and Problem Partitionings: Their Influence on the Performance of Multiple Processor Systems," IEEE Transactions on Computers, July 1987, pp. 845-858.

[SSM89] Saavedra-Barrera, R. H., A. J. Smith, and E. Miya, "Machine Characterization Based on an Abstract High-Level Language Machine," IEEE Transactions on Computers, December 1989, pp. 1659-1679.

[TAC94] Thomborson, C., B. Alpern, and L. Carter, "Rectilinear Steiner Tree Minimization on a Workstation," Computational Support for Discrete Mathematics, N. Dean and G. Shannon, editors, Volume 15 of DIMACS Series in Discrete Mathematics and Theoretical Computer Science, American Mathematical Society, 1994.

[WL91] Wolf, M. E. and M. S. Lam, "A Data Locality Optimizing Algorithm," Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, June 1991, pp. 30-44.

[WL91b] Wolf, M. E. and M. S. Lam, "A Loop Transformation Theory and an Algorithm to Maximize Parallelism," IEEE Transactions on Parallel and Distributed Systems, Vol. 2, No. 4, October 1991, pp. 452-471.

[W87] Wolfe, M., "Iteration Space Tiling for Memory Hierarchies," Parallel Processing for Scientific Computing, G. Rodrigue, editor, SIAM, 1987, pp. 357-361.

[W89] Wolfe, M., "More Iteration Space Tiling," Proceedings of Supercomputing, November 1989, pp. 655-664.

