2007 International Symposium on Field-Programmable Custom Computing Machines
Scientific Application Acceleration with Reconfigurable Functional Units

Kyle Rupnow†*, Keith Underwood*, Katherine Compton†
Sandia National Lab*   University of Wisconsin-Madison†
[email protected]  [email protected]  [email protected]

* Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.

Abstract

While scientific applications in the past were limited by floating point computations, modern scientific applications use more unstructured formulations. These applications have a significant percentage of integer computation, which is increasingly a limiting factor in scientific application performance. In real scientific applications employed at Sandia National Labs, integer computations constitute on average 37% of the application operations, forming large and complex dataflow graphs. Reconfigurable Functional Units (RFUs) are a particularly attractive accelerator for these graphs because they can potentially accelerate many unique graphs with a small amount of additional hardware. In this study, we analyze application traces of Sandia's scientific applications and the SPEC-FP benchmark suite. First, we select a set of dataflow graphs to accelerate using the RFU; then we use execution-based simulation to determine the acceleration potential of the applications when using an RFU. On average, a set of 32 or fewer graphs is sufficient to capture the dataflow behavior of 30% of the integer computation, and more than half of Sandia applications show an improvement of 5% or more.

1 Introduction

Scientific computing, in fields from molecular dynamics and shock physics to circuit simulation and advanced mathematics, provides a constant driver for higher performance. Advances in scientific fields such as these often depend on higher model accuracy, simulations with a larger number of elements, or wider searches of the parameter space. Increased computing performance is therefore critical for the advancement of many fields. Recent commercial trends in both multi-threaded and multi-core
processors emphasize the emergence of new barriers to traditional mechanisms to increase processor performance. Leveraging application parallelism is essential, but the structures that extract instruction-level parallelism are already on the processor's critical path. Reconfigurable Functional Units (RFUs), however, extract dataflow locality: the recurrence of common computation patterns. RFUs can exploit application parallelism outside of the processor's critical path. Furthermore, since an RFU encodes many instructions into one operation, it can reduce pressure on issue logic.

Many groups have used reconfigurable hardware to accelerate applications. Architectures such as GARP [11] and DISC II [23] use a coprocessor model, requiring large computation kernels to overcome the high overhead of communication. Stretch Inc.'s S5 processors also communicate with the reconfigurable logic through dedicated memory operations [2]. These architectures chose a computation model that benefits from large computation kernels implemented on fine-grained reconfigurable logic. Chimaera [10], PRISC [15], and OneChip [24], on the other hand, use an RFU model that supports smaller kernels that communicate through the register file. Chimaera limits the number of inputs to at most nine (hardcoded into the configuration), and uses an approach to outputs similar to that described in section 3.1, but limited to a single output per opcode. PRISC limits the inputs to two, and outputs to one. Although fine granularity can allow architectures like PRISC [15] to optimize for operations larger or smaller than the processor's native word width, a fine-grained structure can be inefficient for complex word-sized operations.

Other approaches to application acceleration focus on dataflow graphs. Interlock collapsing [14], an early general-purpose design, only considers graphs with two nodes. Other techniques that target embedded applications, such as Dataflow Mini-Graphs [3] and the CCA [5], relax the two
instruction limit, but still limit graph inputs and outputs. The CCA additionally limits graphs to conform to their static RFU placement. While these techniques have shown significant speedup, our previous work [17] demonstrates significant differences in the requirements for Sandia's scientific applications. The integer dataflow is more complex than the dataflow in SPEC-FP; Sandia's dataflow graphs average more instructions, more inputs and outputs, and larger height and width than dataflow graphs in SPEC-FP, which demonstrates the need for more capable RFUs than existing designs. However, more complex coprocessor models are also inappropriate, as most of Sandia's integer dataflow generates load and store addresses for floating point operands. These operations require as low an overhead as possible to avoid negatively impacting the floating point computations.

In this work, we perform an architectural exploration of high-level RFU designs, relaxing the graph shape constraints and the input and output constraints. However, it is still important that the explored architectures are reasonable in terms of register file bandwidth, instruction encoding, and chip area, without impacting processor cycle time. We maintain a reasonable register file bandwidth by limiting the number of registers the RFU can access per processor clock cycle. To limit the RFU's effects on chip area and processor cycle time, we assume that our RFU will be a coarse-grained, multi-cycle functional unit. Within these restrictions, we examine the effect that the RFUs have on the issue logic efficiency of the processor, as one RFU operation can replace several software instructions. We then show how the increased issue logic efficiency affects application acceleration.
2 Related Work
Reconfigurable Functional Units (RFUs) consist of computational resources such as fine-grained lookup tables (LUTs) or coarse-grained ALUs, connected by a reconfigurable routing fabric. These units implement (preferably common) sequences of instructions (determined manually or detected at compile- or run-time) in their specialized hardware to accelerate applications. Logically, this execution model extracts the dataflow locality of programs compiled for sequential execution.

Dataflow research originated in the 1960's in studies of graph theoretical models of parallel computation by Karp and Miller [12] and Rodriguez [16]. These papers studied parallel computation models as an alternative to the von Neumann model of sequential execution, with the goal of reusing common computation and communication patterns (dataflow). Dataflow is expressed as a graph, where nodes are computations and edges are the communication between them. Dataflow architectures have no program counter, and thus no branching behavior. All decisions otherwise made by branches in a conventional microprocessor are instead expressed by conditionally sending data between processing units. The first dataflow architectures that appeared in the early 1970's [6, 7] correspond closely to the graph model: the hardware itself consisted of functional units that communicate directly using FIFO queues. Dataflow architecture designs of the 1980's [8, 9, 13, 18] built on early designs by adding tagging schemes, dynamic instruction allocation, and communication hierarchy. However, the designs still required dataflow-specific programming languages, and the memory and silicon integration technology of the time struggled to provide the machines with sufficient memory bandwidth. More recently, dataflow architectures have experienced a resurgence with architectures such as WaveScalar [20], RAW [21], and TRIPS [19]. However, these dataflow architectures still depend on an explicitly parallel computing paradigm.

Despite the improvements to the capabilities of dataflow processors, traditional microprocessors are still the dominant technology. Furthermore, there is a significant amount of dataflow locality in von Neumann control-flow architectures, and RFUs are poised to extract that dataflow locality. Recently, RFUs have primarily been studied in the context of embedded and streaming applications [3, 5]. However, scientific applications have significantly different behavior than these programs, and real scientific applications have larger and more complex dataflow behavior than that of SPEC-FP [17]. Despite this behavior difference, many groups still use the SPEC suite of benchmarks, and SPEC may be representative of other applications. Therefore, we present a study of several prospective RFU designs and their potential for accelerating the scientific applications in use at Sandia National Labs and the benchmarks of SPEC-FP.
3 Method
Using a combination of trace-based analysis and execution-based simulation, we perform an architectural exploration to determine how different high-level RFU specifications affect issue queue efficiency and therefore application acceleration. Both the trace analysis and the execution-based simulation use tt6e trace files—a
standard trace format for the G4 instruction set [1]. Both the analysis and execution phases use the SimpleScalar [4] definition files and decode structures. There are seventeen Sandia application traces representing seven programs with two to four input sets each. There are fifteen SPEC-FP traces covering fourteen applications. 179.art is the only SPEC-FP benchmark with two sets of inputs. These benchmarks are described in Table 1. Each trace is a four billion instruction interval selected to be representative of the respective application based on performance register profiling and source code analysis.
Table 1 - Benchmark Description

Suite     Benchmark      Description
Sandia    Alegra         Shock physics
Sandia    Cth            Shock physics
Sandia    cube3          Sparse Matrix Solver
Sandia    Its            Radiation transport
Sandia    Lammps         Molecular dynamics
Sandia    Mpsalsa        Chemically reactive flow
Sandia    Xyce           Circuit simulation
SPEC-FP   168.wupwise    Physics/Quantum Chromodynamics
SPEC-FP   171.swim       Shallow Water Modeling
SPEC-FP   172.mgrid      Multi-grid Solver: 3D Potential Field
SPEC-FP   173.applu      Parabolic/Elliptic PDE
SPEC-FP   177.mesa       3D Graphics Library
SPEC-FP   178.galgel     Computational Fluid Dynamics
SPEC-FP   179.art        Image Recognition / Neural Networks
SPEC-FP   183.equake     Seismic Wave Propagation
SPEC-FP   187.facerec    Image Processing: Face Recognition
SPEC-FP   188.ammp       Computational Chemistry
SPEC-FP   189.lucas      Number Theory / Primality Testing
SPEC-FP   191.fma3d      Finite-element Crash Simulation
SPEC-FP   200.sixtrack   High Energy Nuclear Physics Accel.
SPEC-FP   301.apsi       Meteorology: Pollutant Distribution
3.1 Experimental Architectures

The goal of our architectural exploration is to isolate the effects of the RFU on the issue logic while still maintaining feasibility and sufficient generality to implement the complex graphs found in Sandia's scientific applications. For this initial study, we conservatively assume that the total latency of a dataflow graph in the RFU is equal to the latency the graph would have in a traditional processor, but without resource contention (i.e., latency is caused only by functional units and data dependence). These restrictions allow us to isolate application acceleration due to increased effective issue width and issue window size, and reduced functional unit contention. The processor's issue width is effectively increased because one RFU opcode can represent multiple instructions, and issuing a single RFU operation is equivalent to issuing multiple instructions. Similarly, the effective issue window size is increased because the space that would have been used by the multiple instructions encapsulated into a single RFU operation may be occupied by later instructions now able to fit into the window.

A dataflow graph can include many instructions, and therefore many inputs and outputs. To maintain a realistic RFU architecture, we limit the required register file bandwidth by limiting the amount of data encoded by one opcode. We encode each dataflow graph using one or more RFU opcodes, issued in sequence, to encapsulate the required information. Three parameters provide the necessary input and output limitations: maxInput limits the number of input registers that one opcode can encode, maxOut_1 limits the number of outputs encoded by an opcode that also encodes inputs, and maxOut_2 limits the number of outputs in an opcode that only encodes outputs. We test three approaches to choosing parameter values: a balance of inputs and outputs per opcode, an emphasis on input coding, and an emphasis on output coding. The seven architectures that result are detailed in Table 2, where the numbers in an architecture's name are the values used for the three parameters listed above. Based on the limits for a particular RFU architecture, each graph of instructions is encoded into one or more RFU opcodes. The M-format (Figure 1) of the PPC instruction set can accommodate five 5-bit register locations in each instruction. Five of our seven test architectures could use the M-format without additional modifications.
Figure 1 - PPC Instruction Set: M Instruction Format [22]
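As a quick consistency check (ours, not from the original study), the following Python sketch applies the stated encoding rule, namely that an opcode carries either maxInput inputs plus maxOut_1 outputs or maxOut_2 outputs alone, to the Table 2 parameter sets and reports which of them fit within the five 5-bit register fields of the M-format. Under this reading it reproduces the paper's count of five architectures that need no format changes.

    # Illustrative sketch (not the authors' tooling): check which tested RFU
    # architectures fit the PPC M-format, which provides five register fields.
    ARCHS = {                    # (maxInput, maxOut_1, maxOut_2) from Table 2
        "RFU312": (3, 1, 2), "RFU313": (3, 1, 3), "RFU314": (3, 1, 4),
        "RFU323": (3, 2, 3), "RFU344": (3, 4, 4), "RFU414": (4, 1, 4),
        "RFU523": (5, 2, 3),
    }
    M_FORMAT_REGISTER_FIELDS = 5

    def fits_m_format(max_in, max_out1, max_out2):
        # An input-carrying opcode needs max_in + max_out1 register fields;
        # an output-only opcode needs max_out2 fields.
        return max(max_in + max_out1, max_out2) <= M_FORMAT_REGISTER_FIELDS

    for name, params in sorted(ARCHS.items()):
        print(name, "fits M-format" if fits_m_format(*params) else "needs a wider encoding")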
Table 2 - Tested RFU Architectures

          maxInput   maxOut_1   maxOut_2
RFU312        3          1          2
RFU313        3          1          3
RFU314        3          1          4
RFU323        3          2          3
RFU344        3          4          4
RFU414        4          1          4
RFU523        5          2          3

Unlike other functional unit types, an RFU's latency varies depending on the dataflow graph it
implements. For simulation, we tag each RFU opcode with its associated latency. To compute this latency, we first sort the inputs to the graph so that inputs are encoded in the order they are required. Next, we sort the outputs so they are encoded in the order they complete. During instruction encoding, we select the next maxInput inputs and determine which graph operations can complete. The maximum latency of those graph operations is saved as the opcode's latency. Then, we examine the output list, and if any of the outputs will complete based on the inputs already encoded, we also encode the next outputs. Finally, once all of the inputs have been encoded, we create additional output instructions (each with one cycle of latency) as necessary to encode the production of any remaining outputs. We use this conservative estimate because the RFU internals are not yet designed, and the composition of the RFU's internals depends on the parameter values we choose. Furthermore, a conservative estimate allows us to better isolate the RFU's effects on the issue queue from those of more direct routing between computation units within the RFU.
Figure 2 - Sample Graphs

Figure 2 (a) and (b) illustrate the encoding and latency measurement process for the RFU312 architecture. Figure 2 (a) shows a graph with four inputs and three outputs. First, we encode inputs A, B and C and calculate that the first level of the graph can complete with those three inputs. Then we determine output X will also be ready after the first graph level, so we also encode it. Next, we encode input D, and because D is the last input, we also know the remaining graph levels can complete. However, because we can only encode one output in this opcode, we encode output Z first, and encode output Y with an additional opcode. In Figure 2 (b) there are only three inputs, so all inputs are encoded with the first opcode. Again, since we can only encode one output in this opcode, we choose output X, and encode Y and Z with an additional opcode.
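The conservative latency model of this section can be made concrete with a few lines of code. The sketch below is our illustration only: it assumes one cycle per integer operation and computes a graph's critical-path latency, the quantity that the opcode-tagging procedure above distributes across the generated RFU opcodes. The node names and the example shape (loosely modeled on Figure 2 (a)) are assumptions.

    # Illustrative sketch: conservative RFU graph latency = critical-path depth,
    # assuming no resource contention and one cycle per integer operation.
    from functools import lru_cache

    GRAPH = {                 # node -> predecessors inside the graph (external inputs omitted)
        "n1": [], "n2": [],   # first level, fed directly by graph inputs
        "n3": ["n1", "n2"],   # second level
        "n4": ["n3"],         # third level
    }
    LATENCY = {node: 1 for node in GRAPH}     # assumed unit latency per operation

    @lru_cache(maxsize=None)
    def depth(node):
        preds = GRAPH[node]
        return LATENCY[node] + (max(depth(p) for p in preds) if preds else 0)

    print("conservative graph latency:", max(depth(n) for n in GRAPH), "cycles")   # 3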
3.2 Simulation

We simulate in two phases. In the first phase (analysis and selection), we analyze each trace to detect and select dataflow graphs to accelerate. In the second phase (execution), we again use the traces, this time performing an execution-based simulation that uses the list of selected dataflow graphs from the first phase as a list of graphs to convert to RFU operations.
3.2.1 Graph Analysis and Selection Phase
In the graph analysis and selection phase, we examine the dataflow graphs of integer computation instructions within a given basic block, and choose those to accelerate. We define a dataflow graph as a sequence of dependent integer computation instructions entirely contained within a basic block (instructions that appear between two branch instructions in the execution trace). By limiting a dataflow graph to one basic block, our execution model does not require a special case for branch misprediction, because a graph will either execute in its entirety or not at all. These dataflow graphs can potentially include multiple outputs. To be conservative, we assume all registers in use during graph execution are live. Thus, the number of graph outputs is equal to the number of unique registers used in the graph.

To generate the graphs, we designate one instruction as the graph's "primary output" and then include all ancestral integer computation instructions within the basic block. Graph inputs can only come from non-integer computation instructions within the basic block (e.g. branches, floating point, memory operations) or integer computations outside the basic block. The inputs cannot be integer computations within the basic block, as those instructions would have been added to the graph during construction.

For each integer computation instruction in the basic block, we create a candidate dataflow graph with that instruction as the primary output. We then use a greedy algorithm to choose which of the candidate graphs will actually be used. First, we choose the largest possible graph and mark all of its constituent instructions as used. Then, we repeatedly choose the largest graph containing only instructions that are not yet used, and mark its instructions. We continue until no viable graphs remain within the basic block, or until the remaining graphs are smaller than a user-defined threshold that avoids graphs that are "too small". This process is repeated for each basic block in the trace, yielding a final list of all graphs.
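The selection procedure just described can be sketched in a few lines. The code below is our illustration, not the authors' implementation; the basic-block representation (a list of instruction records with an is_int flag and in-block integer dependence indices) is an assumption.

    # Illustrative sketch of per-basic-block candidate construction and greedy selection.
    def candidate_graph(block, idx):
        """Instruction idx plus all of its integer-computation ancestors in the block."""
        graph, work = set(), [idx]
        while work:
            i = work.pop()
            if i not in graph:
                graph.add(i)
                work.extend(block[i]["deps"])   # in-block integer producers it reads
        return frozenset(graph)

    def select_graphs(block, min_size=2):
        candidates = [candidate_graph(block, i)
                      for i, inst in enumerate(block) if inst["is_int"]]
        used, chosen = set(), []
        for g in sorted(candidates, key=len, reverse=True):   # largest graphs first
            if len(g) >= min_size and not (g & used):
                chosen.append(g)
                used |= g                                     # mark instructions as used
        return chosen

    # Toy block: three dependent integer ops plus one isolated integer op.
    block = [{"is_int": True, "deps": []},
             {"is_int": True, "deps": [0]},
             {"is_int": True, "deps": [0, 1]},
             {"is_int": True, "deps": []}]
    print(select_graphs(block, min_size=2))     # one graph: instructions {0, 1, 2}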
To determine which of these candidate graphs are important for acceleration, we must count the number of times they are used. Because different parts of an application may perform computations with similar patterns, we also need to detect graph equivalence. We consider two graphs equal if they contain the same number and types of instructions, the instructions communicate in the same pattern, and the total number of inputs and outputs is the same. Two graphs are still equal if they use different registers or appear at different instruction addresses. The immediate values in use in equivalent graphs may be different; we assume these values will be encoded in the RFU opcodes.

Using the list of all unique graphs and the number of times they appear, we determine which graphs are "important" and select those graphs for acceleration. We define important graphs as those that cover 1% or more of the total number of integer instructions in the trace. This refinement selects between 1% and 10% of the total number of graphs in most cases, which yields fewer than 32 graphs per trace in all of our test cases, and fewer than 16 graphs in more than half of the traces. Because we select a small number of graphs per application, we assume that all of the graph configurations are pre-loaded at application load time. Furthermore, because the RFU is coarse-grained, we assume that each configuration will only require a small number of bits, and thus the configurations can be multi-context with the context encoded in the RFU opcode.
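One way to realize the equivalence test and the 1% importance threshold is sketched below; this is our illustration, and the graph record layout (opcode list, internal edge list, input and output counts) is an assumption rather than the authors' data structure.

    # Illustrative sketch: canonical graph signatures and the 1% coverage threshold.
    from collections import Counter

    def signature(graph):
        """Canonical form: opcode sequence (program order), internal dependence
        pattern, and input/output counts. Registers, addresses, and immediate
        values are deliberately excluded so equivalent graphs match."""
        return (tuple(graph["ops"]),
                tuple(sorted(graph["edges"])),     # (producer_idx, consumer_idx) pairs
                graph["n_in"], graph["n_out"])

    def important_graphs(occurrences, total_int_insts, threshold=0.01):
        """Keep graph shapes whose dynamic occurrences cover at least 1% of the
        trace's integer instructions."""
        covered = Counter()
        for g in occurrences:                      # one record per dynamic occurrence
            covered[signature(g)] += len(g["ops"])
        return [sig for sig, insts in covered.items()
                if insts / total_int_insts >= threshold]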
3.2.2 Execution Phase

The execution simulation uses a combination of a trace-based fetch engine and the SimpleScalar [4] out-of-order processor model to simulate the effects of the RFU. To combine these elements, we modified the processor model for perfect branch prediction and use "Simple Fetch" mode. In this mode, the processor moves to the next instruction without using a program counter. This way, the next instruction the processor requests is the next instruction in the trace file, the correct next instruction for trace simulation. The remainder of the simulation is cycle accurate, albeit without the effects of branch misprediction. While this does not show the effects of wrong-path execution, we can still see how using an RFU can affect the issue logic.

During the execution phase of simulation, graphs selected as per section 3.2.1 are replaced by one or more RFU opcodes that implement the graph. First, we examine the instruction stream, and reorder the instructions as necessary to preserve the data dependencies. Then, we replace all of the dataflow graph's original instructions with a series of one or more RFU opcodes. This process depends on three
parameters: maxInput, maxOut_1, and maxOut_2, which allow us to limit the register file bandwidth required by each individual RFU opcode. Figure 3 illustrates the RFU opcode generation process.

    Reorder insts to preserve data deps
    Remove all original opcodes
    while (more inputs OR more outputs) {
        if (more inputs) {
            encode maxInput inputs
            if (any outputs complete) {
                encode maxOut_1 outputs
            }
        } else {
            encode maxOut_2 outputs
        }
        calculate opcode latency
        insert new RFUop
    }
    Renumber inst addresses
Figure 3 - RFU Instruction Insertion Alg.
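A runnable rendering of the Figure 3 loop is given below. It is our sketch under simplifying assumptions: each input is annotated with the graph level at which it is first consumed, each output with the level at which it becomes ready, and one cycle per level stands in for the conservative per-opcode latencies of section 3.1. Applied to the Figure 2 (a) example under RFU312, it yields the three opcodes described in that discussion.

    # Illustrative sketch of RFU opcode generation for one dataflow graph (Figure 3).
    def encode_graph(need_level, ready_level, depth, max_in, max_out1, max_out2):
        inputs = sorted(need_level)     # encode inputs in the order they are required
        outputs = sorted(ready_level)   # encode outputs in the order they complete
        opcodes = []
        while inputs or outputs:
            if inputs:
                batch, inputs = inputs[:max_in], inputs[max_in:]
                done = (inputs[0] - 1) if inputs else depth   # deepest level now complete
                outs = [o for o in outputs if o <= done][:max_out1]
                latency = max(done, 1)                        # conservative latency
            else:                                             # all inputs already encoded
                batch, outs, latency = [], outputs[:max_out2], 1
            outputs = outputs[len(outs):]
            opcodes.append({"n_in": len(batch), "n_out": len(outs), "latency": latency})
        return opcodes

    # Figure 2 (a) under RFU312: A, B, C needed at level 1, D at level 2; output X
    # ready after level 1, Y and Z after the last level (depth of 3 assumed here).
    print(encode_graph([1, 1, 1, 2], [1, 3, 3], depth=3, max_in=3, max_out1=1, max_out2=2))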
4 Results
We present results from both the graph selection and execution phases of the simulation, and compare total application coverage, cumulative integer instruction coverage, total integer instruction coverage, and number of selected graphs for two minimum graph size constraints. We also measure average graph sizes, and percentages of graphs selected. Next, we summarize the results of the execution process, providing the average number of RFU opcodes per graph for seven different RFU designs. Finally, we present the overall simulated application speedup and compare that speedup to the baseline architecture, and a baseline architecture with a larger (128-entry) register update unit (RUU).
4.1 Summary of Selected Graphs

We performed the graph analysis process twice: once with the minimum graph size set to two instructions, and again with the minimum graph size set to three. In the SPEC-FP benchmarks, there were 106 unique graphs on average per application given a minimum graph size of three instructions, and 187 unique graphs given a minimum graph size of two instructions. In contrast, Sandia applications averaged 373 and 542 unique graphs, respectively. On average, 7% of the total SPEC-FP graphs are selected for acceleration, while 3.5% of the Sandia graphs are selected. Fewer than 32 graphs per application are selected for both SPEC-FP and Sandia, and in more than half of the traces, fewer than 16 graphs are selected.
Application coverage and integer instruction coverage correlate to the potential for the RFU to accelerate the applications. Application coverage measures the percentage of instructions in the total application that are mapped to a dataflow graph, an upper bound for the overall acceleration potential of the application. Integer instruction coverage measures the percentage of total integer instructions mappable to a dataflow graph we can potentially implement in an RFU. This metric allows us to focus specifically on the application's integer instructions and their potential. In total, Sandia requires 50% more selected graphs to obtain a similar percentage of integer instruction coverage; however, Sandia applications average a larger total number of integer computation instructions than SPEC-FP for the same trace length: 37% vs. 23%.

Figure 4 and Figure 5 show dataflow graph integer instruction coverage and application coverage (respectively) for all graphs (not just those selected) to demonstrate potential coverage. In Figure 4, SPEC-FP and Sandia have the potential to map 31% and 30% respectively of their total integer computation instructions to a dataflow graph with a minimum graph size of two instructions. However, SPEC-FP only has the potential to cover almost 14% of integer instructions with a minimum graph size of three instructions, while Sandia potentially covers 20%. The difference between these percentages for the two different graph size minimums shows that 17% of integer computation in SPEC-FP is covered by size two graphs, while only 10% is covered in Sandia.
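The two metrics are related by the integer share of each trace. The arithmetic below is ours, using only the averages quoted in the text, and shows roughly how integer instruction coverage translates into whole-application coverage.

    # Back-of-the-envelope check (our arithmetic, not measured data).
    def app_coverage(int_coverage, int_fraction):
        """Fraction of all instructions covered, given the covered fraction of
        integer instructions and the integer share of the trace."""
        return int_coverage * int_fraction

    # Text averages: Sandia traces are ~37% integer ops and ~30% of those are
    # mappable (minimum graph size of two); SPEC-FP is ~23% integer, ~31% mappable.
    print(f"Sandia : ~{app_coverage(0.30, 0.37):.1%} of all instructions")
    print(f"SPEC-FP: ~{app_coverage(0.31, 0.23):.1%} of all instructions")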
Figure 4 - Average Integer Instruction Coverage with all Graphs

In Figure 5, when the percent coverage is normalized to the total percentage of integer computations in the trace, Sandia applications cover significantly more integer instructions for both values of minimum graph size. The increases in Sandia coverage from Figure 4 to Figure 5 highlight that Sandia has a larger percentage of integer computation instructions in the trace, allowing dataflow graphs to cover far more total instructions.

Figure 5 - Average Application Coverage with all Graphs

Figure 6 - Average Integer Instruction Coverage with Selected Graphs

Figure 6 shows the integer instruction coverage for graphs selected using our presented approach. On average, SPEC-FP covers 13% of the integer computations with a minimum graph size of three and 28% with a minimum of two, compared to 17% and 25% respectively in Sandia. In SPEC-FP, more than twice as many integer instructions are covered with a minimum graph size of two than with a minimum graph size of three. This disparity emphasizes that graphs of size two are important in SPEC-FP benchmarks. In Sandia applications, while some two-node graphs are determined to be important, they do not improve integer computation coverage as significantly as they do in SPEC-FP. Graphs containing only two instructions therefore appear to be more useful in the simpler SPEC-FP benchmarks than in the more complex Sandia applications.
Table 3 - Average Graph Size

            All Graphs          Selected Graphs
            Min 2    Min 3      Min 2    Min 3
SPEC-FP     3.37     4.66       3.84     5.25
Sandia      4.58     5.62       5.56     6.86
In Table 3, we compare the average graph size for all graphs to the size of just those selected using our algorithm. We see that for all of the datapoints, SPEC-FP averages about one fewer instruction per graph than Sandia, again highlighting the additional complexity of Sandia dataflow graphs. The average graph size with a minimum of two instructions is noticeably smaller for both SPEC-FP and Sandia than the respective graph size with a minimum of three instructions. This indicates a significant number of two-instruction graphs in both application sets. Because there is a similar difference in the average graph size of the graphs selected using our algorithm, those two-instruction graphs must constitute a large enough percentage of overall integer computation to be selected.

In Figure 7, we show the total application coverage (normalized integer coverage) vs. the number of selected graphs. For both SPEC-FP and Sandia, the top five graphs are particularly important; however, in SPEC-FP the incremental gain of additional graphs levels off after 16 graphs, whereas additional graphs continue to be important throughout the top 25 graphs for Sandia's applications. Note especially the Sandia average with a minimum graph size of two: with 25 selected graphs, almost 10% of the entire application is mapped to dataflow graphs. This demonstrates potential for significant application speedup. In SPEC-FP, up to 6.5% of the entire application is mapped to dataflow graphs given a minimum size of two. With a minimum graph size of three instructions, dataflow graphs cover 6.5% and 3% of Sandia and SPEC-FP, respectively.
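Curves like Figure 7 follow directly from the per-graph coverage counts: sort the selected graphs by the fraction of the application they cover and accumulate. The sketch below is ours, with made-up toy numbers purely to show the shape of the computation.

    # Illustrative sketch: cumulative application coverage vs. number of selected graphs.
    def coverage_curve(per_graph_coverage):
        """per_graph_coverage: fraction of all trace instructions covered by each
        selected graph. Returns the running total with the best graphs first."""
        curve, total = [], 0.0
        for c in sorted(per_graph_coverage, reverse=True):
            total += c
            curve.append(round(total, 4))
        return curve

    # Toy data (not measured): a few dominant graphs followed by a long tail.
    print(coverage_curve([0.030, 0.020, 0.012, 0.008, 0.005, 0.004, 0.003]))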
Figure 7 - Total Application Coverage vs. Number of Selected Dataflow Graphs

4.2 RFU Operations

Our baseline architecture (Table 4) is an 8-way superscalar processor with a 64-entry RUU comparable to the AMD Opteron used in the Cray XT3 (Red Storm). When possible, parameters are identical to those of the Opteron; in cases where SimpleScalar requires parameters to be a power of two, we chose reasonable values close to actual Opteron values. The RFU-enhanced architectures are similar, but include a single RFU.

Table 4 - Baseline Processor Configuration

RUU Size             64
LSQ Size             32
Fetch Width          4
Issue Width          8
Commit Width         4
Memory Ports         2
ALUs                 3
Multipliers          1
FP ALUs              2
FP Multipliers       1
Instruction Cache    Perfect
Data L1              32K
L2                   2M

Figure 8 shows the average number of RFU opcodes required to describe a dataflow graph, which varies across test architectures due to different parameter values limiting the number of inputs and outputs that can be encoded per opcode. First, we see that Sandia applications require fewer opcodes per graph than SPEC-FP for all architectures except RFU312, which allows the fewest outputs per opcode. Increasing the final parameter (maxOut_2) almost universally reduces the count, indicating that the number of outputs encoded in output-only opcodes after all inputs are encoded is significant. Increasing the middle parameter (maxOut_1) also decreases the count, showing that more than one output is frequently ready before all inputs are encoded. The modest opcode reduction of the RFU414 architecture shows that increasing the number of inputs does not help encoding without a corresponding increase in the number of outputs, as in the RFU523 architecture.

Figure 8 - Average Number of RFU Ops per Dataflow Graph for Different Minimum Graph Sizes (in ()'s) and Parameter Combinations
The difference in the number of original instructions per graph and the number of RFU opcodes now required to encode it (shown in Figure 9) indicates the overall decrease in the number of total instructions handled by the issue queue for that graph. In SPEC-FP, given a minimum graph size of two instructions, we require between 20% and 41% fewer instructions to describe a dataflow graph than the instruction sequence that does not use the RFU. For a minimum graph size of three, between 42% and 52% fewer instructions are required to describe a dataflow graph. In Sandia's applications, we similarly need far fewer instructions to describe a dataflow graph: 32% to 52% fewer instructions with a minimum graph size of two, and 51% to 69% with a minimum graph size of three.

Figure 9 - Average Percent Decrease in Instructions that Describe Dataflow Graphs for Different Minimum Graph Sizes (in ()'s) and Parameter Combinations

Next we examined performance. Few SPEC-FP benchmarks show improvement for any architecture. With a minimum graph size of two, only four of the fifteen traces show a performance increase of more than 1%, and only one of those four traces (189.lucas) had more than 2%. Seven of the benchmarks showed a performance decrease. Interestingly, with a minimum graph size of three instructions, only four traces had any performance degradation (less than 0.1%). However, there were also fewer traces with a performance increase greater than 1%. This shows that while size two graphs are important, they are also sensitive to the effects of serialization. Especially when size two graphs have two outputs, our graph encoding algorithm can cause a performance decrease. Graphs with two instructions and two outputs require at least two RFU opcodes in several of our architectures, yielding no improvement to issue queue occupancy. Furthermore, because our opcode generation algorithm is conservative, there are situations in which the total RFU-enabled graph latency is greater than the original graph would be without resource contention.
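The serialization penalty for the smallest graphs can be seen with a two-line count. The helper below is our illustration (a lower bound that lets outputs ride along with input-carrying opcodes whenever slots allow): under RFU312, a two-instruction graph with two live outputs already needs two RFU opcodes, so nothing is saved in the issue queue.

    # Illustrative lower bound on the opcode count for a small graph.
    from math import ceil

    def min_opcodes(n_inputs, n_outputs, max_in, max_out1, max_out2):
        """Minimum opcodes: ceil(inputs/max_in) input opcodes, each carrying up to
        max_out1 outputs, plus output-only opcodes of max_out2 for the remainder."""
        input_ops = ceil(n_inputs / max_in)
        leftover_outputs = max(0, n_outputs - input_ops * max_out1)
        return input_ops + ceil(leftover_outputs / max_out2)

    # Two dependent instructions, up to three inputs, two live outputs, RFU312 (3, 1, 2):
    print(min_opcodes(n_inputs=3, n_outputs=2, max_in=3, max_out1=1, max_out2=2))   # 2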
Although the PowerPC instruction set allows up to five registers (including condition registers) to be modified by one opcode, none of our architectures allow more than four outputs (five of the seven allow fewer than three outputs when inputs are also encoded). Also, because we calculate the amount of computation that can complete based on graph depth, if there are more inputs to one level of the graph than can be encoded with one opcode, our computed latency is greater than strictly necessary, as the minimum latency for an opcode is one. Some of these problems will be addressed by future work with architectures that specify five registers, and by rejecting graphs that actually slow computation.

Sandia's applications fared much better. Only one application did not benefit from any RFU architecture. Applications that experienced performance degradation only did so for a minimum graph size of two, and by less than 0.1% (two of the three by less than 0.01%). With a minimum graph size of three, fourteen of the seventeen traces showed increased performance, and eight of the seventeen had a performance increase greater than 2%. Note that the only application slowed by all RFU architectures achieves only a 0.5% speedup with the expanded 128-entry RUU baseline architecture, indicating that this application has less potential for speedup than the others.

Figure 10 and Figure 11 show five of the best performing traces, followed by three of the worst, and the average of all Sandia traces. Because several of the tested architectures had nearly identical performance for every trace, we have omitted them from the following graphs for clarity. In particular, RFU312, RFU313, RFU314, and RFU414 differed by an average of 0.1%, so their results are represented by RFU312. All performance numbers are normalized to the baseline architecture, so the baseline is omitted for clarity.

Most applications achieved modest speedups using the enhanced RUU without the RFU. However, the core algorithm in cube3 is a sparse matrix vector kernel that has a significant number of dependent loads. The larger RUU combined with the perfect branch prediction allows a significant increase in the number of outstanding loads that the architecture supports on this code. Thus, the latency tolerance is significantly increased for the particular pattern of dependent loads in cube3, which accounts for the particularly large speedup we obtain in the "cube3.crs" trace. In Figure 10, with a minimum graph size of two, the lammps application has speedups greater than 10% for nearly all architectures, and as much
as 27% for the RFU323 architecture in the "lmp.flow.langevin" trace. In fact, lammps outperforms the 128-entry RUU enhanced baseline in all RFU architectures. The "cth.amr-2d-cust4" trace performs best in the RFU323 and RFU344 architectures because of the additional output encoding ability. However, the RFU523 architecture's performance is worse than the others, emphasizing that a balance of inputs and outputs is important to overall performance. On average, Sandia applications achieved a speedup on all of the architectures.

For a minimum graph size of three in Figure 11, although the maximum speedup of an application is less than in Figure 10, the worst applications have negligible performance degradation. The differences in performance between the different RFU architectures are reduced, and the average performance of Sandia's applications is greater than or equal to the average with a minimum graph size of two. The applications with the most performance
improvement were those that executed the largest number of RFU operations. More RFU executions correspond to a greater percentage of instructions compacted into RFU operations, and therefore a larger increase in the effective size of the issue queue.

Future work will include allowing more outputs to be specified per RFU opcode, to a maximum of five combined input/output registers. We plan to perform a second selection pass after encoding to eliminate graphs that degrade performance due to remaining architectural encoding restrictions. We will also examine the acceleration of individual graphs, leveraging the fact that logical operations and simple ALU operations may be performed in less than one cycle. Finally, we are currently analyzing selected graphs to aid in the design of the internals of the coarse-grained RFU structure.
Figure 10 - Application Speedup: Selected Sandia Applications with Minimum Graph Size of Two
Figure 11 - Application Speedup: Selected Sandia Applications with Minimum Graph Size of Three
5 Conclusions
Even in complex applications such as Sandia's, a small number of dataflow graphs is sufficient to cover a large percentage of the integer computation. With fewer than 32 graphs per application, up to 10% of the entire application is mapped to dataflow graphs, even when disallowing memory operations within graphs or dataflow between multiple basic blocks of computation.

We also examined multiple Reconfigurable Functional Unit (RFU) architectures with varying input/output characteristics and demonstrated that common graphs have a large number of outputs that must be efficiently encoded. However, efficient encoding is not simply fitting more operands into an opcode: the number of inputs and outputs in one opcode must be balanced. The amount of computation allowed by the number of inputs should be comparable to the number of outputs that complete. Unbalanced architectures may still have poor encoding abilities despite being able to encode a large number of total operands, leading to an inefficient use of resources.

We have shown that RFUs can accelerate scientific applications by reducing issue queue pressure through encapsulating multiple instructions per RFU opcode. This allows the issue window to effectively examine more instructions. When issuing an RFU opcode, the issue width is also effectively larger than the actual physical issue width. Our data shows that graphs with two instructions can be particularly important, yet are sensitive to serialization effects. Attempting to accelerate these graphs yielded both our best and worst results, with a maximum gain of 27% and a maximum loss of 10%. However, with a minimum graph size of three, no application or benchmark has a performance decrease of more than 0.1%, which further demonstrates that size two graphs are the primary culprit causing performance loss. In addition, despite this impact, over half of Sandia's applications achieve a speedup of 5% or more for at least one of the architectures.
References

[1] Apple Architecture Performance Groups, Computer Hardware Understanding Development Tools 2.0 Reference Guide for MacOS X, Apple Computer Inc., 2002.
[2] J.M. Arnold, "S5: The Architecture and Development Flow of a Software Configurable Processor," in IEEE International Conference on Field Programmable Technology, 2005, pp. 121-128.
[3] A. Bracy, P. Prahlad and A. Roth, "Dataflow Mini-Graphs: Amplifying Superscalar Capacity and Bandwidth," in MICRO 37: Proceedings of the 37th Annual International Symposium on Microarchitecture, 2004, pp. 18-29.
[4] D.C. Burger and T.M. Austin, "The SimpleScalar Tool Set, Version 2.0," University of Wisconsin-Madison, Madison, WI, Tech. Rep. CS-TR-97-1342, 1997.
[5] N. Clark, J. Blome, M. Chu, S. Mahlke, S. Biles and K. Flautner, "An Architecture Framework for Transparent Instruction Set Customization in Embedded Processors," in ISCA '05: Proceedings of the 32nd International Symposium on Computer Architecture, 2005, pp. 0-12.
[6] A.L. Davis, "The architecture and system method of DDM1: A recursively structured Data Driven Machine," in ISCA '78: Proceedings of the 5th Annual Symposium on Computer Architecture, 1978, pp. 210-215.
[7] J.B. Dennis and D.P. Misunas, "A preliminary architecture for a basic data-flow processor," in ISCA '75: Proceedings of the 2nd Annual Symposium on Computer Architecture, 1975, pp. 126-132.
[8] V.G. Grafe, G.S. Davidson, J.E. Hoch and V.P. Holmes, "The Epsilon dataflow processor," in ISCA '89: Proceedings of the 16th Annual International Symposium on Computer Architecture, 1989, pp. 36-45.
[9] J.R. Gurd, C.C. Kirkham and I. Watson, "The Manchester prototype dataflow computer," Communications of the ACM, vol. 28, pp. 34-52, 1985.
[10] S. Hauck, T.W. Fry, M.M. Hosler and J.P. Kao, "The Chimaera reconfigurable functional unit," in FCCM '97: Proceedings of the 5th IEEE Symposium on FPGAs for Custom Computing Machines, 1997, p. 87.
[11] J.R. Hauser and J. Wawrzynek, "Garp: A MIPS Processor with a Reconfigurable Coprocessor," in FCCM '97: Proceedings of the 5th IEEE Symposium on FPGAs for Custom Computing Machines, 1997, pp. 12-21.
[12] R.M. Karp, R.E. Miller and S. Winograd, "Properties of a model for parallel computations: Determinacy, termination, queueing," SIAM J. Appl. Math., vol. 14, pp. 1390-1411, Nov. 1966.
[13] M. Kishi, H. Yasuhara and Y. Kawamura, "DDDP: a Distributed Data Driven Processor," in ISCA '83: Proceedings of the 10th Annual International Symposium on Computer Architecture, 1983, pp. 236-242.
[14] N. Malik, R.J. Eickemeyer and S. Vassiliadis, "Interlock collapsing ALU for increased instruction-level parallelism," in MICRO 25: Proceedings of the 25th Annual International Symposium on Microarchitecture, 1992, pp. 149-157.
[15] R. Razdan and M.D. Smith, "A high-performance microarchitecture with hardware-programmable functional units," in MICRO 27: Proceedings of the 27th Annual International Symposium on Microarchitecture, 1994, pp. 172-180.
[16] J.E. Rodriguez, "A graph model for parallel computation," PhD Thesis, vol. 1, pp. 1-120, Sept. 1967.
[17] K. Rupnow, A. Rodrigues, K. Underwood and K. Compton, "Scientific applications vs. SPEC-FP: a comparison of program behavior," in ICS '06: Proceedings of the 20th Annual International Conference on Supercomputing, 2006, pp. 66-74.
[18] S. Sakai, Y. Yamaguchi, K. Hiraki, Y. Kodama and T. Yuba, "An architecture of a dataflow single chip processor," in ISCA '89: Proceedings of the 16th Annual International Symposium on Computer Architecture, 1989, pp. 46-53.
[19] K. Sankaralingam, R. Nagarajan, H. Liu, C. Kim, J. Huh, N. Ranganathan, D. Burger, S.W. Keckler, R.G. McDonald and C.R. Moore, "TRIPS: A polymorphous architecture for exploiting ILP, TLP, and DLP," ACM Transactions on Architecture and Code Optimization, vol. 1, pp. 62-93, 2004.
[20] S. Swanson, K. Michelson, A. Schwerin and M. Oskin, "WaveScalar," in MICRO 36: Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture, 2003, p. 291.
[21] M.B. Taylor, J. Kim, J. Miller, D. Wentzlaff, F. Ghodrat, B. Greenwald, H. Hoffman, P. Johnson, J. Lee, W. Lee, A. Ma, A. Saraf, M. Seneski, N. Shnidman, V. Strumpen, M. Frank, S. Amarasinghe and A. Agarwal, "The Raw microprocessor: a computational fabric for software circuits and general-purpose programs," IEEE Micro, vol. 22, pp. 25-35, 2002.
[22] J. Wetzel, E. Silha, C. May and B. Frey, Eds., PowerPC User Instruction Set Architecture, Book 1, Version 2.01, Austin, TX: IBM, 2003.
[23] M.J. Wirthlin and B.L. Hutchings, "Sequencing run-time reconfigured hardware with software," in FPGA '96: Proceedings of the Fourth ACM International Symposium on Field-Programmable Gate Arrays, 1996, pp. 122-128.
[24] R.D. Wittig and P. Chow, "OneChip: an FPGA processor with reconfigurable logic," in FCCM '96: Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines, 1996, pp. 126-135.