Automatic Tailoring of Configurable Vector Processors for Scientific Computations

D. Rutishauser1 and M. Jones2
1 Avionic Systems Division, NASA Johnson Space Center, Houston, Texas, U.S.A.
2 Bradley Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, Blacksburg, Virginia, U.S.A.

Abstract— Re-hosting legacy codes optimized for a platform such as a vector supercomputer often requires a complete re-write of the original code. This work provides a framework and approach for using configurable computing resources in place of a vector supercomputer to implement a legacy code without a long and expensive re-hosting effort. The approach automatically tailors a parameterized, configurable vector processor design to an input problem and produces an instance of the processor. Experimental data shows that the processors perform competitively when compared with a diverse set of contemporary high performance computing alternatives.

Keywords: reconfigurable, high-performance vector, scientific computing

1. Introduction

A legacy code application tailored for execution on a vector computer is the assumed acceleration target for this work. A custom vector computer is developed on a Field-Programmable Gate Array (FPGA) to run the application, and this computer is customized to the particular application. Configurable processing resources enable a wider range of vector processing architectures than is found in traditional vector computers, and allow an architecture to be tailored to a specific application, instead of the traditional approach of tailoring the application to the computer. This paper highlights contributions of the work described in detail in [1]: (1) a formulation of the problem of determining a tailoring of an architecture to a computation, (2) an algorithmic approach to solving the problem that includes a parameterized architectural framework for vector processing, (3) a scheduling/mapping algorithm that effectively uses established performance-enhancing practices in vector computing, and (4) the VectCore processor design, which provides a low-overhead implementation of the approach. The approach is evaluated using experimental data, including data produced from an end-to-end implementation in hardware.

2. Related Work

The following is an abbreviated survey of related work; a more complete discussion can be found in [1]. The viability of supercomputing on FPGA systems has been assessed [2]. The potential for performance gains has resulted in numerous studies, ranging from implementations of basic matrix computations [3] to the porting of a full scientific application to a production supercomputing system with configurable hardware capabilities [4]. Hybrid general-purpose/configurable High Performance Computer (HPC) systems have also been developed as commercial and research systems [5], [6]. Vector processing remains a relevant processing paradigm in the domain of scientific computation. The Convey hybrid HPC [6] includes vector processing functionality as an example of its application-specific "personalities." The Vector Instruction Set Architecture (ISA) Processors for Embedded Reconfigurable Systems (VIPERS) [7] is a "soft" vector processor approach that provides a configurable feature set that can be tailored to an application. The Vector-Extended Soft Processor Architecture (VESPA) [8] features a vector co-processor, also with a configurable architecture, that can be tailored to input problem requirements. These systems do not include automatic tailoring of the vector processing resources to a specific problem.

3. Approach

The system targeted by this approach is a hybrid consisting of an FPGA tightly coupled with a general-purpose processor. Computations identified as candidates for FPGA implementation are input to tools that determine an architecture tailored to each input problem. A mapping of each input computation to its tailored architecture is also produced. Custom tools produce the Hardware Description Language (HDL) representation of the tailored architecture and a microcode program to run on the architecture. Vendor tools produce the FPGA configuration file. All the custom and vendor tools run on a development workstation. A software program, consisting of the original application augmented with code that provides an interface between the software application and the FPGA, runs on the general-purpose processor. Details of the problem formulation can be found in [1]. An overview of the algorithmic approach is shown in Figure 1.

Fig. 1: Algorithmic approach overview. (Inputs G and C feed a minimization of F(X) over architecture instances X_i; each candidate is evaluated through the G → ϕ mapping against the architectural framework X; when F(X_i) = min F(X), the outputs ϕ and X* proceed to the VectCore design and implementation.)

Fig. 2: Example VectCore architecture template instance. The specification includes 4L, 5V, 1A, 1M, 3B, 1Y, and 0I.

The inputs are a representation, G, of a set of computations and an overall resource constraint, C, for the FPGA. An integer minimization algorithm is executed for an objective function F. The minimization algorithm includes an architectural framework that is the target for mapping the input computation to processors and supporting resources. The template is parameterized, allowing different specifications for the quantity and type of processing resources. A vector X specifies the parameters for a given architecture. A scheduling and allocation algorithm maps operations in G to a set of starting times, ϕ, and to resources in a particular instance, X_i, of the architectural template. The outputs of the minimization algorithm are the architecture specification, X*, that minimizes the scheduled execution time of G without exceeding the resource constraint, and the associated schedule, ϕ. The VectCore processor design provides an interface to the general-purpose processor and controls the execution of the resources specified by X*. An integrated implementation, consisting of the tailored VectCore instance and a microcode program to run the input computation, completes the solution.
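The overall tailoring loop can be sketched compactly. The resource-cost model, the exhaustive candidate search, and the stand-in scheduler below are illustrative placeholders, not the integer minimization or scheduling/allocation algorithms of [1]:

```python
from itertools import product

def resource_cost(X):
    # Placeholder FPGA-resource model; the weights are illustrative only.
    L, V, A, M, B, Y, I = X
    U = A + M + Y + I
    return 800*L + 300*V + 1500*U + 150*B*(V + U + L)

def schedule(G, X):
    # Naive stand-in for the scheduling/allocation step: assume each of the
    # U functional units retires one operation per cycle, ignoring dependencies.
    L, V, A, M, B, Y, I = X
    U = max(A + M + Y + I, 1)
    phi = {op: i // U for i, op in enumerate(G)}   # op -> start time
    return phi, -(-len(G) // U)                    # schedule and its length F(X)

def tailor(G, C, candidates):
    best = (None, None, float("inf"))
    for X in candidates:                 # integer search over parameter vectors X
        if resource_cost(X) > C:         # enforce the overall FPGA constraint C
            continue
        phi, length = schedule(G, X)
        if length < best[2]:
            best = (X, phi, length)
    return best[:2]                      # X* and its mapping phi

# Example: pick the best of a small candidate set for a 64-operation problem.
G = [f"op{i}" for i in range(64)]
cands = product([2, 4], [4, 8], [1, 2], [1, 2], [1, 3], [0, 1], [0])
X_star, phi = tailor(G, C=60000, candidates=cands)
```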

3.1 Architectural Template

The architectural template was introduced in [9] and is a design framework for a configurable vector processing core. The template provides a target for the allocation of input computations and supports performance-enhancing vector computing techniques. The following notation is used for the components included in the VectCore implementation designed for this research:

• L = vector load/store units
• V = vector registers
• A = vector adder units
• M = vector multiplier units
• B = functional unit buses
• Y = vector SAXPY units
• I = vector inner product units

Other floating-point functional units could be defined in the approach. Figure 2 shows a specific example of a template instance. In the VectCore topology, U is the total number of vector functional units and B denotes the number of independent links between input/output pairs in the network. The number of inputs, q, to the VectCore network is q = V + U. Analysis in [1] shows that the wiring complexity for the VectCore is proportional to Bq. Therefore, the VectCore interconnect network performance and cost can be approximately matched to topologies ranging from a bus (B = 1) to a crossbar (B = q) by the selection of the parameter B. The specific instance of the template is found by solving the minimization problem for a given target computation, constrained by the available FPGA resources.
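For concreteness, the parameter vector X and the derived quantities U, q, and the Bq wiring measure can be captured in a small record. Which unit types count toward U is an assumption here (load/store units are excluded), as are the field and property names:

```python
from dataclasses import dataclass

@dataclass
class TemplateParams:
    # Quantities of each VectCore resource type (the vector X).
    L: int  # vector load/store units
    V: int  # vector registers
    A: int  # vector adder units
    M: int  # vector multiplier units
    B: int  # functional unit buses
    Y: int  # vector SAXPY units
    I: int  # vector inner product units

    @property
    def U(self) -> int:
        # Total vector functional units (assumed here to be A + M + Y + I).
        return self.A + self.M + self.Y + self.I

    @property
    def q(self) -> int:
        # Number of inputs to the VectCore interconnect network: q = V + U.
        return self.V + self.U

    @property
    def wiring_complexity(self) -> int:
        # Wiring grows roughly as B*q; B = 1 behaves like a bus, B = q like a crossbar.
        return self.B * self.q

# The Figure 2 instance: 4L, 5V, 1A, 1M, 3B, 1Y, 0I.
fig2 = TemplateParams(L=4, V=5, A=1, M=1, B=3, Y=1, I=0)
```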

3.2 VectCore Design

The VectCore microcode is a representation of the schedule and allocation determined by the minimization algorithm. The format for a microcode word is the VectCore Schedule-Packet (S-PAK). An S-PAK contains a start time and a resource configuration for a schedule event. The size of each S-PAK word is fixed for a given maximum number of resources and number of resource types [1]. The S-PAK design minimizes overhead with a compact and flexible format; for example, the size of an S-PAK is low because the architecture does not require complex routing information or unique bit fields for each resource. Figure 3 shows a block diagram of the interface and S-PAK dispatch architecture. S-PAKs are written by the general-purpose processor to the S-PAK First In First Out (FIFO) buffer interface of the VectCore. The S-PAK router forwards S-PAKs to resource FIFO buffers for each computing resource. When all the S-PAK configurations for a given schedule event have been forwarded, a control word is written to the event FIFO for consumption by the global clock control logic.
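As an illustration of the fixed-width microcode format, an S-PAK word can be packed from a start time plus one small configuration field per resource. The field widths and ordering below are assumptions made for the sketch, not the actual VectCore encoding:

```python
def pack_spak(start_time, resource_cfgs, time_bits=16, cfg_bits=4):
    """Pack one S-PAK: a start time followed by one fixed-width configuration
    field per resource.  The word size depends only on the maximum number of
    resources and the field widths, not on the particular schedule event."""
    word = start_time & ((1 << time_bits) - 1)
    for cfg in resource_cfgs:            # one field per resource, in a fixed order
        word = (word << cfg_bits) | (cfg & ((1 << cfg_bits) - 1))
    return word

# Example: an event at cycle 42 configuring six resources.
spak = pack_spak(42, [3, 0, 1, 0, 2, 1])
print(hex(spak))
```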

Fig. 3: VectCore interface and S-PAK dispatch scheme. (The general-purpose processor writes S-PAKs to the S-PAK FIFO; the router forwards them to per-resource FIFOs for units U1 … Un, with an event FIFO feeding the global clock control.)

Table 1: VectCore (Virtex-II Pro) performance of each tailored implementation for each benchmark problem. Rows are benchmark problems, columns are architecture tailoring targets, and entries are MFLOPS.

Problem | mm1  | mm2  | mm3 | ts1 | ts2 | tass
mm1     | 1989 | 1155 | 400 | 164 | 108 |    0
mm2     |    0 |  496 |   0 |   0 |   0 |    0
mm3     |    0 |    0 |   0 |   0 |   0 |    0
ts1     |  243 |    0 |   0 | 108 |   0 |    0
ts2     |    0 |    0 |   0 | 108 | 149 |    0
tass    |    0 |  267 |   0 |   0 |   0 |  763

The global clock control logic sends a start signal to independent state machines for each resource. Proper execution timing is ensured by including the capability to stall all of the resource pipelines until every resource needed to support a given global clock cycle is ready. Comparisons of the schedule length to the actual schedule execution time measured by hardware counters show that this approach adds very low (3-4%) overhead to the schedule execution time.
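The dispatch and stall behavior can be pictured with a small event-driven model that mirrors the Figure 3 queue structure; the pipeline occupancy and timing details are simplified assumptions rather than the VectCore control design:

```python
from collections import deque

def dispatch(events, busy_until):
    """Simplified model of the dispatch path: events are (start_time, [resource ids])
    control words consumed in schedule order; an event is released only when every
    resource it names is idle, otherwise the pipelines stall for a cycle."""
    event_fifo = deque(sorted(events))           # event FIFO, in schedule order
    cycle, stalls = 0, 0
    while event_fifo:
        start, resources = event_fifo[0]
        cycle = max(cycle, start)
        if all(busy_until.get(r, 0) <= cycle for r in resources):
            event_fifo.popleft()                 # issue start signals to the state machines
            for r in resources:
                busy_until[r] = cycle + 8        # assumed pipeline occupancy, in cycles
        else:
            cycle += 1                           # stall until all needed resources are ready
            stalls += 1
    return cycle, stalls

# Example: three schedule events sharing functional units.
issue_cycle, stalls = dispatch([(0, ["V1", "A1"]), (4, ["A1", "M1"]), (6, ["L1"])],
                               busy_until={})
```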

4. Experimental Design

The VectCore approach is evaluated with a fully operational end-to-end implementation using a Xilinx Virtex-II Pro vp70 FPGA. The VectCore design is also targeted to a Virtex-5 part. The VectCore is tested with matrix multiplication, the back-substitution step of LU matrix decomposition, and a portion of an application tailored to a Cray architecture. Matrix-by-matrix multiplication provides a computation-bound test case with high available parallelism. The back-substitution problem is characterized by a long dependency chain between the operations. The Cray benchmark also has high available parallelism and is memory-bound. A range of problem sizes, operation orderings, and architecture resource sizes is tested to characterize the scaling and degree of tailoring possible with the VectCore approach [1].
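The contrast between the test cases is visible in the back-substitution kernel itself: each solution component depends on all previously computed components, so the outer loop is inherently serial and only the inner update vectorizes well. The following is a generic sketch of the kernel, not the benchmark code used in the experiments:

```python
def back_substitute(U, b):
    """Solve U x = b for upper-triangular U.  The outer loop carries a true
    dependency (x[i] needs every x[j] with j > i), so only the inner update,
    a dot-product/SAXPY-style operation, exposes vector parallelism."""
    n = len(b)
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = b[i]
        for j in range(i + 1, n):     # vectorizable inner update
            s -= U[i][j] * x[j]
        x[i] = s / U[i][i]            # serial step: closes the dependency chain
    return x

# 3x3 example
x = back_substitute([[2.0, 1.0, 1.0],
                     [0.0, 3.0, 1.0],
                     [0.0, 0.0, 4.0]], [7.0, 8.0, 8.0])
```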

5. Results

Table 1 shows the Floating-Point Operations Per Second (FLOPS) performance for each tailored problem implementation running each problem type. The first three rows are matrix multiplication problems with different operation orderings. The next two rows are two different implementations of the back-substitution problem, and the problem labeled "tass" is the Cray application.

For the mm1, mm2, and tass problems, the maximum FLOPS performance occurs for the implementation whose tailoring matches the problem being run. As expected, the long dependency chains and low available parallelism in the back-substitution problems limit the performance of the architecture in comparison to problems such as matrix-by-matrix multiplication. Nevertheless, the VectCore is able to execute such problems effectively and correctly, an important requirement for many applications. The VectCore performance is compared to alternatives including hybrid general-purpose/configurable processor HPC systems, a systolic architecture, a server-grade General-Purpose Processor (GPP) system, and a traditional supercomputer; for additional architecture comparisons see [1]. Table 2 shows the GFLOPS-per-processor performance and the system cost per GFLOPS for the particular workload types and sizes for each alternative (system prices are as reported at initial release, in U.S. dollars, because the literature supporting each non-VectCore option spans several years). The Virtex-II Pro VectCore exceeds the performance of the Cray SV1, and a VectCore substitution for this type of system is the original motivation for this research. The VectCore provides this performance at approximately 17 times lower price per GFLOPS than the SV1. The VectCore targeted to the Virtex-5 yields better GFLOPS performance than previous-generation hybrid HPCs and an AMD GPP running optimized code. Systolic implementations and hybrid HPCs on similar FPGA device families outperform the VectCore. Systolic implementations, however, require a dedicated design effort, and a different computation can easily accrue a similar design effort. A hybrid HPC's supporting resources for the FPGA cores are fixed. Optimization to a new FPGA target may require substantial redesign for either approach. The VectCore approach is general for a class of computations and can be re-targeted easily to newer FPGA families to realize large performance gains without a device-specific optimization. The overhead of the VectCore approach lies in layering a vector architecture on the existing architecture of an FPGA. The amount of FPGA resources not used directly for floating-point computations impacts the FLOPS performance.

Table 2: VectCore performance and price-per-performance comparison to implementation alternatives.

System | Workload | Size | Clock (MHz) | FPGA Device | GFLOPS per proc. | System Price | Price per GFLOPS | FP GFLOPS Limit | % GFLOPS Limit

Hybrid HPC:
Convey HC-1 [6] | Matrix Mult. | order 16K | 150 | Virtex-5 LX330 | 19.0 | $13K | $171 | 14.4 | 132
Cray XD-1 [10] | Matrix Mult. | order 16K | 110 | Virtex-II Pro vp50 | 2.0 | $100K [5] | $8.3K | 8.8 | 23

Direct Hardware Implementation:
Systolic Array [11] | 2D Cavity Flow | 48 x 48 grid | 106 | Stratix-II EP2S180 | 18.0 | $7.7K [12] | $425 | 20.4 | 88

Configurable Vector Processor:
VectCore | Matrix Mult. | order 16K | 133 | Virtex-II Pro vp70 | 1.3 | $7.5K | $5.8K | 14.9 | 9
VectCore | Matrix Mult. | order 16K | 350 | Virtex-5 LX330 | 6.5 | $7.5K | $1.2K | 86.8 | 7

GPP:
AMD (ACML) [10] | Matrix Mult. | order 2K | 2200 | -- | 3.9 | $500 | $128 | -- | --

Legacy Vector Supercomputer:
Cray SV1 [13] | Matrix Mult. | -- | 300 | -- | 1.0 | $375K [14] | $100K | -- | --

The components that dominate the overhead are the vector control for each functional unit and the functional unit bus interconnect. To characterize the performance impact of this overhead, the FLOPS performance using the maximum number of basic floating-point operations (the operation type is problem-dependent) supported by the resources of a given FPGA is used as an upper bound. The last two columns of Table 2 show this maximum theoretical performance and the percentage of this maximum achieved by the configurable options. The VectCore approach achieves the lowest percentage of the theoretical performance, but the analysis does not use functional unit designs optimized for a particular FPGA architecture. Lower per-unit resource usage and higher operating frequencies are possible with device-specific optimized designs [15]. Optimized designs can be incorporated into the existing VectCore framework to improve performance and reduce the overhead of the approach. In addition, the VectCore interconnect is a straightforward, non-optimized design, and its resource usage estimates are pessimistic compared to what is possible using optimizations such as partially connected subnetworks.
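As a concrete reading of the last two columns of Table 2, the achieved fraction is simply the measured GFLOPS divided by the device's theoretical floating-point limit; for the two VectCore rows this reproduces the tabulated percentages:

```python
def percent_of_limit(gflops, fp_gflops_limit):
    # Fraction of the device's theoretical floating-point throughput achieved.
    return 100.0 * gflops / fp_gflops_limit

print(round(percent_of_limit(1.3, 14.9)))   # Virtex-II Pro vp70 VectCore -> ~9%
print(round(percent_of_limit(6.5, 86.8)))   # Virtex-5 LX330 VectCore    -> ~7%
```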

6. Concluding Remarks

The VectCore is compared with a diverse set of contemporary high performance computing alternatives. Problem-specific VectCore implementations are shown to exhibit higher FLOPS performance than several alternatives when targeted to recent FPGA technologies. The VectCore approach is more flexible to design changes than the systolic or hybrid HPC implementations that exhibit higher FLOPS performance. This flexibility balances the utility of the VectCore approach against its inherent overhead, which is also characterized in this work.

References

[1] D. Rutishauser, "Implementing Scientific Simulation Codes Tailored for Vector Architectures Using Custom Configurable Computing Machines," Ph.D. dissertation, Virginia Polytechnic Institute and State University, Blacksburg, Virginia, U.S.A., 2010. [Online]. Available: http://scholar.lib.vt.edu/theses/available/etd-04132011-174232/

[2] S. Craven and P. Athanas, "Examining the Viability of FPGA Supercomputing," EURASIP J. Embedded Syst., vol. 2007, no. 1, pp. 13-13, 2007.
[3] S. Qasim, S. Abbasi, and B. Almashary, "A Proposed FPGA-Based Parallel Architecture for Matrix Multiplication," in Circuits and Systems, 2008. APCCAS 2008. IEEE Asia Pacific Conference on, 2008.
[4] V. Kindratenko and D. Pointer, "A Case Study in Porting a Production Scientific Supercomputing Application to a Reconfigurable Computer," Apr. 2006, pp. 13-22.
[5] internetnews.com, "Cray Unleashes XD1 Opteron/Linux Supercomputer," http://www.internetnews.com/ent-news/article.php/3417221/Cray-Unleashes-XD1-OpteronLinux-Supercomputer.htm (accessed Feb. 5, 2011).
[6] J. Bakos, "High-Performance Heterogeneous Computing with the Convey HC-1," Computing in Science Engineering, vol. 12, no. 6, pp. 80-87, 2010.
[7] J. Yu, C. Eagleston, C. H.-Y. Chou, M. Perreault, and G. Lemieux, "Vector Processing as a Soft Processor Accelerator," ACM Trans. Reconfigurable Technol. Syst., vol. 2, pp. 12:1-12:34, June 2009. [Online]. Available: http://doi.acm.org/10.1145/1534916.1534922
[8] P. Yiannacouras, J. G. Steffan, and J. Rose, "VESPA: Portable, Scalable, and Flexible FPGA-Based Vector Processors," in CASES '08: Proceedings of the 2008 International Conference on Compilers, Architectures and Synthesis for Embedded Systems. New York, NY, USA: ACM, 2008, pp. 61-70.
[9] D. Rutishauser, "Implementing Scientific Simulation Codes Highly Tailored For Vector Architectures Using Custom Configurable Computing Machines," MAPLD International Conference, Washington, DC, U.S.A., 2006 (accessed Jun. 16, 2011). [Online]. Available: http://klabs.org/mapld06/
[10] L. Zhuo and V. Prasanna, "High-Performance Designs for Linear Algebra Operations on Reconfigurable Hardware," Computers, IEEE Transactions on, vol. 57, no. 8, pp. 1057-1071, Aug. 2008.
[11] K. Sano, L. Wang, Y. Hatsuda, and S. Yamamoto, "Scalable FPGA-Array for High-Performance and Power-Efficient Computation Based on Difference Schemes," in High-Performance Reconfigurable Computing Technology and Applications, 2008. HPRCTA 2008. Second International Workshop on, 2008, pp. 1-9.
[12] The Dini Group, "Hardware for ASIC Prototyping & FPGA Systems," http://www.dinigroup.com/pages/3/files/2006-03-30_7000K10PCI_press_release.pdf (accessed Mar. 13, 2011).
[13] Cray Inc., "The Benchmarker's Guide for CRAY SV1 Systems," http://parallel.ksu.ru/ftp/computers/cray/sv1_bmguide.pdf (accessed Mar. 13, 2011).
[14] IDC Inc., "The Cray CX1 Supercomputer: Leveraging the Cray Brand in the HPC Workgroup Market," http://www.cray.com/Assets/PDF/products/cx1/IDC%20whitepaper-CrayCX1.pdf (accessed Mar. 14, 2011).
[15] K. S. Hemmert and K. D. Underwood, "Fast, Efficient Floating-Point Adders and Multipliers for FPGAs," ACM Trans. Reconfigurable Technol. Syst., vol. 3, pp. 11:1-11:30, Sep. 2010. [Online]. Available: http://doi.acm.org/10.1145/1839480.1839481