Implementation of a Reverse Time Migration Kernel using the HCE High Level Synthesis Tool

Tassadaq Hussain∗, Miquel Pericàs∗, Nacho Navarro†, Eduard Ayguadé†∗

∗Barcelona Supercomputing Center, †Universitat Politècnica de Catalunya, Barcelona, Spain
{tassadaq.hussain, miquel.pericas, eduard.ayguade}@bsc.es, nacho@ac.upc.edu

Abstract—Reconfigurable computers have started to appear in the HPC landscape, albeit at a slow pace. Adoption is still being hindered by the design methodologies and slow implementation cycles. Recently, methodologies based on High Level Synthesis (HLS) have begun to flourish and the reconfigurable supercomputing community is slowly adopting these techniques. In this paper we took a geophysics application and implemented it on an FPGA using an HLS tool called HCE. The application, Reverse Time Migration, is an important code for subsalt imaging. It is also a highly demanding code, both computationally and in its memory requirements. The complexity of this code makes it challenging to implement using an HLS methodology instead of HDL. We study the achieved performance and compare it with hand-written HDL and also with software-based execution. The resulting design, when implemented on the Altera Stratix IV EP4SGX230 and EP4SGX530 devices, achieves 11.2 and 22 GFLOPS, respectively. On these devices, the design was capable of achieving up to 4.2x and 7.9x improvement, respectively, over a general purpose processor core (Intel i7).

I. INTRODUCTION

To address the computational requirements introduced by many demanding HPC applications, several research efforts have aimed at implementing HPC applications on different hardware architectures (CPU, GPU and FPGA). Each architecture has its own design methodologies and specific optimizations for performance. FPGAs have traditionally been programmed with languages and methodologies that stem from the Electronic Design Automation (EDA) community. This gap between EDA methodologies and HPC programmers has slowed the adoption of FPGAs in High Performance Computing. Translation of HPC algorithms into hardware is complex and time consuming. Large applications result in big systems with greater chances of error and complexity in the scheduling and reuse of resources. To manage the size and complexity of such applications, system designers are required to perform design modeling, synthesis, and validation at a higher level of abstraction. In this work we attempt to implement a real geophysics application using a High Level Synthesis tool called HCE. The kernel, called Reverse Time Migration (RTM), solves the wave equation (which is a partial differential equation, or PDE) over a series of timesteps. The resulting code consists of a 3D stencil operating on a structured grid, followed by a time integration phase. It is both computationally intensive, relying considerably on single precision floating point operations, and

memory intensive, requiring high bandwidth access to a 3D dataset. This paper presents, to our knowledge, one of the first case studies of a complex HPC application being ported to FPGA using exclusively HLS technology. The main features of our implementation are:
• It is a full system, including the on-chip control units (like the Computing Unit) and both on-chip and off-chip data access units (Memory and PCI controllers).
• It implements a complex 3D memory structure to store 3D tiles that are used for Finite-Difference Time-Domain (FDTD) computing.
• It includes a specialized Cache Unit that efficiently accesses, reuses and feeds data to the computing unit.
This work discusses the architecture of the PDE solver (called Wave Field Computation, WFC) and its implementation on Altera Stratix IV EP4SGX230 and EP4SGX530 FPGAs using the HCE tool. The generated designs for these devices achieve 11.2 and 22 GFLOPS, respectively. This paper is organized as follows. In Section II we give a brief introduction to High Level Synthesis and reference several tools. In Section III we describe the compute-intensive part (WFC) of the RTM kernel. The hardware architecture of the WFC for the HLS tool is presented in Section IV. Section V shows the functional verification of the proposed WFC system. In Section VI, the WFC system is compared with a processor-based design and with hand-written HDL. Different proposals for the RTM kernel are described in the related work (Section VII). Finally, Section VIII summarizes our main conclusions.

II. HIGH LEVEL SYNTHESIS AND THE HCE TOOL

Over the past few years, HLS tools have matured and added the technologies necessary to become truly production-worthy. Initially limited to data path designs, HLS tools have now started to address complete systems, including control logic and complex on-chip and off-chip interconnection. ROCCC 2.0 [1] is a free and open source tool that focuses on FPGA-based code acceleration from a subset of the C language. ROCCC tries to exploit parallelism within the constraints of the target device, optimize clock cycle time by pipelining, and minimize area. Recently Xilinx acquired AutoESL together with its AutoPilot HLS tool [2]. It takes C, C++, or SystemC as input and produces device-specific RTL for Xilinx FPGA devices. The Xilinx ISE and EDK tool chains are used to convert the generated RTL to the final bitstream. Impulse

Accelerated Technologies develops the Impulse C programming language [3], a commercialization of Streams-C [4]. The Impulse C tools comprise a software-to-hardware compiler that translates individual Impulse C processes to hardware and generates the necessary process-to-process interface logic. Handel-C, developed by Celoxica [5], is based on the syntax of the conventional C language. Programs written in Handel-C are implicitly sequential. To exploit the parallelism of the target hardware, Handel-C provides parallel constructs such as pipelined communication and parallel sections. Catapult C, designed by Mentor Graphics [6], is a subset of C++. Code compiled through Catapult C may be general purpose and can result in very different hardware implementations under different timing and resource constraints. The Catapult C environment takes constraints and platform details in order to generate a set of optimal implementations. Ylichron (now PLDA Italia) developed a source-to-source C-to-VHDL compiler toolchain targeting system designers called HCE (Hardware Compiling Environment). The HCE toolchain [7] takes ANSI C as input, which describes the hardware architecture with some limitations and extensions. The HCE design flow consists of a sequence of steps (see Figure 1), with each step transforming the abstraction level of the algorithm into a lower level description. HCE first extracts the Control and Data Flow Graph (CDFG) from the application to be synthesized. The CDFGs describe the computational nodes and the edges between them. HCE provides different methods to explore the CDFG of the input algorithm and generate the data-path structure. The generated data-path structure contains the user-defined number of computing resources for each computing node type and the number of storage resources (registers). The Allocation and Scheduling step maps the CDFG algorithm onto the computing data-path and produces a Finite State Control Machine. The system refine step uses the appropriate Board Support Package and performs synthesis for the communication network. Finally, the VHDL-RTL generation step produces the VHDL files to be supplied to the proprietary synthesis tools for the targeted FPGA.

III. REVERSE TIME MIGRATION

RTM is an advanced seismic data processing method used to image the bottom of salt bodies several kilometers underneath the earth's surface. It is based on simulating the wave equation inside a volume representing the earth subsurface.

for (y = stencil; y < NY − stencil; y++)
  for (x = stencil; x < NX − stencil; x++)
    for (z = stencil; z < NZ − stencil; z++)

      P3(x, y, z) = [ Σ_{l=1..s} w_l1 (P2(x−l, y, z) + P2(x+l, y, z))
                    + Σ_{l=1..s} w_l2 (P2(x, y−l, z) + P2(x, y+l, z))
                    + Σ_{l=1..s} w_l3 (P2(x, y, z−l) + P2(x, y, z+l))
                    + c0 · P2(x, y, z) ] · (V(x, y, z) · dt)²
                    + 2 · P2(x, y, z) − P1(x, y, z)

Fig. 2: Wave Field Computation Mathematical Presentation
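For reference, the update in Figure 2 can be written as the following plain-C sketch. The row-major indexing helper, the array names and the coefficient layout (wx, wy, wz, c0) are illustrative assumptions rather than the HCE source code, and the stencil sum is assumed to scale the squared velocity term as in a standard second-order-in-time FDTD scheme.

#define S 4  /* stencil half-width: 8-point stencil, 25 operands per output */

/* Assumed row-major layout of an NX x NY x NZ volume. */
static inline int idx(int x, int y, int z, int NX, int NZ)
{
    return (y * NX + x) * NZ + z;
}

void wfc_reference(const float *P1, const float *P2, const float *V, float *P3,
                   int NX, int NY, int NZ,
                   const float wx[S + 1], const float wy[S + 1],
                   const float wz[S + 1], float c0, float dt)
{
    for (int y = S; y < NY - S; y++)
        for (int x = S; x < NX - S; x++)
            for (int z = S; z < NZ - S; z++) {
                /* 3D stencil: three symmetric FIR sums plus the central point */
                float acc = c0 * P2[idx(x, y, z, NX, NZ)];
                for (int l = 1; l <= S; l++) {
                    acc += wx[l] * (P2[idx(x - l, y, z, NX, NZ)] + P2[idx(x + l, y, z, NX, NZ)]);
                    acc += wy[l] * (P2[idx(x, y - l, z, NX, NZ)] + P2[idx(x, y + l, z, NX, NZ)]);
                    acc += wz[l] * (P2[idx(x, y, z - l, NX, NZ)] + P2[idx(x, y, z + l, NX, NZ)]);
                }
                /* Time integration: P3 = (V*dt)^2 * stencil + 2*P2 - P1 */
                float v_dt = V[idx(x, y, z, NX, NZ)] * dt;
                P3[idx(x, y, z, NX, NZ)] = v_dt * v_dt * acc
                                         + 2.0f * P2[idx(x, y, z, NX, NZ)]
                                         - P1[idx(x, y, z, NX, NZ)];
            }
}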

RTM works by running the Wave Field Equation forward in time for the source and backward in time for the receiver. In RTM, both the forward and backward propagation parts of the method utilize the same PDE solver (WFC). The forward migration spends 90% of its execution time in the WFC kernel, while for the backward migration it is about 70%. This shows the importance of accelerating this part of the code. The RTM implementation used in this work is based on an explicit 3D high-order Finite Difference numerical scheme for the Wave Field. In this paper we attempt to design a WFC system using the HCE HLS tool and execute it on FPGAs. The mathematical representation of the WFC system is shown in Figure 2. The WFC system deals with four volumes, P1, P2, V and P3, where P1, P2 and V are input volumes and P3 is the output volume. Only the P2 volume is used in the 3D-Stencil (FDTD method). The WFC equation (see Figure 2) shows that the P2 volume has a complex data access pattern and a lot of computation. For further explanation, we divide this section into two parts:
• Integration over time
• 3D-Stencil

A. Integration over time
The Integration over time unit (shown in blue in Figure 2) inputs data streams from volumes P1, P2 and V, and from the output of the Stencil Unit. This unit requires few computations and uses two constant variables.

B. 3D-Stencil

Fig. 1: HCE Tool Flow

Stencils are used in numerous scientific applications like computational fluid dynamics, geometric modeling, and image processing. These applications are often implemented using iterative finite-difference techniques that sweep over a two-dimensional or 3D grid, performing stencil computations on the nearest neighbors of the current point in the grid. In a stencil operation, each output point is updated with a weighted contribution from a subset of its neighbors. The

Fig. 4: (a) Stencil 3D Internal Memory Buffer (b) (3 × (n × 2) + 1) n-Stencil (c) (N × (1 + (n × 4)) + (n × 2)) N-Stencil Vector (d) Load and Update Unit Access Pattern

3D-Stencil Computation unit consists of three Symmetric Finite Impulse Response (FIR) filters. For each stencil output, the 8-point 3D-Stencil unit reads 25 operands from the three axes of the P2 volume. In addition, 14 constant operands are used. All these operands are single precision floating point values, i.e., 4 bytes/operand. The 3D-Stencil compute unit needs special care due to its irregular access pattern. The 3D-Stencil computation increases the complexity not only because of the large number of computations but also because of the sparse data access pattern arising from the volume being laid out linearly in memory.

IV. WFC ARCHITECTURE DESIGN

In this section we describe the design of the Wave Field Computation (WFC) system using the HLS toolchain. The HCE design environment is used to develop the WFC system for the FPGA target architecture. Before the start of development we need to consider not only the underlying system and memory architecture but also the recommended coding style for synthesis. When designing the WFC system, the 3D-Stencil memory access pattern is a primary concern. We therefore propose a WFC architecture that is compatible with both the 3D-Stencil system behavior and the HCE High Level Synthesis compiler. The WFC architecture, presented in Figure 3, is based on a scratchpad 3D-memory approach in which the volumes are accessed in tiles. The architecture is divided into four main blocks:
• 3D-Stencil Memory
• Memory Manager
• Wave Field Computation (WFC)
• Main Memory Manager

Fig. 3: Wave Field Computation Architecture

1  type array[Mx][My][Mz]..[Mn] /* #HWST split D1 N1, .., Di Ni */;
2  // Stencil 3D-Memory Model for Single Volume
3  float P2[Nx][Ny][Nz] /* #HWST split 2 Ny */
4  // Splitting 2nd Dimension of Memory into Ny (Ny = 32)

Fig. 5: HCE Memory Layout Design Style

A. 3D-Stencil Memory
External memory access time can have a significant impact on the performance of the system. It can limit the ability to pipeline the design and thus negate the benefits gained by loop unrolling. A register-based memory architecture enables HLS concepts such as loop unrolling, pipelining, and others to maximize parallelism. The HCE tool provides directives used to describe and interact with the memory architecture. Some directives are presented in Figure 5. Line#1 shows the generic directive to describe a multidimensional custom memory architecture. Comments starting with /*#HWST are special synthesis directives for the compiler. In this case, Di represents the dimension of the M array and Ni is the number of blocks into which the Di dimension is split. Mn represents the size and dimensions of the array; it must be a power of two. Memory access for the 3D-Stencil compute unit is the main bottleneck due to its irregular data arrangement. To manage this bottleneck we propose an internal 3D-Memory architecture. The proposed 3D-Memory architecture provides register (cache) based data access to the computation unit. The memory architecture, described in Figure 5 Line#3, contains a 3D memory (Nx, Ny, Nz) arrangement. The structure of the 3D-Stencil memory is shown in Figure 4 (a). Here the Y-axis corresponds to the planes (Ny), the Z-axis represents the columns (Nz) of a plane, and the X-axis indicates the number of rows (Nx) of a plane. A serial link sl is used to feed streaming data from the Main Memory unit to the 3D-Memory unit. In the implementation we selected the extended base volume size (Nx, Ny, Nz) such that Nx=Ny=Nz=32. Ny is the number of planes; in the current architecture it is a fixed number. Nx and Nz are the sizes of each plane. They need to be selected to fit in one BRAM of the target device. Depending upon

the available BRAM size, (Nx, Nz) can be adjusted. HCE provides directives to adjust the size of a plane (Nx, Nz) without affecting the purpose of the design. In the 3D-Stencil Memory there are 32 planes, and each plane has two ports to access data. The proposed architecture can process large datasets by dividing them into 3D tiles.

B. Memory Manager
The Memory Manager unit is used to increase the computation versus memory access ratio by vectorizing the access pattern and reusing data. The generic stencil structure is shown in Figure 4 (b). When n=4, 25 points are required to compute one central point. This means the computed points/accessed points (c/a) ratio is 0.04. To improve the (c/a) ratio, the Memory Manager unit is deployed to increase data reuse by feeding data efficiently to the computation engine. The Memory Manager accesses multiple stencils from the input volume in the form of a stencil vector. This is shown in Figure 4 (c). (N × (1 + (n × 4)) + (n × 2)) represents the size of a single 3D-Stencil vector, where N represents the number of planes. Since the 3D-Stencil's output is generated by processing consecutive neighboring points in 3 dimensions, a large amount of data can be reused with efficient data management. To achieve efficient data access and high reuse, the Memory Manager is further divided into three units: the Load Unit, the Update Unit and the Reuse Unit.
1) Load Unit: The Load Unit accesses the 3D-Stencil vector at the start of each row of the 3D-Memory volume. For the following vector accesses, the Load Unit transfers control to the Update Unit. The Load Unit's data access pattern is shown in Figure 4 (d). For a single 3D-Stencil volume, the number of accessed points depends on the number of planes and the stencil size. The proposed 3D-Stencil memory architecture has 24 base planes and 25 points are required to compute one stencil point. In this case, a single stencil vector needs 600 points. The Load Unit reduces this number by reusing points in the y-dimension. These points (point_y) are reused when they are part of a neighboring plane of the stencil vector. The number of Load Unit points is given in Equation 1, where Ghost Points_z refers to the points of the extended base volume and Point_c indicates the central point.

Load Unit Points = (Planes × (Point_x + Point_z + Point_c)) + Ghost Points_z   (1)

2) Update Unit: After reading the first stencil vector, the Memory Manager shifts control to the Update Unit. For each new access, the Update Unit slides the stencil vector along the z-direction and accesses further points while reusing neighboring points. The number of points required by the Update Unit for a stencil vector is given in Equation 2.

Update Unit Points = (Planes × (Point_x + Point_c)) + Ghost Points_z   (2)

3) Reuse Unit: After accessing the first row, the Reuse Unit accesses the rest of the volume. This unit accesses only the central point and generates a stencil vector while reusing the existing rows and columns, as given in Equation 3.

Reuse Unit Points = (Planes × Point_c) + Ghost Points_z   (3)
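To make these counts concrete, the following small C sketch evaluates Equations 1-3 for the configuration described above (24 base planes and an 8-point stencil with n = 4, so that Point_x = Point_z = 8, Point_c = 1 and there are 8 ghost points in z). The variable names are illustrative; under these assumptions the sketch reproduces the per-vector point counts reported in Table I.

#include <stdio.h>

int main(void)
{
    const int planes  = 24;     /* base planes in the 3D-Stencil Memory   */
    const int n       = 4;      /* stencil half-width (8-point stencil)   */
    const int point_x = 2 * n;  /* stencil points along x (per plane)     */
    const int point_z = 2 * n;  /* stencil points along z (per plane)     */
    const int point_c = 1;      /* central point                          */
    const int ghost_z = 2 * n;  /* ghost points of the extended volume    */

    int load   = planes * (point_x + point_z + point_c) + ghost_z; /* Eq. 1 */
    int update = planes * (point_x + point_c) + ghost_z;           /* Eq. 2 */
    int reuse  = planes * point_c + ghost_z;                       /* Eq. 3 */

    printf("Load: %d  Update: %d  Reuse: %d\n", load, update, reuse);
    /* Prints: Load: 416  Update: 224  Reuse: 32, matching Table I. */
    return 0;
}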

The data reuse ratio of the Memory Manager unit is the ratio between the total number of required points and the total number of accessed points. The Memory Manager is designed to target a (c/a) ratio equal to 1. This means that, for the 8-point 3D-Stencil unit, the Memory Manager uses a single input point 25 times before discarding it from the internal memory. In practice this is not achievable due to the ghost points (i.e., points belonging to a neighboring tile that are necessary for the current computation) present on the boundaries of the input volume. The Memory Manager improves the data reuse ratio for large base volumes. HCE synthesis results for the Memory Manager unit are shown in Table I. The "Single Stencil Vector" columns show the points and clock cycles consumed by each unit when accessing a single stencil vector. The "3D-Stencil Volume" columns show the total number of points and clock cycles consumed to access a complete 3D-Stencil Memory input base volume (24x24x24). The "Signals" column shows the number of local registers used to store temporary data; the Memory Manager efficiently reuses the data in these registers.

TABLE I: Memory Manager Unit HCE synthesis report

Unit           | Single Stencil Vector (Points / Clock Cycles / Signals) | 3D-Stencil Volume (Points / Clock Cycles)
Load Unit      | 416 / 12 / 753   | 416 / 12
Update Unit    | 224 / 8 / 423    | 5152 / 184
Reuse Unit     | 32 / 2 / 78      | 17664 / 1104
Memory Manager | - / - / 1047     | 23232 / 1300

C. Wave Field Computation (WFC)
The WFC is designed to take 3 input volumes (P1, P2, V) and generate an output volume (P3). This part is further divided into three units:
• 3D-Stencil Compute Unit
• Time Integration Unit
• Resource Scheduler
1) 3D-Stencil Compute Unit: The 3D-Stencil Compute Unit is the most complex unit of the RTM core due to its intensive computation and complex data access pattern. The input stencil volume needs to be extended at its boundaries to access the stencil vector on all points lying on the boundaries of the input volume. In order to accommodate the ghost points of the input volume, the core accepts base volumes extended by four points in each dimension (called Extended Base Volumes). Since the size of the 3D-Stencil Memory structure is 32×32×32, the extended base volume (Nx, Ny, Nz) uses the same volume size for the compute unit. For a clearer picture, the convention used for the axes of the volume is shown in Figure 4 (b). In this figure the Y-axis corresponds to the planes in the volume, the X-axis refers to the rows in the planes, and the Z-axis represents the points in a row. It is clear from Figure 4 (b) that, in the case n=4, the 3D-Stencil computation requires 24 operands (8 operands from each axis) to compute one output point. In addition to these 24 operands, an operand corresponding to the central point is also required for all non-symmetric or odd symmetric configurations. Since our 3D-Stencil Unit is designed to compute on single precision floating point data, the core needs 25 input data elements totaling 100 bytes (25 × 32/8) for each

computed point. It also needs to perform 13 multiplications and 24 additions to compute one output point in the odd symmetric stencil configuration. The 3D-Stencil Compute Unit can work at a maximum frequency of 200 MHz. A single 3D-Stencil Compute Unit takes 47 cycles to produce an output. Depending upon the FPGA resources and the HLS scheduler we can instantiate multiple stencil compute units in a Compute Engine.
2) Time Integration Unit: This unit takes the computed output from the stencil unit and applies the time integration computation to the P1, P2 and V volumes. The CDFG of the WFC engine is shown in Figure 6. The access pattern for the P1 and V volumes is straightforward: both volumes are accessed with a regular pattern, so we used a streaming interface to feed them to the compute unit. The unit performs 3 multiplications, 2 additions and one subtraction to compute one output point.
3) Resource Scheduler: This section deals with the exploitation of parallelism in the proposed architecture. Once the basic computation and memory units have been completed we focus on the resource allocator and scheduler. We use the unrolling technique to fully pipeline the vectors of the 3D-Stencil. As described in the main architecture, the Memory Manager takes care of feeding data to the WFC engine. It inserts registers in the design to provide pipelined data. For a given input volume (32x32x32), the Memory Manager takes approximately 1300 clock cycles to feed the complete volume to the WFC engine. To compute this volume, the fully pipelined WFC engine consumes 13886 clock cycles (including a 63 clock cycle latency for the first output). Because the Memory Manager transfers volumes faster than the compute unit can consume them, we propose to instantiate multiple compute units to efficiently utilize the Memory Manager. The number of compute engines required for maximum overlapping is given by:

WFC Clock Cycles / Memory Manager Clock Cycles = 13887 / 1300 ≈ 10

After allocating the compute units, the HCE scheduler efficiently maps them onto hardware blocks. GFLOPS are calculated by the formula shown in Figure 7. The calculated GFLOPS for each compute unit are presented in Table II. The efficiency (see Table II) reflects the HCE scheduler's ability to best utilize the allocated hardware resources. The column labeled Hardware Efficiency is the achieved hardware utilization in the proposed architecture using the HCE scheduling methodology.

GFLOPS = Frequency × (Output Points / # of Clock Cycles) × (Floating Point Operations / Output Point)

Fig. 7: Wave Field Computation GFLOPS

TABLE II: Achieved GFLOPS by 3D-Stencil Engine

Unit Name    | Floating Point Operations per Output | Number of Clock Cycles | Hardware Efficiency | Achieved GFLOPS @ 100 MHz Clock
Stencil Unit | 37 | 5152 | 0.9  | 9.93
WFC Engine   | 43 | 5272 | 0.87 | 11.2
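As a sanity check, the sketch below plugs the Table II numbers into the Figure 7 formula, assuming a 24×24×24 output base volume (13824 output points), and also recomputes the compute-engine count derived above. The variable names are illustrative only.

#include <stdio.h>

int main(void)
{
    const double freq       = 100e6;            /* operating frequency (Hz)      */
    const double out_points = 24.0 * 24 * 24;   /* output points per base volume */

    /* Figure 7: GFLOPS = freq * (output points / clock cycles)
     *                         * (floating point operations / output point)      */
    double stencil = freq * (out_points / 5152.0) * 37.0 / 1e9;
    double wfc     = freq * (out_points / 5272.0) * 43.0 / 1e9;
    printf("Stencil Unit: %.2f GFLOPS\n", stencil);  /* ~9.93, as in Table II    */
    printf("WFC Engine:   %.2f GFLOPS\n", wfc);      /* ~11.2, as in Table II    */

    /* Compute engines needed to overlap with the Memory Manager (~10) */
    printf("Compute engines: %.1f\n", 13887.0 / 1300.0);
    return 0;
}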

Final WFC System: After checking the functionality and performance of each block we completed a stable WFC System that can be used as an object (IP/BlackBox). The final design of the WFC System is shown in Figure 8 (a); it operates on (24×24×24) base volumes. To fully utilize the resources, the WFC system uses a double buffering scheme. In this scheme 3 WFC units are instantiated, operating at a 100 MHz clock frequency. Table III lists the resources of the WFC system.

TABLE III: A stable Wave Field Computation unit

Number of WFC Units | Memory Manager Units | Double Buffer Input Volume | Double Buffer Output Volume | Operating Frequency (MHz)
3 | 1 | 2×32×32×32 | 2×24×24×24 | 100

Fig. 6: Partial Differential Wave Equation CDFG

D. Main Memory Engine
This unit contains the Main Memory controller that accesses tiled data from the Main Memory (Physical Memory) and delivers it to the 3D-Memory buffers, and vice versa. The dataset in the Main Memory has dimensions (MX, MY, MZ). Depending on the size of the internal buffer memory, the Memory Controller unit partitions the physical memory into a number of tiles:

Number of Tiles = (MX × MY × MZ) / (Nx × Ny × Nz)

Here M and N represent the main dataset and buffer sizes, respectively, while (x, y, z) refers to the dimensions of the volume. To improve parallelism, two separate units are designed to generate tile addresses for the input and output buffers:
• The Tile Read Unit
• The Tile Write Unit
The basic idea of tiling is to partition the physical data into tiles which are then processed automatically, as shown in Figure 8 (b). The read_tile() and write_tile() engines are created inside the Main Memory unit; they read/write a single tile of size (Nx, Ny, Nz) from/to the main memory (MX, MY, MZ), as sketched below. By using a double buffering scheme, the tiling operation can be performed in parallel with the compute engine. As address generation is done in hardware, we need a high speed source-synchronous streaming memory controller that can provide high speed data. The WFC engine deals with four volumes: P1, P2, V and P3. The bandwidth required for each volume is calculated using Equation 4.

BW = (Base Volume Size (bytes) × Operating Frequency) / Compute Unit Clock Cycles   (4)
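A minimal host-side view of that tiled, double-buffered flow is sketched here in C, simplified to a single input and a single output volume. The read_tile(), write_tile() and wfc_compute() signatures are placeholders standing in for the hardware engines, not the HCE interface; in hardware the prefetch of the next tile overlaps with the computation of the current one, whereas this sequential sketch only shows the ordering.

#define NX 32
#define NY 32
#define NZ 32

/* Placeholder declarations standing in for the hardware engines. */
void read_tile(const float *main_mem, float tile[NX][NY][NZ], int tile_id);
void write_tile(float *main_mem, const float tile[NX][NY][NZ], int tile_id);
void wfc_compute(const float in[NX][NY][NZ], float out[NX][NY][NZ]);

void process_volume(const float *P2_main, float *P3_main, int num_tiles)
{
    static float in_buf[2][NX][NY][NZ];    /* double-buffered input tiles  */
    static float out_buf[2][NX][NY][NZ];   /* double-buffered output tiles */

    read_tile(P2_main, in_buf[0], 0);      /* prefetch the first tile      */
    for (int t = 0; t < num_tiles; t++) {
        int cur = t & 1;
        if (t + 1 < num_tiles)             /* fetch next tile (overlapped in hardware) */
            read_tile(P2_main, in_buf[cur ^ 1], t + 1);
        wfc_compute(in_buf[cur], out_buf[cur]);
        write_tile(P3_main, out_buf[cur], t);
    }
}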

Table IV shows the bandwidth required by the WFC system for each volume. Using double buffering for the P2 volume, the WFC System requires 5272 clock cycles to read the memory volume.

Fig. 8: (a) Stable Wave Field Computation System, (b) Load Balancing using parallel tiling, (c) WFC System for Altera GX530 FPGA

TABLE IV: Bandwidth Required by the WFC Engine @ 100 MHz

Volume           | Size          | Bytes  | GByte/s
P1               | 24×24×24      | 55296  | 1.04
P2               | 32×32×32      | 131072 | 2.49
V                | 24×24×24      | 55296  | 1.04
P3               | 24×24×24      | 55296  | 1.04
Total WFC Engine | P1, P2, V, P3 | 296960 | 5.63
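The per-volume figures in Table IV follow from Equation 4; the short check below recomputes them at 100 MHz with the 5272-cycle compute time reported above and reproduces the Table IV values to within rounding (a verification sketch, not part of the design).

#include <stdio.h>

int main(void)
{
    const double freq   = 100e6;    /* operating frequency (Hz)             */
    const double cycles = 5272.0;   /* compute unit clock cycles per volume */
    const double bytes[4] = { 55296, 131072, 55296, 55296 }; /* P1, P2, V, P3 */
    const char  *name[4]  = { "P1", "P2", "V", "P3" };

    double total = 0.0;
    for (int i = 0; i < 4; i++) {
        /* Equation 4: BW = base volume size (bytes) * frequency / cycles */
        double bw = bytes[i] * freq / cycles / 1e9;
        total += bw;
        printf("%-2s: %.2f GByte/s\n", name[i], bw);
    }
    printf("Total: %.2f GByte/s\n", total);  /* ~5.63 GByte/s */
    return 0;
}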

Volume P2 consumes more bandwidth due to the presence of the ghost points at the boundaries of the volume. Therefore, for each new volume, 8 planes (P(N−7) to P(N)) are reused from the previous volume. In this way 25% of the bandwidth is saved.

V. FUNCTIONAL VERIFICATION

High-level functional verification provides a substantial decrease in test generation and application test time. The debug and verification facilities of the HCE tool increase fault coverage. In addition, they also provide details about the area/delay overheads of the design. In this section we test and verify the WFC system.

#define NX 32
#define NY 32
#define NZ 32
#define MX 64
#define MY 64
#define MZ 64
#define num_tile ((MX*MY*MZ)/(NX*NY*NZ))

float WFC_Engine(float P1, float V,
                 float P2_SinglePort[MX][MY][MZ],
                 float P3_SinglePort[MX][MY][MZ])
{
    float P2_0[NX][NY][NZ] /* #HWST split 2 NY */;
    float P2_1[NX][NY][NZ] /* #HWST split 2 NY */;
    float P3_0[NX][NY][NZ] /* #HWST split 2 NY */;
    float P3_1[NX][NY][NZ] /* #HWST split 2 NY */;
    for (int i = 0; i
