Fast Generation of High Throughput Customized Deep Learning Accelerators on FPGAs

Hanqing Zeng
Ming Hsieh Department of Electrical Engineering
University of Southern California
Los Angeles, California 90089
[email protected]

Chi Zhang
Department of Computer Science
University of Southern California
Los Angeles, California 90089
[email protected]

Viktor Prasanna
Ming Hsieh Department of Electrical Engineering
University of Southern California
Los Angeles, California 90089
[email protected]

Abstract—Accelerating CNNs has been an active area of research. Research on GPUs has led to several well-developed open-source tools such as CAFFE and TensorFlow. However, for FPGA accelerators, such design automation tools are not yet available. We propose an automatic code generation tool that synthesizes high throughput accelerators for CNN inferencing, targeting a broad range of CNNs and FPGAs. The tool takes as input a high-level description of the CNN model and the target FPGA device, and generates fully synthesizable Verilog as output. The tool adopts an algorithm-architecture co-design methodology based on frequency domain convolution. Our proposed Concatenate and Pad (CaP) algorithm, together with our efficient design space exploration, ensures design modularity and scalability (in terms of routing complexity and tool execution time). Users can optionally customize various design parameters, such as the FFT sizes and the hardware resources to be used. The tool optimizes throughput for the user-specified hardware. To illustrate the tool, we generate optimized designs for AlexNet, VGG16 and variations of them (AlexNet* and VGG16*). Experimental results show that for inferencing on these models, throughputs of 274.5 GOPS, 660.9 GOPS, 283.2 GOPS and 623.0 GOPS are achieved on the Intel HARP (version 0) platform. The throughput of the AlexNet and VGG16 designs outperforms state-of-the-art FPGA implementations by 1.85× and 3.53× respectively. The tool is delivered as a Python3 package, and is easily portable onto various computing platforms. Experiments on a variety of CNNs and target FPGA devices show that the tool runs in less than 20 seconds on a commodity desktop.

Index Terms—Convolutional Neural Networks; FPGA; Design Automation; Fast Fourier Transform

I. INTRODUCTION

(This work was supported by the US NSF under grants CNS-1643351 and ACI-1339756. This work is also supported in part by Intel Strategic Research Alliance funding. An equipment grant from the Intel Hardware Accelerator Research Program is gratefully acknowledged.)

Convolutional Neural Networks (CNNs) have shown major success in a wide range of fields including computer vision and natural language processing [1], [2]. To meet the high accuracy requirements of various applications, CNN models are becoming deeper and larger, and are evolving at a fast pace. Thus, a key problem is to develop high performance accelerators for a given model with a short turnaround time.

Using GPUs to speed up the training and inference of CNNs has been extensively studied. Efforts over the years have led to well-developed open-source tools such as CAFFE [3] and

TensorFlow [4], helping numerous end users. On the one hand, such tools speed up the computation of existing CNN models. On the other hand, they facilitate the development of next generation CNN models. As a result, high quality tools greatly promote the progress of the deep learning field.

Apart from GPUs, the FPGA has also become an attractive platform for CNN acceleration [5]. Energy efficiency, reconfigurability and high logic density are the unique advantages of FPGAs [6] in this context. Although much work [7], [8] has been done on FPGA accelerators for CNN inferencing, there is still no tool which automates the time-consuming manual hardware design and meets all three of the following criteria: 1) generating designs in a highly automated way without user intervention; 2) sustaining high performance for various FPGA devices and CNN models; 3) generating designs in a short time.

The work in [9] proposed a framework for accelerating DNNs by generating Verilog. It features high generality through its ISA and macro-dataflow virtual machine. However, in that approach, performance is sacrificed for generality. The work in [5] proposed a general approach to map the nested loops of convolution onto a target FPGA with a given resource budget. It does not generate synthesizable code, and the design space exploration is time-consuming. Other works [10], [11] proposed accelerators using the Winograd algorithm and frequency domain convolution respectively. However, no tools are provided.

We develop a design automation tool using an algorithm-architecture co-design methodology. Performance and design generality are realized by our Concatenate and Pad (CaP) algorithm designed for a modular FFT architecture [12]. Tool execution time is shortened by our efficient design space exploration [13]. In addition, scalability in terms of wire routing complexity is ensured by the local connections between hardware modules. The inputs to the tool are configuration files specifying the CNN model (dimensions of each convolution layer and kernel filter) and the target FPGA device (external bandwidth, on-chip memory size and logic resources). The output of the tool is fully synthesizable Verilog code. The main contributions of this work are:

• We develop an automatic code generation tool targeting a wide range of CNNs and FPGAs. The tool generates high throughput, modular designs in synthesizable Verilog.







• We develop an efficient design space exploration algorithm which decomposes the large design space of a complete CNN into multiple small design spaces of individual layers, enabling fast execution of the tool.

• We extensively test our tool with 4 large scale CNNs and 7 sets of baselines. Our tool consistently generates high-throughput designs. The tool-generated accelerators improve the throughput of state-of-the-art FPGA implementations by 1.85× and 3.53× for inferencing on AlexNet and VGG16 respectively, on the Intel HARP [14] (version 0) platform.

II. BACKGROUND

A CNN is constructed by stacking various layers together. Typical designs include four types of layers: convolution layers, rectified linear unit (ReLU) layers, pooling layers and fully connected layers. Different layers extract different kinds of features out of the input image. Classification is performed based on the high dimensional features. Our tool performs optimization on convolution layers based on frequency domain convolution. We apply techniques including Overlap and Add (OaA) [15] and Concatenate and Pad (CaP) [13] to achieve hardware modularity and reduced computation complexity.

A. Frequency Domain Convolution

Let I (dimension: f_in × l_img × l_img) be the input matrix to a convolution layer, and let K (dimension: f_out × f_in × l_kern × l_kern) be the kernel filter of that layer. The convolution layer performs the convolution operation over the last two dimensions (l_img, l_kern) of I and K, and accumulates over f_in. As a result, the output of the layer has dimension f_out × l'_img × l'_img. Spatial convolution performs a sliding window operation of the kernel matrix over the image matrix. Alternatively, a more efficient algorithm performs the convolution in the frequency domain: convolution in the spatial domain is equivalent to the Hadamard product (∘) in the frequency domain. The algorithm is summarized by Equation 1, where F and F⁻¹ denote the Fourier transform and its inverse:

    I ∗ K = F⁻¹( F(I) ∘ F(K) )    (1)

The matrices I and K need to be padded to the same size before the Fourier transform, so that the inputs to the Hadamard product are valid. Since l_img can be large for the first few convolution layers, computing the FFT on the full l_img × l_img matrix I is not always practical. This problem is solved by the OaA technique.

B. Overlap and Add (OaA)

To avoid operating on matrices with large l_img, the original image matrices are partitioned into smaller tiles. After applying the operation of Equation 1 to each tile and kernel, the resulting matrices are combined to form the final output [15]. The following describes the procedure of computing the convolution using the Overlap and Add (OaA) technique. Given an l_img × l_img input image I and an l_kern × l_kern kernel K, we convolve them using N-point 2D FFT units (subject to N ≥ l_kern).

First, we partition I into tiles T^in_{i,j} of size l_tile × l_tile (where l_tile + l_kern − 1 = N). Then, after zero padding to size N × N, we compute the intermediate output tiles T^out_{i,j} using Equation 2. To obtain the final matrix R, we place each T^out_{i,j} so that its pixel (0, 0) overlaps with pixel (i·l_tile, j·l_tile) of R. Each pixel of R is the sum of the corresponding pixels of the T^out_{i,j}. This operation is summarized by Equation 3:

    T^out_{i,j} = F⁻¹( F(T^in_{i,j}) ∘ F(K) )    (2)

    R[p][q] = Σ_{i,j} T^out_{i,j}[p − i·l_tile][q − j·l_tile],
      where 0 ≤ p − i·l_tile < l_tile and 0 ≤ q − j·l_tile < l_tile    (3)

The square brackets [∗][∗] indicate the pixel index within a matrix. All indices i, j, p, q start from 0. OaA applied to frequency domain convolution reduces the complexity of spatial convolution from O(l_img² · l_kern²) to O(l_img² · log l_kern) [16] for computation on a single image. This ensures high performance of our tool-generated designs.

C. Concatenate and Pad (CaP)

CaP is a dual operation of OaA. Instead of partitioning a large image into tiles, CaP concatenates multiple small images into a large one, so that convolution is performed on the concatenated image. Details of this technique are described in [13]. Figure 1 shows an example of CaP. The concatenation is achieved by folding the Batch dimension. Let d be the Batch folding factor; then d² images in the original batch are concatenated to form one large image, and the new batch size for the concatenated images is Batch/d². For the example in Figure 1, d = √4 = 2. To ensure correctness of the convolution, padding of (l_kern − 1) pixels should be added between adjacent images to eliminate aliasing. The size of the concatenated image l^CaP_img is thus calculated as l^CaP_img = d·l_img + (d − 1)·(l_kern − 1).
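For reference, the OaA procedure can be checked against a minimal single-channel NumPy sketch of Equations 1–3. The function below is our own illustration, not the tool's implementation (the tool maps these steps onto FFT hardware); it assumes square inputs, as in the discussion above.

    import numpy as np

    def oaa_conv2d(image, kernel, N):
        """Full linear 2D convolution of `image` with `kernel` via
        Overlap-and-Add using N-point 2D FFTs (Equations 1-3)."""
        l_img, l_kern = image.shape[0], kernel.shape[0]
        assert N >= l_kern
        l_tile = N - l_kern + 1                       # l_tile + l_kern - 1 = N
        R = np.zeros((l_img + l_kern - 1,) * 2)       # final output matrix
        K_f = np.fft.fft2(kernel, s=(N, N))           # kernel FFT, computed once
        n_tiles = -(-l_img // l_tile)                 # tiles per dimension
        for i in range(n_tiles):
            for j in range(n_tiles):
                tile = image[i*l_tile:(i+1)*l_tile, j*l_tile:(j+1)*l_tile]
                T_f = np.fft.fft2(tile, s=(N, N))     # zero-pads tile to N x N
                T_out = np.fft.ifft2(T_f * K_f).real  # Equation 2
                r0, c0 = i * l_tile, j * l_tile       # tile offset in R
                h = min(N, R.shape[0] - r0)
                w = min(N, R.shape[1] - c0)
                R[r0:r0+h, c0:c0+w] += T_out[:h, :w]  # Equation 3: overlap-add
        return R

The result matches a direct spatial convolution (e.g., scipy.signal.convolve2d(image, kernel)) up to floating point error. Under CaP, d² images would first be concatenated, with (l_kern − 1) pixels of zero padding between them, into one image of size l^CaP_img, which is then tiled by the same procedure.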

D. CNN Models

ImageNet competitions in recent years have led to many well-designed CNNs for image classification [1], [2]. Popular designs use multiple convolution layers which gradually extract features from local to global scale when proceeding to deeper layers. The computation is performed on data of high dimension. An additional Batch dimension is specified for batch processing of input images. From the models shown in Table I, we observe how the networks extract more and more features (larger f_in, f_out) as images are transformed to lower and lower resolution (smaller l_img). The large variation of layer parameters imposes major challenges on accelerator design. These challenges are carefully dealt with by our tool.

TABLE I: Details of convolution layers. AlexNet has 5 convolution layers. VGG16 has 13 convolution layers, divided into 5 groups; the VGG16 specifications below are per group.

Conv.  | Image Size l_img    | Kernel Size l_kern | Input Features f_in
Layer  | AlexNet  | VGG16    | AlexNet | VGG16    | AlexNet | VGG16
-------|----------|----------|---------|----------|---------|------
  1    | 224×224  | 224×224  | 11×11   | 3×3      |   3     |   3
  2    | 55×55    | 112×112  | 5×5     | 3×3      |  96     |  64
  3    | 27×27    | 56×56    | 3×3     | 3×3      | 256     | 128
  4    | 13×13    | 28×28    | 3×3     | 3×3      | 384     | 256
  5    | 13×13    | 14×14    | 3×3     | 3×3      | 384     | 512

Fig. 1: Illustration of the CaP algorithm. CaP adds padding (white space) between adjacent images to avoid aliasing. [Figure: four images (1–4), each of size l_img, are concatenated with padding into one image of size l^CaP_img.]

III. AUTOMATIC DESIGN GENERATION

Our tool automates the hardware design generation procedure by integrating our CaP-OaA optimization into an efficient design space exploration algorithm. The tool is designed to handle a wide range of CNNs and FPGAs. Users can also define a customized design space by specifying various algorithmic and architectural parameters. Section III-A gives an overview of the design flow and operation modes of the tool. Sections III-B, III-C and III-D describe the Algorithmic Optimization Engine, the Architectural Optimization Engine and the Code Generation Engine. Section III-E specifies the user interface.

A. Design Flow and Operation Modes

Figure 2 summarizes the design flow.

Fig. 2: Design flow of the automatic code generation tool. [Figure: the CNN spec, user optimization goal and FPGA spec are the inputs; the Algorithmic Optimization Engine outputs the FFT sizes N_i and folding factors d_i; the Architectural Optimization Engine outputs the optimized accelerator design; the Code Generation Engine outputs a detailed text report and fully synthesizable Verilog.]

The tool is composed of three engines: the Algorithmic Optimization Engine, the Architectural Optimization Engine and the Code Generation Engine. The inputs to the tool are configuration files describing the CNN model and the FPGA hardware, and command line arguments specifying optional optimization constraints. The outputs of the tool are a detailed report containing the design parameters and performance estimates, and fully synthesizable Verilog code. See Section III-E for more information.

After reading the CNN specification from a configuration file, the Algorithmic Optimization Engine calculates a suitable FFT size N_i and Batch folding factor d_i for each convolution layer i. The values of N_i and d_i are chosen such that the overall number of operations is minimized. The Algorithmic Optimization Engine can operate in two modes:


• Throughput optimization mode: It achieves the maximum throughput without additional constraints on design parameters. The Algorithmic Optimization Engine outputs the N_i and d_i corresponding to the lowest possible computation complexity. N_i and d_i can be selected without any iterative procedure, since an analytical solution exists for these two parameters [13].

• Customized optimization mode: It supports user-defined optimization constraints. For example, in order to evaluate the impact of the CaP algorithm, users can set all d_i to 1 (so that CaP-OaA reduces to OaA). They can also limit the maximum 2D FFT size to understand the effect of FFT sizes for a given convolution layer. This mode is intended for experienced users, to facilitate their understanding of the CNN.

The inputs to the Architectural Optimization Engine are the N_i and d_i from the Algorithmic Optimization Engine, and a configuration file specifying the FPGA hardware. The Architectural Optimization Engine explores the hardware parameters using an efficient design space exploration algorithm (see Section III-C). The Architectural Optimization Engine also supports two modes:

• Throughput optimization mode: It identifies the configuration of individual hardware modules which achieves the highest possible throughput. The constraint is the available FPGA resources.

• Customized optimization mode: It allows users to define a customized design space for the hardware modules. For example, users can set the maximum allowed 2D FFT parallelism to understand the impact of the FFT hardware on overall performance.

The Architectural Optimization Engine outputs the detailed configuration of each hardware module. It also generates an analytical report, which includes the estimated performance (latency and throughput) as well as the resource consumption (on-chip memory, logic and external bandwidth). The report analyzes the utilization of the various hardware modules for each layer, which helps identify system bottlenecks.
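To make the tool inputs concrete, the snippet below sketches the kind of information the two configuration files carry, shown as Python dicts. All field names and device numbers are illustrative assumptions of ours; the actual file format belongs to the user interface (Section III-E) and is not reproduced here.

    # Hypothetical contents of the two input configuration files.
    cnn_spec = {
        # per-layer dimensions, following Table I (AlexNet)
        "conv1": {"l_img": 224, "l_kern": 11, "f_in": 3,  "f_out": 96},
        "conv2": {"l_img": 55,  "l_kern": 5,  "f_in": 96, "f_out": 256},
        # ... remaining convolution layers
    }
    fpga_spec = {
        "ext_bandwidth_GBps": 15.0,  # external (CPU-FPGA) bandwidth
        "on_chip_mem_Mbit": 25.0,    # on-chip memory size
        "logic_resources": 427200,   # logic (e.g., ALMs/LUTs)
    }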

Fig. 3: Architecture template. [Figure: the CPU and DRAM connect to the FPGA; on the FPGA, a kernel buffer and an image buffer feed the 2D FFT, HAC (MAC array) and 2D IFFT modules.]

The input to the Code Generation Engine is the hardware module configuration identified by the Architectural Optimization Engine. The detailed steps to generate the synthesizable Verilog code for the complete architecture are described in Section III-D.

B. Algorithmic Optimization Engine

The tool adopts an algorithm-architecture co-design methodology implemented by the Algorithmic Optimization Engine and the Architectural Optimization Engine. Details of the co-design can be found in [13]. In summary, motivated by its low computation complexity, the Algorithmic Optimization Engine is built upon frequency domain convolution (Section II-A). In order to sustain high throughput for various convolution layers, the engine integrates the CaP-OaA optimization (Sections II-B, II-C), since it flexibly adjusts the l_img dimension to fit the chosen FFT size. As a result, the Algorithmic Optimization Engine adopts a general approach to map the low complexity convolution algorithm onto a modular FFT architecture [12] of low hardware cost.

C. Architectural Optimization Engine

The architectural optimization performs design space exploration on our architecture template designed for frequency domain convolution. We use the optimal hardware configuration for each individual layer to determine the range of the architectural parameters in the design space. We then identify the optimal hardware configuration for the complete CNN on the resulting (bounded) design space. By decomposing a large design space into several much smaller ones, design space exploration can be performed efficiently even for deep CNN models and large FPGA devices.

The architecture template is shown in Figure 3. The FPGA performs frequency domain convolution, which is the most computation intensive task, while the CPU performs the other light-weight tasks involved in fully-connected layers, ReLU layers and pooling layers. The FPGA design consists of three hardware modules for 2D FFT, Hadamard product ACcumulation (HAC), and 2D IFFT. The HAC module performs the Hadamard product on each image and kernel tile, and then accumulates the resulting tiles along the f_out dimension.

During CNN inference, images are streamed in while the kernels are kept constant. Thus, the FFT operation is performed on the FPGA for image tiles, and offline for kernels as a preprocessing step. On-chip buffers are added between the FFT and HAC modules. Double buffering of image tiles overlaps CPU-FPGA memory transfer with FPGA computation. The preprocessing and memory transfer time for kernel tiles is amortized when classifying a large number of images: after CaP constructs sufficiently large images for a single layer, all the image tiles partitioned by OaA share the same kernel tiles in the kernel buffer.

The three modules are implemented as parameterized IP cores (see Section III-D). The tool specifies the set of parameters H, including the 2D FFT/IFFT data parallelism (p_FFT, p_IFFT) and folding factors (q_FFT, q_IFFT), the HAC data parallelism (p_img, p_kern) and folding factor (q_HAC), the image buffer size (M_img) and the kernel buffer size (M_kern). The set H for a complete CNN thus includes 9 variables. Instead of performing design space exploration on H directly, the tool first looks at each individual convolution layer. The set of parameters H′_i for any layer i contains only 6 variables (p_FFT, p_IFFT, q_FFT, q_IFFT, p_img and M_img) [13]. The tool uses the optimum in H′_i to set the range of the parameters in the design space. Finally, the tool identifies the optimum of H in the bounded design space. Algorithm 1 shows the design space exploration.

Algorithm 1: Exploration on Bounded Design Space

// OPT: hardware configuration in H producing the optimal throughput for a complete CNN.
// OPT[i]: hardware configuration in H′_i producing the optimal throughput for a single layer i (0 ≤ i ≤ l).
1: OPT[ ] ← NULL
2: for layer i = 0 to l do
3:     design space exploration over H′_i
4:     OPT[i] ← optimal hardware configuration for layer i
// R[d]: range of values for dimension d
5: R[ ] ← NULL
6: for dimension d = 0 to D do
7:     R[d] ← [ min_i{OPT[i][d]}, max_i{OPT[i][d]} ]
8: design space exploration for the complete CNN on the design space bounded by R[ ]
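A compact Python sketch of Algorithm 1 is given below. The brute-force search and the throughput callback are stand-ins of our own (the engine uses analytical performance models [13]), and the sketch glosses over the distinction between the 9-variable space H and the 6-variable per-layer spaces H′_i.

    import itertools

    def explore_bounded(layers, candidates, throughput):
        """Sketch of Algorithm 1. `candidates[d]` lists the possible values
        of design-space dimension d; `throughput(cfg, layers)` is a
        performance model callback (a stand-in for the analytical model)."""
        dims = list(candidates)
        # Lines 1-4: per-layer exploration; opt[i] is the best config for layer i.
        opt = [max((dict(zip(dims, c))
                    for c in itertools.product(*(candidates[d] for d in dims))),
                   key=lambda cfg: throughput(cfg, [layer]))
               for layer in layers]
        # Lines 5-7: bound each dimension d by [min_i OPT[i][d], max_i OPT[i][d]].
        bounded = {d: [v for v in candidates[d]
                       if min(o[d] for o in opt) <= v <= max(o[d] for o in opt)]
                   for d in dims}
        # Line 8: exploration for the complete CNN on the bounded space.
        return max((dict(zip(dims, c))
                    for c in itertools.product(*(bounded[d] for d in dims))),
                   key=lambda cfg: throughput(cfg, layers))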

D. Code Generation Engine

The Code Generation Engine generates the Verilog code for the FPGA, as well as the C++ code for the CPU. To support the various configurations defined by H, the Verilog code generation engine generates the corresponding control units for the 2D FFT, HAC and 2D IFFT modules. Figure 4 shows the detailed architecture of the black boxes in Figure 3. For the control unit of the 2D FFT (2D IFFT) module, the major challenge is to generate the signals controlling the various Streaming Permutation Networks (SPNs) [17]. An SPN realizes an arbitrary fixed

permutation on the input streaming data. For the control unit of the HAC module, the major challenge is to generate the addresses into the image and kernel buffers.

[Fig. 4: Architecture details for the 2D FFT and HAC modules. (a) The 2D FFT architecture: two N-point 1D FFT pipelines, with SPNs between adjacent FFT stages and an SPN realizing the N × N matrix transpose between them, form the N²-point 2D FFT. (b) The HAC architecture: the image and kernel buffers feed an array of MAC units.]

[Fig. 5: Architecture for SPN and FFT stages. (a) The SPN: crossbar stage 1, p single-port memories with address generation addr_{k,j} = P_{2,j} · addr_{k−1,j} (stage 2), and crossbar stage 3. (b) The FFT stage: radix butterflies with their control unit.]

Hardware generation (2D FFT). As shown in Figure 4a, the 2D FFT of an N × N matrix is computed by N-point 1D FFTs over the N rows, followed by N-point 1D FFTs over the N columns. Within a 1D FFT pipeline, SPNs of various sizes and strides connect adjacent FFT stages. Between the row FFT and column FFT pipelines, an SPN of size N² and stride N realizes the N × N matrix transpose. Let p and p′ denote the SPN data parallelism. Our automatic generation of SPNs is based on [17]. An SPN permuting m elements with data parallelism p can be realized by folding the CLOS network [18] m/p times. As shown in Figure 5a, stages 1 and 3 of the SPN perform permutation in space by crossbar switches. Stage 2 performs permutation in time by p single-port memories. The control signals for the three stages of an SPN can be derived from the control signals for the three stages of a CLOS network. Let P_{i,j} denote the permutation performed by the j-th crossbar of the i-th stage in a CLOS network. For i = 1 or 3, j ≤ m/p; the code generation engine feeds the control signals corresponding to P_{i,j} in consecutive m/p cycles to stages 1 and 3 of the corresponding SPN. For i = 2, j ≤ p; P_{2,j} permutes the size-(m/p) address sequence for the j-th single-port memory in the SPN. The code generation engine generates the address sequences by addr_{k,j} = P_{2,j} · addr_{k−1,j}, where k denotes the k-th set of m inputs.

The automatic generation of a size-m, stride-s SPN consists of two steps. In step 1, a Python script generates the control bitstream for the three stages of the SPN based on the CLOS network configuration. The routing algorithm of the CLOS network is described in [18]. In step 2, the code generation engine outputs the synthesizable Verilog by combining the control bitstream from step 1 with the hardware template for the SPN pipeline. Code 1 shows the generated address bitstream for a 4 × 4 matrix transposition SPN (m = 16, p = 4); addr_mem_i is the address sequence for memory i. The SPN fetches addresses from the bitstream by slicing 2 bits at a time in a circular manner. For example, the addresses into memory 1 in consecutive cycles are 1, 0, 3, 2, 0, 1, 2, 3, ....

[Code 1: the generated address bitstreams addr_mem_1, addr_mem_2, addr_mem_3, addr_mem_4 for the m = 16, p = 4 matrix transposition SPN.]
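The address-sequence recurrence above lends itself to a direct sketch. The function below iterates addr_{k,j} = P_{2,j} · addr_{k−1,j} to build each memory's bitstream; the concrete permutation and initial sequence in the example are our own choice, picked only because they reproduce the memory-1 pattern quoted above (the real values come from the CLOS routing of [18]).

    def memory_address_sequences(p2, addr0, num_sets):
        """Build the stage-2 address bitstream for each single-port memory.
        p2[j] encodes P_{2,j} as an index list (output t takes input p2[j][t]);
        addr0[j] is the size-(m/p) address sequence for the first input set."""
        streams = []
        for j, perm in enumerate(p2):
            addr, stream = list(addr0[j]), []
            for _ in range(num_sets):
                stream.extend(addr)             # emit addr_{k,j}
                addr = [addr[t] for t in perm]  # addr_{k+1,j} = P_{2,j} . addr_{k,j}
            streams.append(stream)
        return streams

    # With a hypothetical P_{2,1} = [1, 0, 3, 2] and initial sequence
    # [1, 0, 3, 2], two input sets give [1, 0, 3, 2, 0, 1, 2, 3], matching
    # the memory-1 example for the m = 16, p = 4 transposition SPN.
    print(memory_address_sequences([[1, 0, 3, 2]], [[1, 0, 3, 2]], 2)[0])

In the generated hardware, each address (0–3 here) occupies 2 bits, which is why the SPN slices the bitstream 2 bits at a time.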
