Automatically Optimising CNN with Depthwise Separable Convolution on FPGA

Ruizhe Zhao, Xinyu Niu and Wayne Luk
{ruizhe.zhao15, niu.xinyu10, w.luk}@imperial.ac.uk
Department of Computing, Imperial College London

Contributions

Design Space Exploration

We propose a novel approach to developing depthwise separable convolution for FPGA platforms, which contains:
 A hardware library covering typical CNN layers on FPGA, including the depthwise separable convolution layer.
 A model generator that transforms conventional CNN models into ones that partially or fully use depthwise separable convolution layers to improve computation efficiency.
 A model compiler that converts high-level CNN model descriptions, with or without depthwise separable convolution layers, into optimised hardware designs with quantisation.
On an Altera Stratix V (5SGSD8), generated designs can reach:
 231.2 frames per second for the full MobileNet v1 model,
 3.43 times speed-up on VGG-16, automatically optimised.

 Parameters for each Processing Unit (PU): 𝑃𝐹, 𝑃𝐶, 𝑃𝐾 - level of parallelisation along the filter, channel, and kernel dimensions; 𝑇𝐻, 𝑇𝑊 - height and width of tiles; 𝐵𝑤 - bit width of the data type
 Exploration objective: maximise speed within resource limits
 Method: simulated annealing - state (a parameter set), energy function (the objective), random moves; solution time ≤ 3 secs
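The annealing loop described above can be sketched as follows. The poster does not give the exact resource and throughput cost models, so `resources`, `throughput`, and the DSP budget below are hypothetical placeholders; only the overall state/energy/move structure follows the description.

```python
import math
import random

DSP_BUDGET = 1963  # illustrative multiplier budget for the target FPGA

def resources(p):
    # Hypothetical cost model: one multiplier per parallel MAC.
    return p["PF"] * p["PC"] * p["PK"]

def throughput(p):
    # Hypothetical speed model: proportional to total parallelism.
    return p["PF"] * p["PC"] * p["PK"]

def energy(p):
    # Annealing minimises energy: negate speed, reject infeasible states.
    if resources(p) > DSP_BUDGET:
        return float("inf")
    return -throughput(p)

def move(p):
    # Randomly nudge one parameter up or down (states stay positive).
    q = dict(p)
    k = random.choice(list(q))
    q[k] = max(1, q[k] + random.choice([-1, 1]))
    return q

def anneal(p, steps=5000, t0=10.0):
    best, e_best = p, energy(p)
    cur, e_cur = p, e_best
    for i in range(steps):
        t = t0 * (1 - i / steps) + 1e-9  # linear cooling schedule
        q = move(cur)
        e_q = energy(q)
        # Accept improvements always, worse states with probability exp(-d/t).
        if e_q < e_cur or random.random() < math.exp((e_cur - e_q) / t):
            cur, e_cur = q, e_q
            if e_q < e_best:
                best, e_best = q, e_q
    return best

start = {"PF": 1, "PC": 1, "PK": 1, "TH": 7, "TW": 7}
best = anneal(start)
```

The returned state is always feasible, since infeasible states have infinite energy and can never become the best solution.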

Depthwise Separable Convolution (DWS)
 Principle: separately learns the spatial and cross-channel correlations of a standard convolution layer by performing a depthwise and then a pointwise convolution.
 Advantage: with the same input and output shapes, depthwise separable convolution uses far fewer parameters and operations than the standard one.
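The parameter and operation savings follow directly from the counts in Figure 1. A quick sketch using the standard MobileNet-style analysis (bias terms omitted), evaluated at the single-layer setting from the experiments (H=W=56, C=F=512, K=3):

```python
# Standard convolution: every output channel convolves all C input channels.
def std_conv_costs(C, F, H, W, K):
    params = K * K * C * F
    ops = params * H * W          # one MAC per weight per output position
    return params, ops

# Depthwise separable: K x K depthwise filters plus 1x1 pointwise filters.
def dws_conv_costs(C, F, H, W, K):
    params = K * K * C + C * F
    ops = params * H * W
    return params, ops

p_std, o_std = std_conv_costs(512, 512, 56, 56, 3)
p_dws, o_dws = dws_conv_costs(512, 512, 56, 56, 3)
ratio = p_std / p_dws             # roughly 8.8x fewer parameters and MACs
```

The ~8.8x reduction in MACs at this layer shape is consistent in magnitude with the 7.95x hardware speed-up reported in the evaluation.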

Model Generator and Compiler
 The model generator searches for an optimal hybrid model containing both standard and depthwise separable convolution layers.
 The generation objective combines model size and accuracy.
 Method: starting from a pre-trained CNN model, we replace layers and fine-tune the result. By replacing the top-3 convolution layers in VGG-16, we generate a CNN model with 57.3% of the original conv parameters and even a ~3% accuracy improvement on the VGG Flowers dataset.
 The model compiler processes the generated model for FPGA deployment (Figure 3) by exploring the design space (block optimiser).
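One way the combined size/accuracy objective could look is sketched below. The weighting scheme, the `alpha` parameter, and the accuracy numbers for the original model are illustrative assumptions, not from the poster; only the top-3 VGG-16 result (57.3% of conv parameters, ~+3% accuracy) is reported.

```python
# Hypothetical generation objective: reward accuracy, penalise model size.
def generation_score(params_ratio, accuracy, alpha=0.5):
    # params_ratio: candidate conv parameters / original conv parameters
    return alpha * accuracy - (1 - alpha) * params_ratio

# Candidate hybrid models: (conv layers replaced, param ratio, accuracy).
# Accuracy values are placeholders, except the +3% gap for the top-3 case.
candidates = [
    (0, 1.000, 0.90),   # original model
    (3, 0.573, 0.93),   # top-3 conv layers replaced, then fine-tuned
]
best = max(candidates, key=lambda c: generation_score(c[1], c[2]))
```

Under this objective the top-3 replacement wins on both axes at once, which matches the poster's observation that replacement can shrink the model and improve accuracy.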

Figure 1. Comparison of standard (upper row) and depthwise separable convolution (lower row). NC x H x W - shape of the input image; NF - number of output channels. Each computation is marked with its number of parameters (Nparams) and operations (Nops).

Figure 3. The compiler converts the CNN model to a DFG (Data-Flow Graph) and generates both the hardware design and software code for a given platform.

Hardware Library
 DWS: a streaming architecture with two passes (depthwise and pointwise); multipliers are shared between the passes; the design can be configured with parallelisation parameters.
 Other layers are implemented in a similar style.
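The two-pass dataflow can be sketched functionally: a depthwise pass filters each channel independently, then a pointwise (1x1) pass mixes channels. This models the dataflow only; the shared-multiplier scheduling and the parallelisation parameters (𝑃𝐹, 𝑃𝐶, 𝑃𝐾) are hardware details not captured here.

```python
import numpy as np

def depthwise_pass(x, dw):
    # x: (C, H, W) input; dw: (C, K, K) one K x K filter per channel.
    C, H, W = x.shape
    K = dw.shape[1]
    pad = K // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))  # same-size output
    out = np.zeros((C, H, W))
    for c in range(C):
        for i in range(H):
            for j in range(W):
                out[c, i, j] = np.sum(xp[c, i:i + K, j:j + K] * dw[c])
    return out

def pointwise_pass(x, pw):
    # pw: (F, C) 1x1 filters; contracts the channel axis only.
    return np.tensordot(pw, x, axes=([1], [0]))       # -> (F, H, W)

x = np.random.rand(4, 8, 8)
dw = np.random.rand(4, 3, 3)
pw = np.random.rand(6, 4)
y = pointwise_pass(depthwise_pass(x, dw), pw)
```

Streaming the depthwise output directly into the pointwise pass, as here, is what lets the hardware avoid buffering a full intermediate feature map.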

Experiment Settings
 Evaluated on a Maxeler MAX4 platform with an Altera Stratix-V 5SGSD8 FPGA at 150 MHz
 Single convolution layer setting for the standard vs depthwise separable comparison: H=W=56, C=F=512, K=3
 Case study models: MobileNet v1 (full) and VGG-16
 Software baselines: CPU - Intel i7-950; GPU - Titan X; TensorFlow (v1.3)

Evaluation Results
 The depthwise separable convolution design reaches a 7.95 times speed-up over standard convolution.
 Generated FPGA design (8-bit) for MobileNet: LUT (46.1%), FF (43.6%), DSP (83.6%), BRAM (78.5%)
   Speed (FPS): 24.3 (CPU) < 231.2 (Ours) < 289 (GPU)
   Energy/frame (J): 0.118 (Ours) < 0.376 (CPU) < 0.491 (GPU)
 Comparing generated VGG-16 designs (STD for the original model, DWS for the layer-replaced model) with previous publications:

Figure 2. Hardware architecture. Different passes are marked in different colours.

              [1]            [2]              STD             DWS
Board         ZCU102         Stratix 5SGXA7   Stratix 5SGSD8  Stratix 5SGSD8
𝑩𝒘, Freq      16, 200 MHz    16, 150 MHz      16, 150 MHz     8, 150 MHz
FPS (ratio)   95.23 (7.60)   11.38 (0.91)     12.53 (1.0)     43.01 (3.43)
Power (W)     23.6           -                26.5            27.1

[1] L. Lu et al. Evaluating fast algorithms for convolutional neural networks on FPGAs. FCCM, 2017.
[2] Y. Ma et al. An automatic RTL compiler for high-throughput FPGA implementation of diverse deep convolutional neural networks. FPL, 2017.

Acknowledgements: The support of UK EPSRC (EP/L00058X/1, EP/L016796/1, EP/P010040/1 and EP/N031768/1), the European Horizon 2020 Research and Innovation Programme under grant agreement number 671653, Maxeler and Intel is gratefully acknowledged.