Automatically Optimising CNN with Depthwise Separable Convolution on FPGA Ruizhe Zhao, Xinyu Niu and Wayne Luk {ruizhe.zhao15, niu.xinyu10, w.luk}@imperial.ac.uk Department of Computing, Imperial College London
Contributions
We propose a novel approach to developing depthwise separable convolution for FPGA platforms, which contains:
- A hardware library covering typical CNN layers on FPGA, including the depthwise separable convolution layer.
- A model generator transforming conventional CNN models into ones that partially or fully use depthwise separable convolution layers to improve computation efficiency.
- A model compiler converting high-level CNN model descriptions, with or without depthwise separable convolution layers, into optimised hardware designs with quantisation.
On an Altera Stratix V (5SGSD8), the generated designs reach 231.2 frames per second on the full MobileNet v1 model, and a 3.43× speed-up on VGG-16, optimised automatically.
Design Space Exploration
Parameters for each Processing Unit (PU): 𝑃𝐹, 𝑃𝐶, 𝑃𝐾 - degree of parallelisation along the filter, channel, and kernel dimensions; 𝑇𝐻, 𝑇𝑊 - height and width of tiles; 𝐵𝑤 - bit width of the data type.
Exploration objective: maximise speed within resource limits.
Method: simulated annealing - state (parameter set), energy function (objective), random moves; solution found in ≤ 3 seconds.
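The exploration loop above can be sketched as a standard simulated-annealing search. This is a minimal illustration, not the authors' implementation: the state is restricted to the parallelisation parameters (𝑃𝐹, 𝑃𝐶, 𝑃𝐾), and the throughput and resource models are hypothetical placeholders standing in for the real performance and resource-usage models.

```python
import math
import random

def throughput(state):
    # Assumed objective: speed grows with total parallelism.
    p_f, p_c, p_k = state
    return p_f * p_c * p_k

def feasible(state):
    # Hypothetical resource limit on total multipliers.
    p_f, p_c, p_k = state
    return p_f * p_c * p_k <= 1024

def neighbour(state):
    # Move: randomly nudge one parameter up or down by 1.
    s = list(state)
    i = random.randrange(len(s))
    s[i] = max(1, s[i] + random.choice([-1, 1]))
    return tuple(s)

def anneal(start, iters=5000, t0=10.0):
    random.seed(0)
    best = cur = start
    for k in range(iters):
        t = t0 * (1 - k / iters) + 1e-9  # linear cooling schedule
        cand = neighbour(cur)
        if not feasible(cand):
            continue
        # Energy = negative throughput, so delta > 0 means a worse move.
        delta = throughput(cur) - throughput(cand)
        if delta <= 0 or random.random() < math.exp(-delta / t):
            cur = cand
        if throughput(cur) > throughput(best):
            best = cur
    return best

best = anneal((1, 1, 1))
```

Even this toy search finishes in well under a second, consistent with the ≤ 3 s solution time reported for the full parameter set.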
Depthwise Separable Convolution (DWS) Principle: separately learn the spatial and cross-channel correlations of a standard convolution layer by performing a depthwise convolution followed by a pointwise convolution. Advantage: for the same input and output shapes, depthwise separable convolution uses far fewer parameters and operations than the standard one.
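The advantage can be made concrete with the usual cost formulas: a standard K×K convolution with C input and F output channels needs K²·C·F parameters, while the depthwise (K²·C) plus pointwise (C·F) decomposition needs K²·C + C·F. A small sketch, using the single-layer setting from the experiments (H = W = 56, C = F = 512, K = 3):

```python
def conv_costs(h, w, c, f, k):
    """Parameter and operation counts (multiply counts) for a standard
    conv layer vs. its depthwise separable equivalent, same output shape."""
    std_params = k * k * c * f
    std_ops = h * w * std_params          # one K*K*C dot product per output
    dws_params = k * k * c + c * f        # depthwise + pointwise
    dws_ops = h * w * dws_params
    return std_params, std_ops, dws_params, dws_ops

# Single-layer setting from the experiment section.
sp, so, dp, do = conv_costs(56, 56, 512, 512, 3)
print(f"param reduction: {sp / dp:.2f}x, op reduction: {so / do:.2f}x")
```

The reduction factor is 1/F + 1/K² ≈ 0.113 here, i.e. roughly 8.8× fewer parameters and operations for this layer shape.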
Model Generator and Compiler The model generator searches for an optimal hybrid model containing both standard and depthwise separable convolution layers; the generation objective combines model size and accuracy. Method: starting from a pre-trained CNN model, we replace layers and fine-tune the result. By replacing the top-3 convolution layers of VGG-16, we can generate a CNN model with only 57.3% of the convolution parameters and even a ~3% accuracy improvement on the VGG Flowers dataset. The model compiler processes the generated model for FPGA deployment (Figure 3) by exploring the design space (block optimiser).
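The replacement step can be sketched as a pure parameter-count calculation. This toy version walks VGG-16's 13 conv layers (shapes listed below as (C, F, K); biases and the fine-tuning step are ignored) and swaps the last n layers for depthwise separable equivalents; with n = 3 it recovers the 57.3% figure quoted above.

```python
# (input channels C, output channels F, kernel size K) for VGG-16's conv layers.
VGG16_CONVS = [(3, 64, 3), (64, 64, 3),
               (64, 128, 3), (128, 128, 3),
               (128, 256, 3), (256, 256, 3), (256, 256, 3),
               (256, 512, 3), (512, 512, 3), (512, 512, 3),
               (512, 512, 3), (512, 512, 3), (512, 512, 3)]

def conv_params(layers, n_replace=0):
    """Total conv parameters when the last n_replace layers are replaced
    by depthwise separable convolutions."""
    total = 0
    for i, (c, f, k) in enumerate(layers):
        if i >= len(layers) - n_replace:
            total += k * k * c + c * f   # depthwise + pointwise
        else:
            total += k * k * c * f       # standard convolution
    return total

base = conv_params(VGG16_CONVS)
hybrid = conv_params(VGG16_CONVS, n_replace=3)
print(f"hybrid model keeps {hybrid / base:.1%} of conv parameters")
# → hybrid model keeps 57.3% of conv parameters
```

The top (deepest) layers are the most profitable to replace because they hold the 512-channel, and therefore largest, weight tensors.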
Figure 3. The compiler converts the CNN model to a DFG (Data-Flow Graph) and generates both the hardware design and software code for a given platform.
Figure 1. Comparison of standard (upper row) and depthwise separable convolution (lower row). N_C × H × W - shape of the input image; N_F - number of output channels. Each computation is marked with its number of parameters (N_params) and operations (N_ops).
Hardware Library DWS: a streaming architecture with two passes (depthwise and pointwise); multipliers are shared between the two passes, and the unit can be configured with the parallelisation parameters. Other layers are implemented in a similar streaming style.
Experiment Settings Evaluated on a Maxeler MAX4 platform with an Altera Stratix-V 5SGSD8 FPGA at 150 MHz. Single-layer setting for the standard vs. depthwise separable comparison: H = W = 56, C = F = 512, K = 3. Case-study models: MobileNet v1 (full) and VGG-16. Software baselines: CPU - Intel i7-950; GPU - Titan X; TensorFlow v1.3.
Evaluation Results The depthwise separable convolution design reaches a 7.95× speed-up over the standard convolution design. Resource usage of the generated FPGA design (8-bit) for MobileNet: LUT 46.1%, FF 43.6%, DSP 83.6%, BRAM 78.5%. Speed (FPS): 24.3 (CPU) < 231.2 (ours) < 289 (GPU). Energy/frame (J): 0.118 (ours) < 0.376 (CPU) < 0.491 (GPU). Comparison of the generated VGG-16 designs (STD for the original model, DWS for the layer-replaced model) with previous publications:
Figure 2. Hardware architecture. Different passes are marked in different colours.
              [1]            [2]              STD              DWS
Board         ZCU102         Stratix 5SGXA7   Stratix 5GSD8    Stratix 5GSD8
𝐵𝑤, Freq      16, 200 MHz    16, 150 MHz      16, 150 MHz      8, 150 MHz
FPS (ratio)   95.23 (7.60)   11.38 (0.91)     12.53 (1.0)      43.01 (3.43)
Power (W)     23.6           -                26.5             27.1
[1] L. Lu et al. Evaluating fast algorithms for convolutional neural networks on FPGAs, FCCM, 2017. [2] Y. Ma et al. An automatic RTL compiler for high-throughput FPGA implementation of diverse deep convolutional neural networks, FPL, 2017.
Acknowledgements: The support of UK EPSRC (EP/L00058X/1, EP/L016796/1, EP/P010040/1 and EP/N031768/1), the European Horizon 2020 Research and Innovation Programme under grant agreement number 671653, Maxeler and Intel is gratefully acknowledged.