Automatically Optimising CNN with Depthwise Separable Convolution on FPGA Ruizhe Zhao, Xinyu Niu and Wayne Luk {ruizhe.zhao15, niu.xinyu10, w.luk}@imperial.ac.uk Department of Computing, Imperial College London
Contributions
We propose a novel approach to developing depthwise separable convolution for FPGA platforms, which contains:
- A hardware library covering typical CNN layers on FPGA, including the depthwise separable convolution layer.
- A model generator transforming conventional CNN models into ones that partially or fully use depthwise separable convolution layers to improve computation efficiency.
- A model compiler converting high-level CNN model descriptions, with or without depthwise separable convolution layers, into optimised hardware designs with quantisation.
On an Altera Stratix V (5SGSD8), the generated designs reach 231.2 frames per second on the full MobileNet v1 model, and a 3.43× speed-up on VGG-16, optimised automatically.
Design Space Exploration
Parameters for each Processing Unit (PU): 𝑃𝐹, 𝑃𝐶, 𝑃𝐾 - degree of parallelisation along the filter, channel, and kernel dimensions; 𝑇𝐻, 𝑇𝑊 - height and width of tiles; 𝐵𝑤 - bit width of the data type.
Exploration objective: maximise speed within resource limits.
Method: simulated annealing - state (parameter set), energy function (objective), random moves; solution found in ≤ 3 seconds.
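The exploration loop above can be sketched as a standard simulated-annealing search. This is a minimal illustration, not the authors' implementation: the state is restricted to the parallelisation parameters (𝑃𝐹, 𝑃𝐶, 𝑃𝐾), and the throughput and resource models are hypothetical placeholders standing in for the real performance and resource-usage models.

```python
import math
import random

def throughput(state):
    # Assumed objective: speed grows with total parallelism.
    p_f, p_c, p_k = state
    return p_f * p_c * p_k

def feasible(state):
    # Hypothetical resource limit on total multipliers.
    p_f, p_c, p_k = state
    return p_f * p_c * p_k <= 1024

def neighbour(state):
    # Move: randomly nudge one parameter up or down by 1.
    s = list(state)
    i = random.randrange(len(s))
    s[i] = max(1, s[i] + random.choice([-1, 1]))
    return tuple(s)

def anneal(start, iters=5000, t0=10.0):
    random.seed(0)
    best = cur = start
    for k in range(iters):
        t = t0 * (1 - k / iters) + 1e-9  # linear cooling schedule
        cand = neighbour(cur)
        if not feasible(cand):
            continue
        # Energy = negative throughput, so delta > 0 means a worse move.
        delta = throughput(cur) - throughput(cand)
        if delta <= 0 or random.random() < math.exp(-delta / t):
            cur = cand
        if throughput(cur) > throughput(best):
            best = cur
    return best

best = anneal((1, 1, 1))
```

Even this toy search finishes in well under a second, consistent with the ≤ 3 s solution time reported for the full parameter set.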
Depthwise Separable Convolution (DWS) Principle: separately learn the spatial and cross-channel correlations of a standard convolution layer by performing a depthwise convolution followed by a pointwise convolution. Advantage: for the same input and output shapes, depthwise separable convolution uses far fewer parameters and operations than the standard one.
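The advantage can be made concrete with the usual cost formulas: a standard K×K convolution with C input and F output channels needs K²·C·F parameters, while the depthwise (K²·C) plus pointwise (C·F) decomposition needs K²·C + C·F. A small sketch, using the single-layer setting from the experiments (H = W = 56, C = F = 512, K = 3):

```python
def conv_costs(h, w, c, f, k):
    """Parameter and operation counts (multiply counts) for a standard
    conv layer vs. its depthwise separable equivalent, same output shape."""
    std_params = k * k * c * f
    std_ops = h * w * std_params          # one K*K*C dot product per output
    dws_params = k * k * c + c * f        # depthwise + pointwise
    dws_ops = h * w * dws_params
    return std_params, std_ops, dws_params, dws_ops

# Single-layer setting from the experiment section.
sp, so, dp, do = conv_costs(56, 56, 512, 512, 3)
print(f"param reduction: {sp / dp:.2f}x, op reduction: {so / do:.2f}x")
```

The reduction factor is 1/F + 1/K² ≈ 0.113 here, i.e. roughly 8.8× fewer parameters and operations for this layer shape.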
Model Generator and Compiler The model generator searches for an optimal hybrid model containing both standard and depthwise separable convolution layers; the generation objective combines model size and accuracy. Method: starting from a pre-trained CNN model, we replace layers and fine-tune the result. By replacing the top-3 convolution layers of VGG-16, we can generate a CNN model with only 57.3% of the convolution parameters and even a ~3% accuracy improvement on the VGG Flowers dataset. The model compiler processes the generated model for FPGA deployment (Figure 3) by exploring the design space (block optimiser).
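The replacement step can be sketched as a pure parameter-count calculation. This toy version walks VGG-16's 13 conv layers (shapes listed below as (C, F, K); biases and the fine-tuning step are ignored) and swaps the last n layers for depthwise separable equivalents; with n = 3 it recovers the 57.3% figure quoted above.

```python
# (input channels C, output channels F, kernel size K) for VGG-16's conv layers.
VGG16_CONVS = [(3, 64, 3), (64, 64, 3),
               (64, 128, 3), (128, 128, 3),
               (128, 256, 3), (256, 256, 3), (256, 256, 3),
               (256, 512, 3), (512, 512, 3), (512, 512, 3),
               (512, 512, 3), (512, 512, 3), (512, 512, 3)]

def conv_params(layers, n_replace=0):
    """Total conv parameters when the last n_replace layers are replaced
    by depthwise separable convolutions."""
    total = 0
    for i, (c, f, k) in enumerate(layers):
        if i >= len(layers) - n_replace:
            total += k * k * c + c * f   # depthwise + pointwise
        else:
            total += k * k * c * f       # standard convolution
    return total

base = conv_params(VGG16_CONVS)
hybrid = conv_params(VGG16_CONVS, n_replace=3)
print(f"hybrid model keeps {hybrid / base:.1%} of conv parameters")
# → hybrid model keeps 57.3% of conv parameters
```

The top (deepest) layers are the most profitable to replace because they hold the 512-channel, and therefore largest, weight tensors.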
Figure 3. The compiler converts the CNN model to a DFG (Data-Flow Graph) and generates both the hardware design and software code for a given platform.
Figure 1. Comparison of standard (upper row) and depthwise separable convolution (lower row). N_C × H × W - shape of the input image; N_F - number of output channels. Each computation is marked with its number of parameters (N_params) and operations (N_ops).
Hardware Library DWS: a streaming architecture with two passes (depthwise and pointwise); multipliers are shared between the two passes, and the unit can be configured with the parallelisation parameters. Other layers are implemented in a similar streaming style.
Experiment Settings Evaluated on a Maxeler MAX4 platform with an Altera Stratix-V 5SGSD8 FPGA at 150 MHz. Single-layer setting for the standard vs. depthwise separable comparison: H = W = 56, C = F = 512, K = 3. Case-study models: MobileNet v1 (full) and VGG-16. Software baselines: CPU - Intel i7-950; GPU - Titan X; TensorFlow v1.3.
Evaluation Results The depthwise separable convolution design reaches a 7.95× speed-up over the standard convolution design. Resource usage of the generated FPGA design (8-bit) for MobileNet: LUT 46.1%, FF 43.6%, DSP 83.6%, BRAM 78.5%. Speed (FPS): 24.3 (CPU) < 231.2 (ours) < 289 (GPU). Energy/frame (J): 0.118 (ours) < 0.376 (CPU) < 0.491 (GPU). Comparison of the generated VGG-16 designs (STD for the original model, DWS for the layer-replaced model) with previous publications:
Figure 2. Hardware architecture. Different passes are marked in different colours.
              [1]            [2]              STD              DWS
Board         ZCU102         Stratix 5SGXA7   Stratix 5GSD8    Stratix 5GSD8
𝐵𝑤, Freq      16, 200 MHz    16, 150 MHz      16, 150 MHz      8, 150 MHz
FPS (ratio)   95.23 (7.60)   11.38 (0.91)     12.53 (1.0)      43.01 (3.43)
Power (W)     23.6           -                26.5             27.1
[1] L. Lu et al. Evaluating fast algorithms for convolutional neural networks on FPGAs, FCCM, 2017. [2] Y. Ma et al. An automatic RTL compiler for high-throughput FPGA implementation of diverse deep convolutional neural networks, FPL, 2017.
Acknowledgements: The support of UK EPSRC (EP/L00058X/1, EP/L016796/1, EP/P010040/1 and EP/N031768/1), the European Horizon 2020 Research and Innovation Programme under grant agreement number 671653, Maxeler and Intel is gratefully acknowledged.