A fast, embedded implementation of a Convolutional Neural Network for Image Recognition – Revisited

V.K. Pothos, D. Kastaniotis, I. Theodorakopoulos and N. Fragoulis1
Introduction

During the past few years, convolutional neural networks (CNNs) have been established as the dominant technology for approaching real-world visual understanding tasks. A significant research effort has been put into the design of very deep architectures, able to construct high-order representations of visual information. The accuracy obtained by deep architectures such as GoogLeNet [1] and the more recent ResNet [2] on image classification and object detection tasks proved that depth of representation is indeed the key to a successful implementation.

Although high-quality implementations are already available for mainstream, PC-like computing systems, deploying such systems into diverse technological areas (e.g. automotive, transportation, IoT, medical) requires implementing deep-learning architectures on embedded platforms with less powerful hardware. Meeting particular performance requirements on embedded platforms is, in general, difficult: building systems on top of existing computing libraries (e.g. BLAS, Eigen) generally achieves only limited effectiveness [5].

It therefore becomes evident that improving on such approaches requires tuning multiple computational kernels for the particular use-case at hand. This demands considerable effort from highly specialized programming teams, capable both of producing high-efficiency code for a target platform and of understanding the details of CNN computations and algorithms well enough to tweak, if necessary, any given architecture. This is exactly the case with the Irida Labs team.
1 Irida Labs S.A., www.iridalabs.gr, tel: +302610992965, email: [email protected]
Our team comprises young, highly skilled scientists and engineers with a variety of diverse, though complementary, skills, ranging from high-performance, low-level programming to algorithmic development and optimization. The potential of the Irida Labs team becomes evident in this white paper, which presents the development and implementation of a complex deep-learning network.
Tweaking the algorithms: The SqueezeNet architecture

A basic downside of deep-learning architectures is that they require hundreds of megabytes of coefficients for their convolutional kernels to operate. Such requirements can render the embedded implementation of these networks prohibitive. Imagine a scenario where a CNN has to operate on a video stream captured by a smartphone in order to produce real-time video annotation. The allocation and data transfers needed to load e.g. 600 MB of coefficients into an embedded device's memory is a rather intense workload, particularly when it has to be completed within a limited time window, starting when the user opens the camera app (i.e. at initialization) and ending when the video recording starts.

In order to address such issues, significant research effort has recently shifted towards architectures that require far fewer coefficients. In particular, the recently presented SqueezeNet 1.1 [3] architecture achieves classification accuracy on ImageNet similar to the baseline AlexNet [4] architecture, using 50 times fewer coefficients. The smart combination of small convolutional kernels and a complex architecture that lets information flow through different paths facilitates the construction of sufficiently high-order image representations, suitable for a large variety of applications. A coefficient size of about 3 MB, easily reduced further by a factor of 5 via model-compression techniques, makes SqueezeNet a very appealing architecture for embedded implementations.
Figure 1: SqueezeNet CNN Architecture
As depicted in Figure 1, SqueezeNet begins with a standalone convolution layer (conv1), followed by 8 Fire modules (fire2-fire9), and ends with a final convolution layer (conv10). A Fire module is comprised of a squeeze convolution layer (which has only 1x1 filters) feeding into an expand layer that has a mix of 1x1 and 3x3 convolution filters; a sketch of the module is given below. The number of filters per Fire module is gradually increased from the beginning to the end of the network. SqueezeNet performs max-pooling with a stride of 2 after conv1, fire4, fire8, and conv10. It is also worth noting that SqueezeNet contains no fully-connected layers.
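To make the Fire module concrete, the following is a minimal sketch in PyTorch-style Python. The training in this work was done with Caffe [7], so this snippet is for illustration only; the fire2 channel counts follow the SqueezeNet 1.1 definition in [3].

```python
import torch
import torch.nn as nn

class Fire(nn.Module):
    """Squeeze (1x1) layer feeding parallel 1x1 and 3x3 expand layers."""
    def __init__(self, in_ch, squeeze_ch, expand1x1_ch, expand3x3_ch):
        super().__init__()
        # Squeeze layer: 1x1 convolutions reduce the channel count.
        self.squeeze = nn.Conv2d(in_ch, squeeze_ch, kernel_size=1)
        # Expand layer: a mix of 1x1 and 3x3 filters, concatenated.
        self.expand1x1 = nn.Conv2d(squeeze_ch, expand1x1_ch, kernel_size=1)
        self.expand3x3 = nn.Conv2d(squeeze_ch, expand3x3_ch,
                                   kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.squeeze(x))
        return torch.cat([self.relu(self.expand1x1(x)),
                          self.relu(self.expand3x3(x))], dim=1)

# Example: fire2 squeezes 64 input channels down to 16,
# then expands to 64 + 64 = 128 output channels.
fire2 = Fire(64, 16, 64, 64)
out = fire2(torch.randn(1, 64, 56, 56))
```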
Functionality, Learning and Training

The same architecture has been implemented in two individual systems, corresponding to two different use cases.
Use Case 1: Image recognition and tagging

In this implementation the CNN architecture has been trained to perform image tagging and is able to discriminate between 12 general image categories: the 10 categories tagged in the MIRFLICKR-25000 database [6], plus two extra categories, namely Documents and Food. The training dataset is the MIRFLICKR-25000 database, augmented with 6,000 images tagged as "Documents" and "Food". The dataset is then further augmented by cropping each image at 5 different frames and mirroring each crop about the vertical axis, resulting in 10 times the initial number of images (a sketch of this scheme is given below). Training was performed using the Caffe [7] deep-learning framework, and the accuracy achieved, in terms of average equal error rate (EER), is 3.7%.
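The augmentation scheme described above (five crops, each also mirrored, for a 10x expansion) could be sketched as follows in Python using PIL; the crop size and frame positions are assumptions, as the exact frames are not specified here.

```python
from PIL import Image

def augment(img, crop_size=227):
    """Return 10 images: five crops plus a mirror of each crop."""
    w, h = img.size
    cs = crop_size
    # Five crop frames: four corners plus the centre (an assumption).
    boxes = [(0, 0, cs, cs), (w - cs, 0, w, cs),
             (0, h - cs, cs, h), (w - cs, h - cs, w, h),
             ((w - cs) // 2, (h - cs) // 2, (w + cs) // 2, (h + cs) // 2)]
    crops = [img.crop(b) for b in boxes]
    # Mirror each crop about the vertical axis, for 10 images per input.
    return crops + [c.transpose(Image.FLIP_LEFT_RIGHT) for c in crops]
```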
Use Case 2: Food recognition

In this implementation the CNN architecture has been trained to discriminate between the 101 food categories tagged in the Food-101 database [9], which comprises some 101,000 images. The dataset is then augmented in the same way, by cropping each image at 5 different frames and mirroring each crop. Training was performed using the Caffe [7] deep-learning framework, and the accuracy achieved, in terms of average recognition rate, is 72% at Rank 1, 85% at Rank 3 and 91% at Rank 5.
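For reference, the Rank-1/3/5 recognition rates quoted above can be computed from per-image class scores as in the sketch below; the function name and array layout are illustrative.

```python
import numpy as np

def rank_k_accuracy(scores, labels, k):
    """scores: (N, C) class scores; labels: (N,) ground-truth indices."""
    # Indices of the k highest-scoring classes for each sample.
    topk = np.argsort(scores, axis=1)[:, -k:]
    # A sample counts as correct if its label is among the top k.
    hits = np.any(topk == labels[:, None], axis=1)
    return hits.mean()

# e.g. rank_k_accuracy(scores, labels, 5) -> 0.91 for the food model
```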
Achieving a high-performance embedded software implementation

In the following sections, an implementation of SqueezeNet 1.1 on the Qualcomm Snapdragon 820 MDP tablet is presented. This next-generation development hardware is based on the Qualcomm® Snapdragon™ 820 processor, which includes a Qualcomm® Kryo™ CPU, a Qualcomm® Adreno™ 530 GPU and a Qualcomm® Hexagon™ 680 DSP, along with the latest available Android OS. Our implementation follows a heterogeneous programming approach: the Kryo CPU cores are used mainly for housekeeping and data feeding, while the bulk of the computation runs on the Adreno 530 GPU, which is programmed in OpenCL in a fully optimized way. Optimizations include, among others, optimal utilization of the Adreno GPU computing resources, optimal memory management, fast external-memory access (zero copy; see the sketch below), loop unrolling, and minimization of the data bandwidth between kernels (thus reducing the necessary loads/stores).
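As an illustration of the zero-copy technique mentioned above, the following pyopencl snippet allocates a host-accessible OpenCL buffer and fills it in place via mapping. The production implementation is native OpenCL code tuned for the Adreno GPU, so this is only a sketch of the principle.

```python
import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
mf = cl.mem_flags

# On a shared-memory SoC such as the Snapdragon 820, ALLOC_HOST_PTR
# memory is visible to both CPU and GPU, so no copy is required.
n = 1 << 20  # one million floats (illustrative size)
buf = cl.Buffer(ctx, mf.READ_WRITE | mf.ALLOC_HOST_PTR, size=n * 4)

# Map the buffer into host address space and fill it in place;
# this replaces an explicit clEnqueueWriteBuffer (and its copy).
host_view, _ = cl.enqueue_map_buffer(
    queue, buf, cl.map_flags.WRITE, 0, (n,), np.float32)
host_view[:] = 0.0
del host_view  # releasing the mapped array unmaps the buffer
```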
Speed results and comparison

In this section the speed results of our implementation are presented, together with speed measurements for a typical open-source Caffe implementation on the same platform, for comparison. The performance of the Irida Labs implementation on the Snapdragon 820, in terms of speed and power consumption, is given in Table 1 (for the highest GPU clock frequency, 624 MHz) and in Table 2 (for the lowest GPU clock frequency, 133 MHz).

Table 1: Inference performance of the Irida Labs implementation, highest GPU clock (624 MHz)
Network: SqueezeNet 1.1, Snapdragon 820, FP32

Batch Size   GPU Utilization   Inference Speed   Power Consumption(1)
1            53%               43.00 msec        2.47 W
10           92%               21.5 msec         2.32 W(2)

(1) Power measurements were taken using the Trepn power profiling tool distributed by Qualcomm.
(2) For 10 images, i.e. 232 mW per image.
Table 2: Inference performance of the Irida Labs implementation, lowest GPU clock (133 MHz)
Network: SqueezeNet 1.1, Snapdragon 820, FP32

Batch Size   GPU Utilization   Inference Speed   Power Consumption(1)
1            75%               120.00 msec       0.99 W
10           90%               85.1 msec         0.87 W(2)

(1) Power measurements were taken using the Trepn power profiling tool distributed by Qualcomm.
(2) For 10 images, i.e. 87 mW per image.
The performance of the native Android Caffe implementation [8] on the Snapdragon 820 is given in Table 3. Note that this implementation does not use the GPU but rather the Kryo CPU cores, and is based on the Eigen 3 library.

Table 3: Inference performance of the Caffe implementation
Network: SqueezeNet 1.1, Snapdragon 820, FP32

Batch Size   Inference Speed   Power Consumption(1)
1            270 msec          -
20           208 msec          6.0 W(2)

(1) Power measurements were taken using the Trepn power profiling tool distributed by Qualcomm.
(2) For 20 images, i.e. 300 mW per image.
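For clarity, the per-image power figures quoted in the footnotes of Tables 1-3 are simply the measured batch power divided by the batch size:

```python
# (power in watts, batch size) pairs taken from Tables 1-3
for power_w, batch in [(2.32, 10), (0.87, 10), (6.0, 20)]:
    print(f"{power_w:.2f} W / {batch} images = "
          f"{1000 * power_w / batch:.0f} mW per image")
# -> 232, 87 and 300 mW per image, as quoted in the footnotes
```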
As is evident, the focused programming approach followed by Irida Labs yields a far more efficient implementation: at the highest GPU clock it is roughly six times faster than the native Caffe port at batch size 1 (43 msec vs. 270 msec), at a comparable or lower per-image power cost.
References

[1] Szegedy, Christian, et al. "Going deeper with convolutions." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[2] He, Kaiming, et al. "Deep residual learning for image recognition." arXiv preprint arXiv:1512.03385, 2015.
[3] Iandola, Forrest N., et al. "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size." arXiv preprint arXiv:1602.07360, 2016.
[4] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet classification with deep convolutional neural networks." Advances in Neural Information Processing Systems, 2012.
[5] S. Chintala, "convnet-benchmarks," https://github.com/soumith/convnet-benchmarks, 2016. [Online; accessed 24-July-2016].
[6] M. J. Huiskes and M. S. Lew. "The MIR Flickr retrieval evaluation." ACM International Conference on Multimedia Information Retrieval (MIR'08), Vancouver, Canada, 2008.
[7] Caffe deep-learning framework, http://caffe.berkeleyvision.org
[8] Caffe Android library, https://github.com/sh1r0/caffe-android-lib
[9] Food-101 dataset, https://www.vision.ee.ethz.ch/datasets_extra/food-101/
About Irida Labs

Irida Labs (Patras, Greece, www.iridalabs.gr) is bridging the gap between a camera and the human eye by bringing visual perception to any device. The company develops Computer Vision software, based on Image Processing and Machine Learning techniques, for any CPU, GPU or DSP/ASP platform, or for combinations of them through heterogeneous programming. The Irida Labs portfolio includes applications in Computational Photography and Visual Perception/Analytics, addressing markets such as mobile devices, action cameras, drones, surveillance, automotive, industrial and robot vision.