
LETTER

IEICE Electronics Express, Vol.14, No.1, 1–8

An efficient implementation of 2D convolution in CNN

Jing Chang a) and Jin Sha b)

School of Electrical Science and Engineering, Nanjing University, Nanjing 210046, People's Republic of China
a) [email protected]
b) [email protected]

Abstract: Convolutional neural network (CNN), a well-known machine learning algorithm, has been widely used in the field of computer vision for its outstanding performance in image classification. With the rapid growth of applications based on CNN, various acceleration schemes have been proposed on FPGA, GPU and ASIC. In these specific hardware accelerators, the most challenging part is the implementation of the 2D convolution. To obtain a more efficient design of the 2D convolution in CNN, this paper proposes a novel technique, singular value decomposition approximation (SVDA), to reduce resource usage. Experimental results show that the proposed SVDA hardware implementation achieves a reduction in resources in the range of 14.46% to 37.8%, while the loss of classification accuracy is less than 1%.

Keywords: CNN, 2D convolution, hardware implementation

Classification: Integrated circuits

© IEICE 2017 DOI: 10.1587/elex.13.20161134 Received November 16, 2016 Accepted November 30, 2016 Publicized December 16, 2016 Copyedited January 10, 2017

References

[1] C. Farabet, et al.: "CNP: An FPGA-based processor for convolutional networks," FPL (2009) 32 (DOI: 10.1109/FPL.2009.5272559).
[2] C. Garcia and M. Delakis: "Convolutional face finder: A neural architecture for fast and robust face detection," IEEE Trans. Pattern Anal. Mach. Intell. 26 (2004) 1408 (DOI: 10.1109/TPAMI.2004.97).
[3] M. Ranzato, et al.: "Unsupervised learning of invariant feature hierarchies with applications to object recognition," CVPR (2007) (DOI: 10.1109/CVPR.2007.383157).
[4] M. Sankaradas, et al.: "A massively parallel coprocessor for convolutional neural networks," ASAP (2009) 53 (DOI: 10.1109/ASAP.2009.25).
[5] S. Chakradhar, et al.: "A dynamically configurable coprocessor for convolutional neural networks," ACM SIGARCH Computer Architecture News 38 (2010) 247 (DOI: 10.1145/1815961.1815993).
[6] T. Chen, et al.: "A small-footprint high-throughput accelerator for ubiquitous machine-learning," SIGPLAN Not. 49 (2014) 269 (DOI: 10.1145/2541940.2541967).
[7] A. Krizhevsky, et al.: "ImageNet classification with deep convolutional neural networks," Advances in Neural Information Processing Systems 25 (2012) 1097.
[8] J. Cong and B. Xiao: "Minimizing computation in convolutional neural networks," Artificial Neural Networks and Machine Learning-ICANN (2014) 281.
[9] C. Zhang, et al.: "Optimizing FPGA-based accelerator design for deep convolutional neural networks," ACM/SIGDA FPGA (2015) 161 (DOI: 10.1145/2684746.2689060).
[10] N. Li, et al.: "A multistage dataflow implementation of a deep convolutional neural network based on FPGA for high-speed object recognition," SSIAI (2016) 165 (DOI: 10.1109/SSIAI.2016.7459201).
[11] H. Nakahara and T. Sasao: "A deep convolutional neural network based on nested residue number system," FPL (2015) 1 (DOI: 10.1109/FPL.2015.7293933).
[12] R. G. Shoup: "Parameterized convolution filtering in a field programmable gate array," Selected papers from the Oxford 1993 international workshop on field programmable logic and applications on More FPGAs (1994) 274.
[13] GitHub DeepLearnToolbox: https://github.com/rasmusbergpalm/DeepLearnToolbox/blob/master/tests/test_example_CNN.m
[14] Google Code Project Hosting: https://code.google.com/p/cuda-convnet/

1 Introduction

Convolutional neural network (CNN), a well-known machine learning architecture, has been widely adopted in applications including video surveillance, face and person detection, mobile robot vision and object recognition [1, 2, 3, 4]. Due to the specific computation pattern of CNN, general-purpose processors can hardly meet the implementation requirements, which has encouraged various hardware implementations based on FPGA, GPU and ASIC [5, 6, 7].

CNN contains numerous 2D convolutions, which are responsible for more than 90% of the whole computation [8]. How to implement the 2D convolution in CNN more efficiently is therefore an important issue, and many efforts have been made to address it [1, 4, 9, 10, 11]. Among these approaches, the architecture inspired by [12] and first introduced into CNN by [1] is the most commonly adopted. An instance with a 3×3 kernel is shown in Fig. 1, where W denotes the weights of the convolution kernel and L denotes the row length of the input image. This architecture performs 2D convolution by taking a window of the input image, multiplying each element by the corresponding kernel weight (O(n²) multiplications), and feeding the products into an adder tree (O(n²) additions).

[4] proposes a coprocessor coupled with high-bandwidth off-chip memory to hold the intermediate state, achieving a speed 31× faster than a software implementation. [9] improves the efficiency of FPGA-based CNN designs by quantitatively analyzing the relationship between computing throughput and memory bandwidth. [10] replaces the fully connected layers with global summation and proposes an implementation whose peak performance is 409.62 giga-operations per second (GOPS). [11] introduces the nested residue number system (NRNS) to replace the MAC unit with several LUTs to save resources, achieving a 5.86× improvement over the best existing realization. However, none of these implementations modified the basic architecture of the 2D convolution itself.



Fig. 1. Commonly used 2D convolution architecture with kernel size of 3×3
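As a behavioral reference, the following minimal NumPy sketch models the dataflow of Fig. 1 (our illustration, not the authors' hardware description): each output pixel is formed by multiplying an n×n window element-wise by the kernel and summing the products.

```python
import numpy as np

def conv2d_direct(image, kernel):
    """Behavioral model of Fig. 1: for every output pixel, an n x n
    window of the input is multiplied element-wise by the kernel
    (n^2 multipliers) and the products are summed (adder tree)."""
    n = kernel.shape[0]                          # square n x n kernel assumed
    rows = image.shape[0] - n + 1
    cols = image.shape[1] - n + 1
    out = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            window = image[r:r + n, c:c + n]
            out[r, c] = np.sum(window * kernel)  # n^2 MACs per output pixel
    return out
```

In the hardware of Fig. 1, the line buffers of length L supply one such window per clock cycle, so the n² multiply-accumulates run in parallel rather than in a software loop.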

To implement 2D convolution in CNN more efficiently, this paper proposes a novel technique, singular value decomposition approximation (SVDA). SVDA decomposes the 2D convolution into pairs of low-complexity 1D convolutions by applying low-rank approximation. Experimental results show that the proposed scheme achieves a reduction in hardware complexity in the range of 14.46% to 37.8%, with classification accuracy dropping by less than 1%.

The rest of this paper is organized as follows: Section 2 provides the background on CNN and singular value decomposition (SVD). Section 3 explains the proposed technique in detail. Section 4 describes the hardware architecture. Section 5 presents the implementation results. Section 6 concludes this paper.


2 Background

2.1 Convolutional neural networks (CNN)
Convolutional neural networks are biologically inspired hierarchical architectures that can be trained to perform various detection, classification and recognition tasks. A typical CNN consists of two components: a feature extractor and a classifier. The feature extractor filters the input images into feature maps that represent a variety of features of the image, such as corners, lines and edges, which are relatively invariant to position shifts or distortions. The output of the feature extractor is a low-dimensional vector composed of these features. This vector is then fed into the classifier, which is usually based on traditional artificial neural networks and decides the probability of each category that the input (e.g. an image) might belong to.

Fig. 2 shows a typical CNN for image classification, taken from the ImageNet work [7]. The feature extractor includes several convolutional layers and optional pooling layers (average pooling, max pooling, etc.). The classifier is composed of several fully connected layers for classification and recognition. The computation of a convolutional layer is shown in Eq. (1) and Eq. (2).



Fig. 2. Typical CNN for image classification

y^{(l-1)}_{i,j,k} is the output of layer (l-1) and y^{(l)}_{i,j,k} is the output of layer (l), where i, j and k denote the 3D coordinates of a node. w^{(l-1,f)}_{a,b,c} denotes a weight of the filter f applied at layer (l-1), where a, b and c denote the 3D coordinates of the weight. \phi(x^{(l)}_{i,j,k}) is the non-linear squashing function. The pooling layer subsamples the output of the convolutional layer.

x^{(l)}_{i,j,k} = \sum_{a}\sum_{b}\sum_{c} w^{(l-1,f)}_{a,b,c} \, y^{(l-1)}_{i+a,j+b,k+c} + bias^{(f)}    (1)

y^{(l)}_{i,j,k} = \phi(x^{(l)}_{i,j,k})    (2)

In fully connected layers, the nodes of the input layer and the output layer are fully connected by different weights w^{(l-1)}_{i,j}, as shown in Eq. (3) and Eq. (4). y^{(l-1)}_{j} is the output of layer (l-1) and y^{(l)}_{i} is the output of layer (l); \phi(x^{(l)}_{i}) is again the non-linear squashing function.

x^{(l)}_{i} = \sum_{j} w^{(l-1)}_{i,j} \, y^{(l-1)}_{j} + bias^{(l-1)}_{i}    (3)

y^{(l)}_{i} = \phi(x^{(l)}_{i})    (4)

2.2 Singular value decomposition (SVD)
Singular value decomposition (SVD) is a factorization of a real or complex matrix. Formally, the singular value decomposition of an m×n real or complex matrix M is a factorization of the form M = USV, where U is an m×m real or complex unitary matrix, S is an m×n rectangular diagonal matrix with non-negative real numbers on the diagonal, and V is an n×n real or complex unitary matrix. The diagonal entries σ_i of S are known as the singular values of M. The columns of U and the rows of V are called the left-singular and right-singular vectors of M, respectively. SVD can be employed for low-rank matrix approximation.
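A minimal NumPy illustration of this factorization and of rank-m truncation (note that numpy.linalg.svd returns V with the right-singular vectors as its rows, matching the convention above):

```python
import numpy as np

# Rank-m approximation of a matrix via SVD.
M = np.random.randn(5, 5)
u, s, vt = np.linalg.svd(M)      # row i of vt is the i-th right-singular vector

m = 2                            # number of singular values kept
M_approx = sum(s[i] * np.outer(u[:, i], vt[i]) for i in range(m))

# Approximation error in the Frobenius norm
print(np.linalg.norm(M - M_approx))
```

By the Eckart-Young theorem, keeping the m largest singular values gives the best rank-m approximation in the Frobenius norm, which is what SVDA exploits in the next section.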

3 Design techniques

3.1 Singular value decomposition approximation (SVDA)
SVDA transforms a 2D convolution into several 1D convolutions and applies a low-rank matrix approximation to reduce the computational complexity. An n×n kernel K can be written as

K = \sum_{i=1}^{n} \sigma_i u_i v_i,

where u_i and v_i are the ith column of U and the ith row of V respectively, and \sigma_i denotes the ith singular value. Applying low-rank matrix approximation to the kernel K means that only some of the largest singular values are kept and the others are set to zero. Assuming m singular values are kept, the 2D convolution between an input image I and the n×n kernel K can be transformed to

\sum_{i=1}^{m} ((I * u_i) * (\sigma_i v_i)).

Thus, the original 2D convolution is decomposed into m pairs of 1D convolutions. In terms of complexity, the original 2D convolution is O(n²) per output pixel, while the SVDA-transformed convolution is O(2mn). Therefore, the complexity is reduced whenever m < n/2. The choice of m is a tradeoff between complexity and precision.

To quantitatively analyze the precision and find the best m, a parameter α is defined to indicate the similarity between the approximate kernel and the exact kernel:

α = \sum_{i=1}^{m} \sigma_i^2 \Big/ \sum_{i=1}^{n} \sigma_i^2    (5)

For the kernel sizes usually used in CNN (3×3 to 9×9), extensive simulations are conducted to compute α under different m; the resulting values are listed in Table I.

Table I. The value of α for different kernel sizes

Kernel size   m=1      m=2      m=3      m=4      m=5
3×3           0.8872   0.9874   1        -        -
5×5           0.8380   0.9372   0.9823   0.9978   1
7×7           0.8153   0.9000   0.9510   0.9801   0.9943
9×9           0.8017   0.8745   0.9238   0.9572   0.9787

As shown in Table I, α increases with m. Based on these results, three approximate models are built to represent different degrees of similarity; they are shown in Table II.


Table II. Three approximate models

Kernel size   3×3           5×5    7×7    9×9
Model 1       m=1           m=1    m=1    m=2
Model 2       Exact value   m=2    m=2    m=3
Model 3       Exact value   m=2    m=3    m=4

To find out which model is the best option, the accuracies of the different models are tested with LeNET [13] on the MNIST dataset and with cudaconvnet [14] on the CIFAR10 dataset. The configurations of LeNET and cudaconvnet used here are shown in Table III, where Conv denotes a convolution layer and the number in brackets is the kernel size of the 2D convolutions in that layer. To conduct this experiment, all of the convolution kernels in the convolution layers are replaced by their approximate values according to the three corresponding models. The original and approximated CNN classification accuracies are shown in Table IV.


Table III. Details of LeNET and cudaconvnet

Model         Model architecture
LeNET         Conv1(9×9), Max pooling1(2×2), Conv2(5×5), Max pooling2(2×2), Fc1, Fc2, Softmax
cudaconvnet   Conv1(3×3), Max pooling1(3×3), Conv2(7×7), Ave pooling2(3×3), Conv3(5×5), Ave pooling3(3×3), Fc1, Fc2, Softmax

Table IV. Results of CNN classification accuracy

Dataset   Original accuracy   Model 1   Model 2   Model 3
MNIST     95.44%              83.51%    94.83%    95.06%
CIFAR10   86.78%              74.49%    85.80%    85.98%

As Table IV shows, model 1 suffers an unacceptable loss of accuracy. For models 2 and 3, however, the accuracy drops are acceptable, at less than 1%. Between the two, model 2 achieves the larger complexity reduction. Therefore, model 2 is the best option considering both accuracy and resources.

4 Hardware architecture

Based on SVDA, explained in Section 3.1, a 2D convolution is transformed into m pairs of 1D convolutions, namely a row convolution and a column convolution. The hardware architecture is illustrated in Fig. 3, where m is the number of retained singular values and n is the kernel size.


Fig. 3. Overall hardware architecture



The input image pixels and the output convolution results are both serial, and the design is fully pipelined. The row convolution is performed serially while the column convolution is performed in parallel; the transpose buffers cache the results of the row convolutions serially and output them to the column convolvers in parallel. The throughput and latency of the proposed design stay the same as those of the traditional design in Fig. 1 [1, 10, 11].
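A functional Python model of this dataflow (a sketch of Fig. 3 only, not the pipelined RTL) may help clarify the role of the transpose buffers:

```python
import numpy as np

def svda_pipeline_model(image, kernel, m):
    """Functional model of Fig. 3: each of the m branches performs a
    serial row convolution, passes through a transpose buffer, then a
    column convolution; the branch outputs are summed."""
    u, s, vt = np.linalg.svd(kernel)
    acc = None
    for i in range(m):
        # Serial row convolution with sigma_i * v_i along each image row.
        rows = np.array([np.convolve(r, s[i] * vt[i], mode='valid')
                         for r in image])
        # Transpose buffer: results arrive row by row but are read out
        # column by column by the column convolver.
        cols = rows.T
        # Column convolution with u_i, applied to each buffered column.
        branch = np.array([np.convolve(c, u[:, i], mode='valid')
                           for c in cols]).T
        acc = branch if acc is None else acc + branch
    return acc
```

Numerically, this matches the svda() sketch of Section 3.1; only the order in which intermediate values are produced differs, which is exactly what the transpose buffers manage in hardware.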

5 Implementation results

To demonstrate the effect of the proposed technique, several 2D convolvers with 5×5, 7×7 and 9×9 kernels (sizes commonly used in CNN) are designed. For each kernel, two designs are implemented: the original design and the design applying SVDA. The m for each kernel is chosen according to model 2, explained in Section 3.1. All implementations use fixed-point arithmetic, with 16 bits for image pixels and 8 bits for parameters. The designs are synthesized for a Xilinx Virtex-7 FPGA; for a fair comparison, the multipliers and adders are mapped to LUTs rather than DSPs. The synthesis results are shown in Fig. 4. For each 2D convolver in Fig. 4, the resources of the row convolvers, transpose buffers and column convolvers are all included. As shown in Fig. 4, the design applying SVDA achieves a 14.46% to 37.8% reduction in resources, depending on the kernel size. Additionally, the critical paths (and hence clock speeds) of the two designs are comparable for each kernel.
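The quantization step can be modeled as below; the word lengths come from this section, but the split between integer and fraction bits is our assumption, since the paper does not state it.

```python
import numpy as np

def to_fixed(x, frac_bits, word_bits):
    """Round to the nearest multiple of 2**-frac_bits and saturate
    to a signed word of word_bits bits."""
    scale = 2.0 ** frac_bits
    lo = -(2 ** (word_bits - 1))
    hi = 2 ** (word_bits - 1) - 1
    return np.clip(np.round(x * scale), lo, hi) / scale

# 16-bit pixels and 8-bit parameters as in Section 5; the fraction-bit
# counts (12 and 6 here) are illustrative assumptions.
pixels_q = to_fixed(np.random.rand(32, 32), frac_bits=12, word_bits=16)
params_q = to_fixed(np.random.randn(5, 5), frac_bits=6, word_bits=8)
```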

Fig. 4. Synthesis results comparison




6 Conclusion

This paper proposes an efficient technique, SVDA, for 2D convolution designs in CNN. SVDA transforms a 2D convolution into pairs of low-complexity 1D convolutions. Experimental results show that up to a 37.8% reduction in resources can be achieved by applying this technique, with the CNN classification accuracy dropping by less than 1%.

Acknowledgments

This work was jointly supported by the National Natural Science Foundation of China under Grant Nos. 61370040, 61006018, 61376075 and 61176024, the project on the Integration of Industry, Education and Research of Jiangsu Province (BY2015069-05, BY2015069-08), and a Project Funded by the Priority Academic Program Development of Jiangsu Higher Education Institutions.


