arXiv:1611.01972v1 [cs.CV] 7 Nov 2016
Fixed-point Factorized Networks

Peisong Wang
National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China
[email protected]

Jian Cheng
National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China
[email protected]
Abstract

In recent years, Deep Neural Network (DNN) based methods have achieved remarkable performance in a wide range of tasks and have been among the most powerful and widely used techniques in computer vision, speech recognition and natural language processing. However, DNN-based methods are both computation-intensive and resource-consuming, which hinders the application of these methods on embedded systems like smart phones. To alleviate this problem, we introduce novel Fixed-point Factorized Networks (FFN) built on pre-trained models to reduce the computational complexity as well as the storage requirement of networks. Extensive experiments on the large-scale ImageNet classification task show the effectiveness of our proposed method.

1. Introduction

Deep neural networks (DNNs) have been setting new state-of-the-art performance in various fields including computer vision, speech recognition and natural language processing. Convolutional neural networks (CNNs), in particular, have outperformed traditional machine learning algorithms on computer vision tasks such as image recognition, object detection, semantic segmentation, and gesture and action recognition. These breakthroughs come partially at the cost of added computational complexity and storage footprint, which makes these models very hard to train as well as to deploy. For example, AlexNet [10] involves 61M floating-point parameters and 729M high-precision multiply-accumulate operations (MACs). Current DNNs are usually trained offline on specialized hardware such as NVIDIA GPUs and CPU clusters, but such an amount of computation may be unaffordable for portable devices such as mobile phones, tablets and wearable devices, which usually have limited computing resources. What is more, the huge storage requirement and the large number of memory accesses may hinder efficient hardware implementations of neural networks, e.g., on FPGAs and neural-network-oriented chips.

There have been many studies on reducing the storage and the computational complexity of DNNs by quantizing the parameters of these models. Some of these works quantize the pretrained weights using several bits (usually 3∼12 bits) with a minimal loss of performance. However, with this kind of quantized network one still needs to perform a large number of multiply-accumulate operations. Other works focus on training networks from scratch with binary (+1 and -1) or ternary (+1, 0 and -1) weights. These methods do not rely on pretrained models and may reduce the computation at the training stage as well as the testing stage. On the other hand, they cannot make use of pretrained models very efficiently, due to the dramatic information loss incurred by the binary or ternary quantization of the weights. Moreover, all these methods are highly constrained by the direct quantization scheme; for example, they must use exactly the same number of quantized weights as full-precision weights in each layer of the original network. However, the redundancy of deep neural networks differs from layer to layer, so it is inefficient to use a fixed number of weights for the quantized networks.

In this paper, we propose a unified framework based on fixed-point factorization and full-precision recovery of weight matrices to simultaneously accelerate and compress DNN models with only minor performance degradation. It is nontrivial to utilize fixed-point factorization of weight matrices. One may argue for using a full-precision matrix decomposition such as SVD, followed by fixed-point quantization of the decomposed submatrices. However, this kind of method has two obvious shortcomings. Firstly, the matrix approximation is optimized for the full-precision submatrices, not for the fixed-point representation, which is our real purpose. Moreover, the weight distributions after decomposition are quite complicated, which makes fixed-point quantization more difficult. On the contrary, in this paper we propose to first factorize the weight matrix using a fixed-point (+1, 0 and -1) representation and then recover the (pseudo) full-precision submatrices. We also propose an effective and practical technique called weight balancing, which makes our fine-tuning much more stable. We demonstrate the effects of the fixed-point factorization, the full-precision weight recovery, the weight balancing and the whole-model performance on ImageNet classification.

The main contributions of this paper can be summarized as follows:

• We propose the FFN framework for DNN acceleration and compression, which is more flexible and accurate and, to our knowledge, is the first fixed-point decomposition based quantization method in this area.

• Based on fixed-point factorization, we propose a novel full-precision weight recovery method, which makes it possible to make full use of pretrained models even for very deep architectures.

• We investigate the weight imbalance problem that generally exists in matrix/tensor decomposition based DNN acceleration methods. Inspired by weight initialization methods, we present an effective weight balancing technique to stabilize the fine-tuning stage of DNN models.

2. Related Work

CNN acceleration and compression has been widely studied in recent years. We mainly list works that are closely related to ours, i.e., low-rank based methods and fixed-point quantization based methods.

Deep neural networks are usually over-parameterized, and the redundancy can be removed, as shown in the work of [3], which achieves 1.6× speed-up for a single layer using biclustering and low-rank approximation of the filter matrix. Since then, many low-rank based methods have been proposed. Jaderberg et al. [8] propose to use filter low-rank approximation and data reconstruction to lower the approximation error, achieving 2.5× speed-up for scene text character recognition with no loss in accuracy. Zhang et al. [19] propose a novel nonlinear data reconstruction method, which allows asymmetric reconstruction to prevent error accumulation across layers. They achieve 3.8× speed-up on the VGG-16 model with only a 0.3% increase of top-5 error on ImageNet [15] classification. Many works also explore low-rank tensor decomposition for neural network acceleration and compression. Lebedev et al. [11] propose a CP-decomposition of 4D convolutional kernels into multiple rank-one tensors, which achieves 8.5× speed-up for a character-classification CNN. Kim et al. [9] propose to use Tucker decomposition to speed up the execution of CNN models; they show large speed-ups and energy reduction on a smartphone equipped with a mobile GPU. Wang et al. [17] propose to accelerate the test-phase computation of CNNs based on low-rank and group sparse tensor decomposition, achieving 6.6× speed-up on the VGG-16 model within a 1% increase of top-5 error.

Fixed-point quantization based methods have also been investigated by several recent works. Soudry et al. develop the Expectation Backpropagation (EBP) [1] method, a variational Bayes method to binarize both weights and neurons, which achieves good results for fully connected networks on the MNIST dataset. In the work of BinaryConnect (BC) [2], the authors propose to use binary weights for forward and backward computation while keeping a full-precision version of the weights for gradient accumulation. Good results have been achieved on small datasets like MNIST, CIFAR-10 and SVHN. Binary-Weight-Network (BWN) [11] is proposed in a more recent work, which is among the first to evaluate the performance of binarization on large-scale datasets like ImageNet [15] and yields good results. These methods train neural networks from scratch and can barely benefit from pretrained networks. Hwang et al. [6] address this by first quantizing the pretrained weights to a small number of quantization levels, followed by retraining. However, their method relies on carefully choosing the quantization step size using exhaustive search, and its scalability to large-scale datasets remains unclear.

Besides the methods mentioned above, there have been other approaches to DNN acceleration. Han et al. [5] use network pruning to remove low-saliency parameters and small-weight connections to dramatically reduce the parameter size. Wu et al. [18] propose to use product quantization for CNN compression and acceleration at the same time. FFT-based methods [12] have also been investigated. Many works [14] train a smaller network that achieves comparable performance, which is another widely studied direction for DNN acceleration.
3. Approaches

Our method exploits weight matrix approximation for deep neural network acceleration and compression. Unlike many previous low-rank matrix decomposition methods, which use floating-point values for the factorized submatrices, our method aims at fixed-point factorization directly. To make more efficient use of the pre-trained weights, we also introduce a novel pseudo full-precision weight matrix recovery method in addition to the fixed-point approximation. The information of the pre-trained model is thus divided into two parts: the first part is the fixed-point factorized submatrices, and the second part resides in the pseudo full-precision weight matrices, which is transferred to the fixed-point weight matrices during the fine-tuning stage. Moreover, our weight balancing technique makes fine-tuning more efficient and plays an important role in the whole framework. We discuss these three parts of the proposed framework thoroughly in the following subsections.

3.1. Fixed-point Factorization of Weight Matrices

A general deep neural network usually has multiple fully connected layers and/or convolutional layers. For a fully connected layer, the output signal vector s_o is computed as

s_o = \phi(W s_i + b)    (1)

where s_i is the input signal vector and W and b are the weight matrix and the bias term respectively. For a convolutional layer with n filters of size w × h × c, where w, h and c are the kernel width, the kernel height and the number of input feature maps, if we reshape the kernel and the input volume at every spatial position, the feed-forward pass of the convolution can also be expressed by Equation (1). Thus our decomposition is conducted on the weight matrix W.

There is a trivial solution to fixed-point quantization combined with matrix decomposition, i.e., to first conduct a full-precision matrix decomposition such as SVD, followed by fixed-point quantization of the decomposed submatrices. However, this kind of method has two obvious shortcomings. Firstly, the matrix approximation is optimized for the full-precision submatrices, not for the fixed-point representation, which is our real purpose. Moreover, the weight distributions after decomposition are quite complicated, which makes fixed-point quantization more difficult. On the contrary, in this subsection we propose to directly factorize the weight matrices into a fixed-point format. More specifically, we want to approximate the weight matrix W ∈ R^{m×n} as a weighted sum of outer products of several (k) vector pairs with only ternary (+1, 0 and -1) entries, which is referred to as the semidiscrete decomposition (SDD):

W \approx \sum_{i=1}^{k} d_i x_i y_i^T = X D Y^T    (2)

where X ∈ {+1, 0, −1}^{m×k}, Y ∈ {+1, 0, −1}^{n×k} and D ∈ R^{k×k} is a diagonal matrix. Because of the ternary constraints, computing the SDD is an NP-hard problem. Kolda and O'Leary propose to obtain an approximate local solution by greedily finding the best next term d_i x_i y_i^T. To further reduce the approximation error of the decomposition, we develop an improved version of their algorithm, shown in Algorithm 1, which iteratively minimizes the residual error.

Algorithm 1: Improved SDD decomposition.
Input: matrix W ∈ R^{m×n}, non-negative integer k.
Output: X ∈ {+1, 0, −1}^{m×k}, Y ∈ {+1, 0, −1}^{n×k}, diagonal matrix D ∈ R_+^{k×k}.
Initialization: d_i ← 0 for i = 1, ..., k; select Y ∈ {−1, 0, 1}^{n×k}.
while not converged do
  for i = 1, ..., k
    1. R ← W − Σ_{j≠i} d_j x_j y_j^T
    2. Set y_i to the i-th column of Y
    3. while not converged
       (i)  compute x_i ∈ {−1, 0, 1}^m given y_i and R
       (ii) compute y_i ∈ {−1, 0, 1}^n given x_i and R
    4. end while
    5. Set d_i to the average of R ◦ x_i y_i^T over the non-zero locations of x_i y_i^T
    6. Set x_i as the i-th column of X, y_i as the i-th column of Y and d_i as the i-th diagonal entry of D
  end for
end while
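To make Algorithm 1 concrete, the following NumPy sketch is one plausible instantiation. The paper does not specify how the ternary subproblems in step 3 are solved or how convergence is tested, so this sketch borrows Kolda and O'Leary's sorting rule for the subproblems and uses fixed iteration counts; the function and parameter names are ours, not the paper's.

import numpy as np

def best_ternary(s):
    # Ternary subproblem (Kolda & O'Leary): keep the J largest-magnitude
    # entries of s (sign-matched), choosing J to maximize (sum |s_j|)^2 / J.
    order = np.argsort(-np.abs(s))
    prefix = np.cumsum(np.abs(s)[order])
    J = int(np.argmax(prefix ** 2 / np.arange(1, s.size + 1))) + 1
    x = np.zeros_like(s)
    x[order[:J]] = np.sign(s[order[:J]])
    return x

def improved_sdd(W, k, outer_iters=10, inner_iters=20, seed=0):
    # Approximate W (m x n) by X diag(d) Y^T with ternary X, Y (Algorithm 1).
    m, n = W.shape
    rng = np.random.default_rng(seed)
    X = np.zeros((m, k))
    Y = np.sign(rng.standard_normal((n, k)))      # initial ternary Y
    d = np.zeros(k)
    for _ in range(outer_iters):                  # outer "while not converged"
        for i in range(k):
            # Step 1: residual without the i-th rank-one term.
            R = W - (X * d) @ Y.T + d[i] * np.outer(X[:, i], Y[:, i])
            y = Y[:, i]                           # Step 2
            for _ in range(inner_iters):          # Step 3: alternate x <-> y
                x = best_ternary(R @ y)
                y = best_ternary(R.T @ x)
            # Step 5: average of R over the non-zero support of x y^T.
            mask = np.outer(x, y)
            nnz = np.count_nonzero(mask)
            d[i] = (R * mask).sum() / nnz if nnz else 0.0
            X[:, i], Y[:, i] = x, y               # Step 6
    return X, d, Y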
Once the decomposition is done, we can replace the original weights W with the factorized ones, i.e., X, Y and D. More formally, for convolutional layers, the original layer is replaced by three layers: the first is a convolutional layer with k filters of size w × h × c, all with ternary values; the second is a "channel-wise scaling layer", i.e., each of the k feature maps is multiplied by a scaling factor; the last is another convolutional layer with n filters of size 1 × 1 × k, which also have ternary values.
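As an illustration of this three-layer replacement, here is a hedged PyTorch sketch (PyTorch is our choice for illustration only; the paper works with Caffe models). It assumes that W was flattened so that its m = w·h·c rows follow the (channels, height, width) ordering and its n columns index the output filters; the actual reshape order must match however W was built.

import torch
import torch.nn as nn

class ChannelScale(nn.Module):
    # The "channel-wise scaling layer": multiply each of the k feature maps by d_i.
    def __init__(self, d):
        super().__init__()
        self.d = nn.Parameter(torch.as_tensor(d, dtype=torch.float32).reshape(1, -1, 1, 1))

    def forward(self, x):
        return x * self.d

def factorized_conv(X, d, Y, w, h, c, n, stride=1, padding=0):
    # X: (w*h*c, k) ternary, Y: (n, k) ternary, d: (k,) scaling factors.
    k = X.shape[1]
    conv1 = nn.Conv2d(c, k, (h, w), stride=stride, padding=padding, bias=False)  # ternary w x h x c filters
    conv2 = nn.Conv2d(k, n, 1, bias=False)                                       # ternary 1 x 1 x k filters
    with torch.no_grad():
        conv1.weight.copy_(torch.as_tensor(X.T, dtype=torch.float32).reshape(k, c, h, w))
        conv2.weight.copy_(torch.as_tensor(Y, dtype=torch.float32).reshape(n, k, 1, 1))
    return nn.Sequential(conv1, ChannelScale(d), conv2)

Note that for the fixed-point benefit to hold at inference time, conv1 and conv2 must keep exactly ternary weights; they are stored as floats here only for convenience.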
3.2. Full-precision Weight Recovery

Our fixed-point factorization method is much more accurate than direct binarization or ternarization and many other fixed-point quantization methods. Still, fine-tuning is needed to restore the precision of the DNN models. Like most current fixed-point quantization based acceleration methods, we want to use the quantized weights (X and Y in our case) during the forward and backward propagation while using full-precision weights for gradient accumulation. However, after factorization we have lost the full-precision weights, i.e., we can no longer use the original W for gradient accumulation. A simple solution is to use the floating-point versions of X and Y as the full-precision weights that accumulate gradients, but this is far from satisfactory.

In this subsection, we present our novel full-precision weight recovery method, which makes the fine-tuning stage much easier. Our motivation is simple: we want to recover full-precision versions of X and Y, denoted X̂ and Ŷ, which better approximate W, under the constraint that X̂ and Ŷ turn into X and Y respectively after fixed-point quantization. Our full-precision weight recovery can be viewed as an inversion of current fixed-point quantization methods: instead of quantizing each element of a full-precision weight matrix to the nearest fixed-point value, we are given the fixed-point weights and want to determine from which values they could have been quantized. We turn this into the following optimization problem:

\min_{\hat{X}, \hat{Y}} \; \|W - \hat{X} D \hat{Y}^T\|_F^2 \quad \text{s.t.} \; |\hat{X} - X| < 0.5, \; |\hat{Y} - Y| < 0.5    (3)

Here the two constraints ensure that X̂ and Ŷ are quantized to X and Y. The problem can be solved efficiently by an alternating method. Note that we also constrain X̂ and Ŷ to always lie between -1.5 and 1.5 to alleviate overfitting, and during fine-tuning we clip the weights accordingly as well.

During the fine-tuning stage, we quantize the full-precision weights X̂ and Ŷ according to

q(A_{ij}) = \begin{cases} +1 & 0.5 < A_{ij} < 1.5 \\ 0 & -0.5 \le A_{ij} \le 0.5 \\ -1 & -1.5 < A_{ij} < -0.5 \end{cases}    (4)

The quantized weights are used for the forward and backward computation, and the full-precision weights X̂ and Ŷ are used to accumulate gradients during back-propagation. Both X and Y will change during fine-tuning because of the updates of X̂ and Ŷ; for example, some elements of X and Y will turn from 0 to 1 and so on. We use the recovered full-precision weights rather than the floating-point versions of X and Y to accumulate gradients because, for example, an entry of 0.499 has a much better chance of moving from 0 to 1 than an entry of 0.001. This information resides in the full-precision weight matrices and is transferred to the quantized weights during fine-tuning. Note that once fine-tuning is done, we discard the full-precision weights and use only the quantized weights X and Y for prediction.
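The paper states only that problem (3) can be solved efficiently by an alternating method; the sketch below is one plausible instantiation that alternates projected-gradient updates of X̂ and Ŷ (clipping enforces both |X̂ − X| < 0.5 and the global [-1.5, 1.5] range), together with the quantizer of Equation (4) used during fine-tuning. Step sizes, iteration counts and the small margin are our assumptions, not values from the paper.

import numpy as np

def quantize(A):
    # Eq. (4): ternary quantization of recovered weights (assumed in [-1.5, 1.5]).
    return np.where(A > 0.5, 1.0, np.where(A < -0.5, -1.0, 0.0))

def recover_full_precision(W, X, Y, d, steps=200, margin=1e-3):
    # Alternating projected-gradient sketch for Eq. (3):
    #   min_{X_hat, Y_hat} ||W - X_hat D Y_hat^T||_F^2
    #   s.t. |X_hat - X| < 0.5, |Y_hat - Y| < 0.5, entries in [-1.5, 1.5].
    D = np.diag(d)
    X_hat = X.astype(float).copy()
    Y_hat = Y.astype(float).copy()

    def project(A_hat, A):
        lo = np.maximum(A - 0.5 + margin, -1.5)
        hi = np.minimum(A + 0.5 - margin, 1.5)
        return np.clip(A_hat, lo, hi)

    for _ in range(steps):
        # Update X_hat with Y_hat fixed.
        M = D @ Y_hat.T                                  # k x n
        G = -2.0 * (W - X_hat @ M) @ M.T                 # gradient w.r.t. X_hat
        step = 1.0 / (2.0 * np.linalg.norm(M @ M.T, 2) + 1e-12)
        X_hat = project(X_hat - step * G, X)
        # Update Y_hat with X_hat fixed.
        N = X_hat @ D                                    # m x k
        G = -2.0 * (W - N @ Y_hat.T).T @ N               # gradient w.r.t. Y_hat
        step = 1.0 / (2.0 * np.linalg.norm(N.T @ N, 2) + 1e-12)
        Y_hat = project(Y_hat - step * G, Y)
    return X_hat, Y_hat

During fine-tuning, the forward and backward passes would use quantize(X_hat) and quantize(Y_hat), while the gradient updates (and the clipping above) are applied to X_hat and Y_hat themselves.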
3.3. Weight Balancing

So far, we have presented our fixed-point decomposition and full-precision weight recovery methods to improve the test-phase efficiency of deep neural networks. However, there is still a problem to be considered, which we refer to as weight imbalance. Weight imbalance is a common problem of decomposition based methods, not one specific to our framework, and it is caused by the non-uniqueness of the decomposition. Suppose a weight matrix W is factorized into the product of two matrices P and Q, i.e., W = P Q. Consider the partial derivatives of W with respect to P and Q:

\frac{\partial W}{\partial P} = Q, \qquad \frac{\partial W}{\partial Q} = P    (5)

If we let P' = 10 P and Q' = 0.1 Q, the decomposition becomes W = P' Q' and the partial derivatives become

\frac{\partial W}{\partial P'} = Q' = 0.1\, Q = 0.1\, \frac{\partial W}{\partial P}, \qquad \frac{\partial W}{\partial Q'} = P' = 10\, P = 10\, \frac{\partial W}{\partial Q}    (6)

From Equation (6) we can see that P is enlarged 10 times while its gradient is reduced to one-tenth of the original, and the opposite happens to Q. The consequence is that during back-propagation Q changes frequently while P stays almost untouched. One then has to use a different learning rate for each layer, which is quite a hard job, especially for very deep neural networks.

In our framework, the weight matrix W ∈ R^{m×n} is replaced by X̂ D Ŷ^T, where X̂ ∈ R^{m×k} and Ŷ ∈ R^{n×k} lie in the range [-1.5, 1.5] while D is at a scale of about 0.00001 to 0.01. Moreover, for convolutional layers, X̂ usually has many more elements than Ŷ because of the w × h spatial size of the filters in X̂. To balance the weights into appropriate scales, and inspired by the normalized weight initialization method proposed in [4], we develop the following weight balancing approach. First, we determine scale factors λ_X and λ_Y for X̂ and Ŷ based on the square roots of the numbers of their rows and columns. Second, we make the balanced D close to the identity matrix by setting the mean value of its diagonal elements to one; since D is an element-wise scaling factor for fully-connected layers and a channel-wise scaling factor for convolutional layers, making D close to one will not affect the calculation of gradients much. This can be expressed by Equation (7), where X̃, Ỹ and D̃ denote the balanced weight matrices:

\tilde{X} = \lambda_X \hat{X} = \frac{\lambda}{\sqrt{m+k}} \hat{X}, \quad \tilde{Y} = \lambda_Y \hat{Y} = \frac{\lambda}{\sqrt{n+k}} \hat{Y}, \quad \tilde{D} = \frac{D}{\lambda_X \lambda_Y}, \quad \mathrm{mean}(\tilde{D}) = 1    (7)
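A direct transcription of Equation (7) might look as follows; the only free quantity is the global factor λ, which is fixed by requiring the diagonal of the balanced D̃ to have mean one, and the factorized product is left unchanged. The function name is ours.

import numpy as np

def balance_weights(X_hat, Y_hat, d):
    # Weight balancing of Eq. (7): scale X_hat by lambda/sqrt(m+k),
    # Y_hat by lambda/sqrt(n+k), and D by 1/(lambda_X * lambda_Y),
    # choosing lambda so that mean(diag(D_tilde)) = 1.
    # X_hat @ diag(d) @ Y_hat.T is unchanged; d is assumed positive.
    m, k = X_hat.shape
    n, _ = Y_hat.shape
    # mean(diag(D_tilde)) = 1  =>  lambda^2 = mean(d) * sqrt((m+k)*(n+k))
    lam = np.sqrt(np.mean(d) * np.sqrt((m + k) * (n + k)))
    lam_x = lam / np.sqrt(m + k)
    lam_y = lam / np.sqrt(n + k)
    return lam_x * X_hat, lam_y * Y_hat, d / (lam_x * lam_y)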
4. Experiments

In this section, we comprehensively evaluate our method on the ILSVRC-12 image classification benchmark, which has 1.2M training examples and 50k validation examples. We report single-crop evaluation results using top-1 and top-5 accuracy. Experiments are conducted on two of the most widely used CNN models, i.e., AlexNet and VGG-16. Both models are downloaded from Berkeley's Caffe model zoo without any change and are used as baselines for comparison.

4.1. AlexNet

AlexNet was the first CNN architecture shown to be successful on the ImageNet classification task. This network has 61M parameters and more than 59% of them reside in the fully-connected layers, so we choose relatively smaller k's for the fully-connected layers. Specifically, for the convolutional layers with 4-D weights of size w × h × c × n, we choose k = min(w·h·c, n). We set k = 2048, 3072, 1000 for the last three fully-connected layers. The resulting architecture has 60M parameters. At the fine-tuning stage, images are resized to 256 × 256 pixels, the same as for the original AlexNet.

We also compare our method with the following approaches, whose results on the ImageNet dataset are available to us. The BWN method only reports results on AlexNet with batch normalization [7], so in order to compare with their results, we also report our results using batch normalization with the same settings as in BWN.

• BWN [11]: Binary-Weight-Network, using binary weights and floating-point scaling factors;

• BC [2]: BinaryConnect, using binary weights, reproduced by [11];

• LDR [13]: Logarithmic Data Representation, with 4-bit logarithmic activations and 5-bit logarithmic weights.

The results are listed in Table 1. The suffix BN indicates that the method uses batch normalization [7]. From the results, we can see that without batch normalization, our method only has a 1.4% decrease in top-5 accuracy. When batch normalization is incorporated, our method outperforms the best previous result by a large margin of 2.2% top-5 accuracy.

Table 1. Results on AlexNet. A dash indicates that the result was not reported.

Model            Top-1 Acc. (%)   Top-5 Acc. (%)
AlexNet [10]     57.1             80.2
BC-BN [2]        35.4             61.0
BWN-BN [11]      56.8             79.4
LDR [13]         -                75.1
FFN              55.3             78.8
FFN-BN           59.1             81.6

4.2. VGG-16

VGG-16 [16] uses a much wider and deeper structure than AlexNet, with 13 convolutional layers and 3 fully-connected layers; it won second place in ILSVRC 2014. We use the same rules to choose the k's and set k = 3138, 3072, 1000 for the last three fully-connected layers, resulting in the same number of parameters as the original VGG-16 model. During fine-tuning, we resize images to have 256 pixels on their smaller side. From Table 2 we can see that, after acceleration and compression, our method even outperforms the original VGG-16 model by 0.1% top-5 accuracy.

Table 2. Results on VGG-16. A dash indicates that the result was not reported.

Model          Top-1 Acc. (%)   Top-5 Acc. (%)
VGG-16 [16]    71.1             89.9
LDR [13]       -                89.0
FFN            70.8             90.1
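For reference, the rank-selection rule used above (k = min(w·h·c, n) for a convolutional layer) and the resulting per-layer parameter counts of the factorized structure can be tabulated with a small helper. This is plain bookkeeping derived from the layer shapes described in Sections 3.1 and 4.1, not code from the paper.

def factorized_layer_stats(w, h, c, n):
    # Rank rule from Sec. 4.1 for conv layers: k = min(w*h*c, n).
    k = min(w * h * c, n)
    return {
        "k": k,
        "original_params": w * h * c * n,           # full-precision weights of W
        "ternary_params": k * (w * h * c) + k * n,  # entries of X and Y (ternary)
        "scaling_params": k,                        # diagonal of D (floating point)
    }

# Example: a 3x3 conv layer of VGG-16 with 256 input and 256 output channels gives k = 256.
print(factorized_layer_stats(3, 3, 256, 256))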
5. Conclusion

We have introduced a novel fixed-point factorization method for deep neural network acceleration and compression. To make full use of pre-trained models, we propose a novel full-precision weight recovery method, which makes fine-tuning more efficient and effective. Moreover, we present a weight balancing technique that eases fine-tuning by making the training stage more stable. Extensive experiments on AlexNet and VGG-16 show the effectiveness of our method.
References
[1] Z. Cheng, D. Soudry, Z. Mao, and Z. Lan. Training binary multilayer neural networks for image classification using expectation backpropagation. arXiv preprint arXiv:1503.03562, 2015.
[2] M. Courbariaux, Y. Bengio, and J.-P. David. BinaryConnect: Training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems, pages 3123–3131, 2015.
[3] M. Denil, B. Shakibi, L. Dinh, N. de Freitas, et al. Predicting parameters in deep learning. In Advances in Neural Information Processing Systems, pages 2148–2156, 2013.
[4] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. Journal of Machine Learning Research, 9:249–256, 2010.
[5] S. Han, J. Pool, J. Tran, and W. Dally. Learning both weights and connections for efficient neural networks. In Advances in Neural Information Processing Systems, pages 1135–1143, 2015.
[6] K. Hwang and W. Sung. Fixed-point feedforward deep neural network design using weights +1, 0, and −1. In 2014 IEEE Workshop on Signal Processing Systems (SiPS), pages 1–6. IEEE, 2014.
[7] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[8] M. Jaderberg, A. Vedaldi, and A. Zisserman. Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866, 2014.
[9] Y.-D. Kim, E. Park, S. Yoo, T. Choi, L. Yang, and D. Shin. Compression of deep convolutional neural networks for fast and low power mobile applications. arXiv preprint arXiv:1511.06530, 2015.
[10] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[11] V. Lebedev, Y. Ganin, M. Rakhuba, I. Oseledets, and V. Lempitsky. Speeding-up convolutional neural networks using fine-tuned CP-decomposition. arXiv preprint arXiv:1412.6553, 2014.
[12] M. Mathieu, M. Henaff, and Y. LeCun. Fast training of convolutional networks through FFTs. arXiv preprint arXiv:1312.5851, 2013.
[13] D. Miyashita, E. H. Lee, and B. Murmann. Convolutional neural networks using logarithmic data representation. arXiv preprint arXiv:1603.01025, 2016.
[14] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio. FitNets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550, 2014.
[15] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
[16] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[17] P. Wang and J. Cheng. Accelerating convolutional neural networks for mobile applications. In Proceedings of the 2016 ACM on Multimedia Conference, pages 541–545. ACM, 2016.
[18] J. Wu, C. Leng, Y. Wang, Q. Hu, and J. Cheng. Quantized convolutional neural networks for mobile devices. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[19] X. Zhang, J. Zou, K. He, and J. Sun. Accelerating very deep convolutional networks for classification and detection. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2015.