
LTNN: A Layer-wise Tensorized Compression of Multilayer Neural Network

Hantao Huang, Student Member, IEEE and Hao Yu, Senior Member, IEEE

Manuscript received March 24, 2017; revised December 05, 2017 and May 11, 2018; accepted August 31, 2018. H. Huang is with the School of Electrical and Electronic Engineering, Nanyang Technological University (NTU), Singapore. H. Yu is with the Southern University of Science and Technology, China ([email protected]). Copyright (c) 2018 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending an email to [email protected].

Abstract—Efficient deep learning requires a memory-efficient construction of the neural network. This paper introduces a layer-wise tensorized formulation of a multilayer neural network, called LTNN, such that the weight matrices can be significantly compressed during training. By reshaping the multilayer neural network weight matrices into high-dimensional tensors with a low-rank approximation, significant network compression can be achieved with maintained accuracy. A corresponding layer-wise training scheme is developed based on a modified alternating least-squares (MALS) method, with backward propagation (BP) used for fine-tuning only. LTNN provides state-of-the-art results on various benchmarks with significant compression. For the MNIST benchmark, LTNN shows a 64× compression rate without accuracy drop. For the ImageNet12 benchmark, the proposed LTNN achieves 35.84× compression of the neural network with around 2% accuracy drop. We also show 1.615× faster inference than the existing work due to the smaller tensor-core ranks.

I. INTRODUCTION

The trend of utilizing deeper neural networks for machine learning has introduced a grand challenge of high-throughput yet energy-efficient hardware accelerators [1], [2]. For example, the deep neural network in [3] has billions of parameters, which requires high computational complexity and large memory usage. From the computing hardware architecture perspective, a reduction of memory access is greatly preferred to improve energy efficiency, since the power consumption of data-driven applications is dominated by memory access. For example, under 45-nm CMOS technology, a 32-bit floating-point addition consumes only 0.9 pJ, whereas a 32-bit SRAM cache access consumes 5.56× more energy and a 32-bit DRAM memory access consumes 711.11× more energy [4]. Therefore, a neural network simplified by compression is greatly needed for energy-efficient hardware applications. Many neural network compression algorithms have been proposed recently, such as connection pruning, weight sharing and quantization [5], [6]. The work in [7] further applied low-rank approximation directly to the weight matrix after training. However, direct approximation of the neural network weights can reduce complexity but cannot maintain accuracy, especially when the simplification is performed on the neural network obtained after training. In contrast, many recent works [8], [9] have found that accuracy can be maintained when certain constraints (binarization, sparsity, etc.) are applied during training. In addition, [10] shows that using the tensor-train data format during the training process can achieve a good compression rate with maintained accuracy.

In this paper, we propose a layer-wise tensorized compression algorithm for multilayer neural networks (LTNN). By reshaping neural network weight matrices into high-dimensional tensors with a low-rank tensor approximation during training, significant neural network compression can be achieved with maintained accuracy. A corresponding layer-wise training algorithm is further developed for multilayer neural networks using a modified alternating least-squares (MALS) method. LTNN differs from the existing work [10] by adopting layer-wise training to optimize the tensor-train ranks. A smaller tensor-train rank achieves faster training and inference speed (data per second) without harming accuracy. A tensor-train quantization method is also proposed to achieve a higher compression rate with maintained accuracy. LTNN provides state-of-the-art results on various benchmarks with significant compression in our numerical experiments. For the MNIST benchmark, LTNN shows a 64× compression rate without accuracy drop. For the CIFAR-10 benchmark, LTNN achieves a 21.57× compression rate for the fully-connected layers with a 2.2% accuracy drop. For the ImageNet12 benchmark, LTNN achieves a 13.95% top-5 error rate with 35.84× compression of the neural network. For the inference of tensorized neural networks, the computation speed of our proposed method is 1.615× faster than the existing work [10] due to the smaller tensor-core ranks.

The rest of this paper is organized as follows. Related works are discussed in Section II. The shallow and deep tensorized neural networks are discussed in Section III together with the detailed layer-wise training method. The learning algorithm for LTNN and the network interpretation are elaborated in Section IV. Section V reports the detailed LTNN performance on various benchmarks, indicating a significant neural network compression rate. Finally, application results on object recognition and human-action recognition are presented in Section VI, with conclusions drawn in Section VII.

II. RELATED WORKS

A. Neural Network Pre-training

Deep learning aims to learn hierarchical features, which can be further used for classification or detection.

Although deep learning methods perform significantly better than comparable shallow neural networks, there are still challenges in training many layers of parameters. In principle, the objective function is highly non-convex and potentially has many local minima in the model parameter space [11]. As such, a backward-propagation-based training algorithm can easily get stuck in local minima, and not all local minima provide equal performance on the validation dataset. [12] adopted the alternating direction method of multipliers (ADMM) to train neural networks without gradient descent steps. Another, faster method to tackle this challenge is to adopt model pre-training, which can provide an efficient initialization of the neural network parameters for better generalization. Many unsupervised learning methods have been proposed for this purpose. [13] proposed a sparse-representation-based pre-training method to perform the initialization, [14] proposed to use shallow neural network weights for deep neural network initialization, and [15] proposed a layer-wise training based on auto-encoders. In this paper, we develop an auto-encoder-based layer-wise training in the tensor-train data format, which not only provides a good initialization but also minimizes the model size.

B. Neural Network Model Compression

Recently, neural network compression has gained much attention, motivated by mapping deep neural networks to resource-limited hardware. HashedNets [16] applies a hashing function to efficiently implement parameter sharing on the fully-connected layer with a good compression rate, but it is inefficient for large-scale networks due to collisions. Later, connection pruning [4] was proposed to compress a neural network by pruning unimportant connections, which requires recursive retraining to maintain accuracy. Quantized neural networks with low-precision number representations have further been proposed with or without retraining [17], [8], [18]. We focus on quantization without retraining, which can utilize a well-developed neural network. We develop a non-uniform weight quantization optimization method on tensor cores for a simplified numeric representation of the weights under controlled accuracy. In principle, our method is based on low-rank approximation of weight matrices [19], [10]. For example, a low-rank decomposition of the neural network weight matrix is

W = A × B    (1)

where W ∈ R^{m×n}, A ∈ R^{m×r} and B ∈ R^{r×n}. As such, one can effectively compress the weight matrix from mn to mr + nr parameters given a small rank r. A tensor decomposition is a generalized low-rank matrix decomposition. By representing dense data (a matrix) in a high-dimensional space (a tensor) with a low rank, an even higher network compression rate can be achieved [10].
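As a concrete illustration of (1), the sketch below compresses a dense weight matrix with a truncated SVD and reports the parameter saving. It is a minimal NumPy example of the generic low-rank idea discussed above, not the authors' exact procedure; the matrix sizes and the rank are arbitrary values chosen for illustration.

import numpy as np

def low_rank_factorize(W, r):
    """Approximate W (m x n) as A @ B with A (m x r), B (r x n) via truncated SVD."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :r] * S[:r]          # absorb the singular values into A
    B = Vt[:r, :]
    return A, B

# Illustrative sizes only (not taken from the paper).
m, n, r = 1024, 512, 20
W = np.random.randn(m, n)
A, B = low_rank_factorize(W, r)

original_params = m * n                  # mn
compressed_params = m * r + r * n        # mr + nr
print("compression rate: %.1fx" % (original_params / compressed_params))
print("relative approximation error:", np.linalg.norm(W - A @ B) / np.linalg.norm(W))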

However, the work in [10] requires a complicated training of the multilayer network using tensorized backward propagation (BP), which becomes slow to train even with new frameworks such as TensorFlow. Moreover, the initialization of the tensor-train based neural network weights in [10] is random, and the rank of the tensor-train data format is set intuitively. Our method, in contrast, is based on the modified alternating least-squares (MALS [20]) method, which not only minimizes the rank of the tensor cores but also learns weights that represent the input features in the tensor-train data format. The inference speed is also improved: the computation speed of our proposed method is 1.615× faster than the existing work [10] due to the smaller tensor-core ranks (246.15 data/s vs. 152.38 data/s). Therefore, our training method is fundamentally different from the training method in [10]. Moreover, a left-to-right sweep on the tensor cores is further applied to efficiently reduce the ranks, leading to a higher neural network compression.

The contribution of this paper is twofold. Firstly, this paper provides a new shallow neural network training method, which can achieve the same performance with a reduced model size by adopting the tensor-train data format. Secondly, this paper proposes a new deep neural network training strategy using a layer-wise least-squares based training method in the tensor-train format. This training method not only provides a good initialization of the tensor cores to speed up the BP process but also reduces the model size. This paper also suggests a new method to compress existing neural networks. Experiments on ImageNet12 show that 33.63% top-1 and 13.15% top-5 error rates can be achieved with 8.8× compression. If we further consider neural network weight quantization, we can improve the top-5 error rate to 13.95% and the compression rate to 35.84×.

III. TENSORIZED NEURAL NETWORKS

A. Tensor-train Decomposition and Compression

Tensors are the natural multi-dimensional generalization of matrices. Here, we refer to one-dimensional data as vectors, denoted by lowercase boldface letters v. Two-dimensional arrays are matrices, denoted by uppercase boldface letters V, and higher-dimensional arrays are tensors, denoted by uppercase boldface calligraphic letters V. To refer to one specific element of a tensor, we use V(i) = V(i_1, i_2, ..., i_d), where d is the dimensionality of the tensor V and i is the index vector. A summary of notations and descriptions is shown in Table I for clarification.

A d-dimensional n_1 × n_2 × ... × n_d tensor V is decomposed into the tensor-train data format if each tensor core G_k is of size r_{k-1} × n_k × r_k and each element is defined [21] as

V(i_1, i_2, ..., i_d) = \sum_{\alpha_0, \alpha_1, ..., \alpha_d}^{r_0, r_1, ..., r_d} G_1(\alpha_0, i_1, \alpha_1) G_2(\alpha_1, i_2, \alpha_2) ... G_d(\alpha_{d-1}, i_d, \alpha_d)    (2)

where \alpha_k is the summation index, which starts from 1 and stops at the rank r_k; r_0 = r_d = 1 is the boundary condition; and n_1, n_2, ..., n_d are known as the mode sizes. Here, r_k is the core rank and G_k is the k-th core of this tensor decomposition. By using the notation G_k(i_k) ∈ R^{r_{k-1} × r_k}, we can rewrite the above equation in a more compact way:

V(i_1, i_2, ..., i_d) = G_1(i_1) G_2(i_2) ... G_d(i_d)    (3)

where G_k(i_k) ∈ R^{r_{k-1} × r_k} is a slice of the 3-dimensional tensor core G_k.
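The compact form (3) can be evaluated directly from the cores without ever forming the full tensor. The following NumPy sketch builds random tensor-train cores, reconstructs one element by the chain of matrix products in (3), and compares the storage of the cores against the dense tensor; the mode sizes and ranks are illustrative assumptions, not values from the paper.

import numpy as np

def tt_element(cores, idx):
    """Evaluate V(i1,...,id) = G1(i1) G2(i2) ... Gd(id) from tensor-train cores.
    cores[k] has shape (r_{k-1}, n_k, r_k); idx is the index tuple (i1,...,id)."""
    out = np.ones((1, 1))
    for G, i in zip(cores, idx):
        out = out @ G[:, i, :]          # multiply the selected r_{k-1} x r_k slice
    return out[0, 0]                    # r_0 = r_d = 1, so the result is a scalar

# Illustrative mode sizes and ranks (assumed for the example).
modes = [4, 8, 8, 4]
ranks = [1, 3, 3, 3, 1]                 # r_0 ... r_d with r_0 = r_d = 1
cores = [np.random.randn(ranks[k], modes[k], ranks[k + 1]) for k in range(len(modes))]

dense_params = int(np.prod(modes))      # n1*n2*...*nd storage
tt_params = sum(c.size for c in cores)  # sum_k n_k r_{k-1} r_k
print("dense:", dense_params, "tensor-train:", tt_params)
print("V(1,2,3,0) =", tt_element(cores, (1, 2, 3, 0)))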

Fig. 1: Neural network weight tensorization and representation in the tensor-train data format for parameter compression (from n^d to dnr^2 parameters).

Fig. 1 shows the general idea of a tensorized neural network. A two-dimensional weight matrix is folded into a higher-dimensional tensor, which is then decomposed into tensor cores G_1, G_2, ..., G_d. These tensor cores are relatively small 3-dimensional arrays due to the small rank r, resulting in a high neural network compression rate. Such a representation is memory-efficient for storing high-dimensional data. For example, a d-dimensional tensor with mode size n requires n_1 × n_2 × ... × n_d = n^d parameters, whereas the tensor-train format takes only \sum_{k=1}^{d} n_k r_{k-1} r_k parameters. So if we manage to reduce the rank of each core, we can efficiently represent the data with a high compression rate and store the cores distributively.

TABLE I: Symbol notations and detailed descriptions.
Notation | Description
V ∈ R^{n_1 × n_2 × ... × n_d} | d-dimensional tensor of size n_1 × n_2 ... n_d
G_i ∈ R^{r_{i-1} × n_i × r_i} | tensor cores of the tensor-train data format
W_1, W_2, ..., W_{d-1} | neural network weights of d layers
B_1, B_2, ..., B_{d-1} | neural network biases of d layers
H_1, H_2, ..., H_{d-1} | activation matrices of d layers
T, Y | labels and neural network output
U, S, V | SVD decomposition matrices
X | input features
r_0, r_1, ..., r_d | ranks of tensor cores
i_1, i_2, ..., i_d | index vector referring to a tensor element
n_1, n_2, ..., n_d | mode sizes of tensor V ∈ R^{n_1 × n_2 × ... × n_d}
p(1|y), ..., p(M|y) | predicted probability of each class
n_m | maximum mode size of n_1, n_2, ..., n_d
N_t | number of training samples
N | number of input features
M | number of classes

B. Shallow Tensorized Neural Network

We first start with a single-hidden-layer feedforward neural network [22] and later extend it to a multilayer neural network [22], [15], [23]. We refer to a one-hidden-layer neural network as a shallow neural network and to networks with more hidden layers as deep neural networks. Furthermore, we define a layer-wise tensorized neural network (LTNN) as a network whose weights can be represented in the tensor-train data format. For example, a two-dimensional weight W ∈ R^{L×N} can be reshaped into a (k_1 + k_2)-dimensional tensor by factorizing L = \prod_{d=1}^{k_1} l_d and N = \prod_{d=1}^{k_2} n_d, and such a tensor can be further decomposed into the tensor-train data format.

As shown in Fig. 2, we can train a single-hidden-layer neural network based on data features X and training labels T with N_t training samples, N-dimensional input features and M classes. During training, one needs to minimize the error function over the weights W and biases B:

E = ||T − f(W, B, X)||_2^2    (4)

Fig. 2: Layer-wise training of a neural network with a least-squares solver.

where f(·) is the trained model that performs the predictions from the input. Here, we first discuss a general machine learning algorithm using least-squares learning without tensor-train based weights, which is mainly inspired by [24]. Then we show how to train a tensorized neural network. We first build the relationship between the hidden neural nodes and the input features as

preH = XW_1 + B_1,   H_1 = 1 / (1 + e^{−preH})    (5)

where W_1 ∈ R^{N×L} and B_1 ∈ R^{N_t×L} are the randomly generated input weight and bias formed by w_{ij} and b_{ij} drawn from [−1, 1].


The training process is designed to find W_2 that minimizes

arg min_{W_2} ||H_1 W_2 − T||_2^2 + λ ||W_2||_2^2    (6)

where H_1 is the hidden-layer output matrix generated by the Sigmoid activation function, and λ is a user-defined parameter that balances the training error and the output weight norm. The output weight W_2 is computed by solving the regularized least-squares problem

W_2 = (H̃_1^T H̃_1)^{−1} H̃_1^T T̃,   H̃_1 = [H_1; √λ I],   T̃ = [T; 0]    (7)

where T̃ ∈ R^{(N_t+L)×M}, M is the number of classes, I ∈ R^{L×L} is the identity matrix and H̃_1 ∈ R^{(N_t+L)×L}. The neural network output is then

Y = f(W, B, X_t),   p(i|y_i) ≈ y_i,   y_i ∈ Y    (8)

where X_t is the testing data and i represents the class index, i ∈ [1, M]. We approximate the prediction probability of each class by the output of the neural network.
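The layer training in (5)-(7) amounts to a random feature map followed by ridge regression. A minimal NumPy sketch of this step (dense weights, no tensor-train yet) is shown below; the sizes, λ and the data are placeholders, and the bias is applied as a broadcast row rather than the full N_t × L matrix for simplicity.

import numpy as np

def train_shallow_layer(X, T, L=256, lam=1e-2, rng=np.random.default_rng(0)):
    """Random input weights + sigmoid hidden layer, then ridge least-squares
    for the output weight, following (5)-(7) in normal-equation form."""
    Nt, N = X.shape
    W1 = rng.uniform(-1.0, 1.0, size=(N, L))       # random input weight
    B1 = rng.uniform(-1.0, 1.0, size=(1, L))       # random bias (broadcast over samples)
    H1 = 1.0 / (1.0 + np.exp(-(X @ W1 + B1)))      # eq. (5)
    # eq. (6)-(7): W2 = (H1^T H1 + lam I)^-1 H1^T T, equivalent to the stacked form.
    W2 = np.linalg.solve(H1.T @ H1 + lam * np.eye(L), H1.T @ T)
    return W1, B1, W2

# Toy data: 100 samples, 20 features, 3 one-hot classes (illustrative only).
X = np.random.randn(100, 20)
T = np.eye(3)[np.random.randint(0, 3, size=100)]
W1, B1, W2 = train_shallow_layer(X, T)
print("output weight shape:", W2.shape)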

For LTNN, we need to tensorize the input weight W_1 and the output weight W_2. Since the input weight W_1 is randomly generated, we can also randomly generate the tensor cores G and then create the tensorized input weight W_1 based on (3). For the output weight W_2, a least-squares problem has to be solved in the tensor-train data format. This is solved using the modified alternating least-squares method and will be discussed in Section IV-A. After tensorization, the neural network processing is a direct application of tensor-train matrix-by-vector operations. The tensorized weight W_1 ∈ R^{n_1 l_1 × n_2 l_2 × ... × n_d l_d} is equivalent to W_1 ∈ R^{N×L}, where n_k and l_k follow N = \prod_{k=1}^{d} n_k and L = \prod_{k=1}^{d} l_k. The neural network forward pass computation in the tensor-train data format is

H_1(i) = \sum_{j_1, j_2, ..., j_d}^{l_1, l_2, ..., l_d} X(j) G_1[i_1, j_1] G_2[i_2, j_2] ... G_d[i_d, j_d] + B_1(i)    (9)

where i = i_1, i_2, ..., i_d with i_k ∈ [1, n_k], j = j_1, j_2, ..., j_d with j_k ∈ [1, l_k], and G_k[i_k, j_k] ∈ R^{r_{k-1} × r_k} is a slice of the tensor core. This is a d-fold summation with the summation index j_k running from 1 to l_k. We use the pair [i_k, j_k] to refer to the index of the mode n_k l_k. The complexity of this tensor-train matrix-by-vector multiplication is O(d r^2 n_m max(N, L)), where r is the maximum rank of the cores G_k and n_m is the maximum mode size of the tensor W_1. This is illustrated in Fig. 1, where a two-dimensional weight is folded into a higher-dimensional tensor and then decomposed into tensor cores G_1, G_2, ..., G_d; the pair [i_k, j_k] refers to a slice of the tensor core G_k. This can be very efficient if the rank r is small compared to a general matrix-vector multiplication. Note that, in order to perform an efficient tensor-train matrix-by-vector multiplication, the full matrix should not be explicitly reconstructed from the tensor-train data format: the explicit reconstruction involves an additional load of O(d r^3 N L).

Algorithm 1 summarizes the whole training process of a single-hidden-layer tensorized neural network, which also provides the extension to deep neural networks. Step 1 determines the mode sizes of the d-dimensional tensor W_1. Then, based on the mode size of each dimension, the weight tensor cores are randomly generated. Tensor-train matrix-by-matrix multiplication and the modified alternating least-squares method are performed to find the output weight W_2. Please note that the tensor W_2 can be further compressed by a left-to-right sweep using the truncated singular value decomposition (SVD) method. More details will be discussed in Section IV.

Algorithm 1 Single-hidden-layer training of LTNN using the modified alternating least-squares method
Input: Input set (X ∈ R^{N_t×N}, T ∈ R^{N_t×M}), where X is the input data and T is the desired output (labels, etc.) depending on the layer architecture; activation function G(a_i, b_i, x_j); number of hidden neurons L; and accepted training accuracy Acc.
Output: Neural network output weight W_2
1: Tensorization: factorize N = \prod_{k=1}^{d} n_k and L = \prod_{k=1}^{d} l_k, where N and L are the dimensions of the input weight W_1 ∈ R^{N×L}. The tensorized input weight is W_1 ∈ R^{n_1 l_1 × n_2 l_2 × ... × n_d l_d}.
2: As (2) indicates, generate d cores G_1, G_2, ..., G_d, where each core follows G_i ∈ R^{r_{i-1} × n_i l_i × r_i}. Since we prefer random input weights, G_i is randomly generated with small ranks r_{i-1} and r_i.
3: Perform the tensor-train matrix-by-matrix product preH = XW_1 + B_1.
4: Apply the activation function H_1 = 1/(1 + e^{−preH}).
5: Calculate the output weight W_2 by modified alternating least-squares (MALS), which is equivalent to the matrix calculation W_2 = (H_1^T H_1)^{−1} H_1^T T.
6: Calculate the training accuracy Acc. If it is less than required, increase L and repeat from step 1.
7: END Training

C. Deep Tensorized Neural Network

To build a multilayer neural network as shown in Fig. 3(a), we propose a layer-wise training process based on stacked auto-encoders. An auto-encoder layer sets the single-layer output T equal to the input X and finds an optimal weight to represent the input itself [15], [25], [26]. By stacking auto-encoder layers with a final decision layer on top, we can build the multilayer neural network. Therefore, Algorithm 1 can also be viewed as unsupervised layer-wise learning based on an auto-encoder by changing the desired output T into the input features X.

Our proposed LTNN applies layer-by-layer learning as shown in Fig. 3(b). The learning process is the minimization of

arg min_{W_l} ||f(f(H_l W_l^0 + B_l^0) W_l + B_l) − H_l||^2    (10)

where W_l is the weight learned by the auto-encoder, which will be passed to layer l of the multilayer neural network, W_l^0 and B_l^0 are the randomly generated input weight and bias of this auto-encoder, and f(·) is the activation function. Algorithm 2 summarizes the whole training process of the multilayer tensorized neural network.
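A compact way to see why (9) avoids forming the full weight matrix is to reshape the input into a d-dimensional tensor and contract it with one core at a time. The NumPy sketch below implements such a tensor-train matrix-by-vector product for a single input vector and checks it against a small dense reconstruction; the mode sizes and ranks are again illustrative assumptions rather than values from the paper.

import numpy as np

def tt_matvec(cores, x, in_modes):
    """y = W x where W is given by TT-matrix cores of shape (r_{k-1}, n_k, l_k, r_k).
    The full N x L matrix is never formed, in the spirit of (9)."""
    t = x.reshape([1] + list(in_modes))                # (r_0=1, l_1, ..., l_d)
    for G in cores:
        # contract the running rank index and the current input index j_k
        t = np.tensordot(G, t, axes=([0, 2], [0, 1]))  # (n_k, r_k, l_{k+1},..., i_1,...)
        t = np.moveaxis(t, 0, -1)                      # (r_k, l_{k+1},..., i_1,..., i_k)
    return t.reshape(-1)                               # flatten over (i_1, ..., i_d)

def tt_to_dense(cores):
    """Reconstruct the full matrix, only for checking the small example below."""
    W = np.ones((1, 1, 1))
    for G in cores:
        r0, n, l, r1 = G.shape
        W = np.einsum('abr,rnls->anbls', W, G).reshape(W.shape[0] * n, W.shape[1] * l, r1)
    return W[:, :, 0]

# Assumed toy factorization: N = 4*4*4 outputs, L = 3*3*3 inputs, internal TT-ranks 2.
out_modes, in_modes, rank = [4, 4, 4], [3, 3, 3], 2
ranks = [1, rank, rank, 1]
cores = [np.random.randn(ranks[k], out_modes[k], in_modes[k], ranks[k + 1])
         for k in range(3)]
x = np.random.randn(int(np.prod(in_modes)))
print(np.allclose(tt_matvec(cores, x, in_modes), tt_to_dense(cores) @ x))  # True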

Fig. 3: (a) Deep neural network. (b) Layer-wise training process for the deep neural network.

The auto-encoder reconstructs the (possibly noisy) input under certain constraints, and the desired output is set to the input itself, as shown in Fig. 3(b). Therefore, the first training stage performs auto-encoder training layer by layer. The last layer is then trained by the modified alternating least-squares (MALS) method as shown in Algorithm 1. The optimization process of MALS will be discussed in Section IV.

Convolutional neural networks (CNN) are widely applied for image recognition. The process of convolving a signal can be mathematically defined as

Y_{i'', j'', d''} = \sum_{i'=1}^{H'} \sum_{j'=1}^{W'} \sum_{d'=1}^{D} F_{i', j', d', d''} × X_{i''+i'−1, j''+j'−1, d'}    (11)

where X ∈ R^{H×W×D} is the input, F ∈ R^{H'×W'×D×D''} is the filter and Y ∈ R^{H''×W''×D''} is the output of this convolution layer (the stride is assumed to be 1 and no padding is applied to the input data). H, W and D are the input data height, width and number of channels, whereas H', W' and D'' specify the kernels accordingly. A bias term B can be added to (11), but we omit it for clearer presentation. A fully-connected layer is a special case of the convolution operation, defined when the output map has dimensions W'' = H'' = 1. For a tensorized CNN, we treat the CNN as a feature extractor and use LTNN to replace the fully-connected layers to compress the neural network, since the fully-connected layers consume a significantly large portion of the parameters. For example, a VGG-16 net [14] has 89.39% (472 MB/528 MB) of its parameters in the 3 fully-connected layers. As shown in Fig. 4, repeated blocks of convolutional, pooling and activation layers are stacked on top; we regard this as the CNN feature extractor. These features are then used in LTNN as input features together with the labels for further training. Finally, the whole network can be fine-tuned with backward propagation (BP).

Algorithm 2 Stacked auto-encoder based layer-wise training of multilayer LTNN
Input: Input set (X, T), where X is the input data and T is the label matrix; NL number of layers; activation function G(a_i, b_i, x_j); maximum number of hidden neurons L; and accepted training accuracy Acc.
Output: Neural network output weight β
1: While l < NL − 1
2: Prepare the auto-encoder training set (H_l, L, Acc), where H_l = X for the first layer.
3: Perform the least-squares optimization on layer l: arg min_{W_l} ||f(f(H_l W_l^0 + B_l^0) W_l + B_l) − H_l||^2
4: Calculate the next layer output H_{l+1} = f(H_l, W_l, B_l)
5: END While
6: For the final layer, prepare the feed-forward intermediate results H_{NL−1} and the labels T.
7: Perform the Algorithm 1 training process for the final layer weight W_{NL−1}
8: If the training accuracy is less than the required Acc, perform BP based on (12)
9: END Training

IV. LAYER-WISE TRAINING ALGORITHMS

A. Layer-wise Training of LTNN

Layer-wise training of LTNN requires solving a least-squares problem in the tensor-train data format. For the output weight W_2 in (6), direct tensorization of W_2 is similar to a high-order SVD, and the performance degradation is significant even at a relatively small compression rate. Instead, we propose a tensor-train based least-squares training method using the modified alternating least-squares algorithm (also known as the density matrix renormalization group in quantum dynamics) [20], [27]. The modified alternating least-squares (MALS) procedure for the minimization of ||H_1 W_2 − T||^2 works as follows.
1) Initialization: randomly initialize the cores G and set W_2 = G_1 × G_2 × ... × G_d.
2) Sweep of cores: core G_k is optimized with the other cores fixed, sweeping left-to-right from k = 1 to k = d.
3) Supercore generation: create the supercore X(k, k+1) = G_k × G_{k+1} and find it by minimizing the least-squares problem ||H_1 × Q_{k−1} × X_{k,k+1} × R_{k+2} − T||^2, where Q_{k−1} = \prod_{i=1}^{k−1} G_i and R_{k+2} = \prod_{i=k+2}^{d} G_i are reshaped to fit the matrix-matrix multiplication.
4) Supercore splitting: compute the SVD X(k, k+1) = U S V^T and let G_k = U and G_{k+1} = S V^T × G_{k+1}; G_k is determined and G_{k+1} is updated. A truncated SVD can also be performed by removing the smaller singular values to reduce the ranks.
5) Sweep termination: terminate if the maximum number of sweeps is reached or the error is smaller than required.
The low-rank initialization is very important to obtain a small rank r for each core.
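To make the sweep concrete, the sketch below shows the split stage of one MALS step: two neighbouring cores are merged into a supercore and then split back by a truncated SVD that discards small singular values, which is how the ranks between cores are reduced. It is a simplified illustration of steps 3)-4) only (the least-squares re-optimization of the supercore is omitted); the shapes and the truncation threshold are assumptions.

import numpy as np

def split_supercore(X_super, n_k, l_k, n_k1, l_k1, r_prev, r_next, tol=1e-8):
    """Split a supercore back into two TT-matrix cores via truncated SVD.
    X_super has shape (r_prev, n_k, l_k, n_k1, l_k1, r_next)."""
    # Unfold: rows gather (r_prev, n_k, l_k), columns gather (n_k1, l_k1, r_next).
    M = X_super.reshape(r_prev * n_k * l_k, n_k1 * l_k1 * r_next)
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    r_new = max(1, int(np.sum(S > tol * S[0])))        # drop small singular values
    G_k = U[:, :r_new].reshape(r_prev, n_k, l_k, r_new)
    G_k1 = (np.diag(S[:r_new]) @ Vt[:r_new]).reshape(r_new, n_k1, l_k1, r_next)
    return G_k, G_k1, r_new

# Assumed small example: merge two cores into a supercore, then split with truncation.
r_prev, r_mid, r_next = 1, 6, 1
G_k = np.random.randn(r_prev, 4, 3, r_mid)
G_k1 = np.random.randn(r_mid, 4, 3, r_next)
supercore = np.einsum('aijb,bklc->aijklc', G_k, G_k1)                  # merge (step 3)
A, B, r_new = split_supercore(supercore, 4, 3, 4, 3, r_prev, r_next)   # split (step 4)
print("new rank between the cores:", r_new)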

Fig. 4: Convolutional neural network with tensorized weights on the fully-connected layers for neural network compression.

Fig. 5: Diagrammatic notation of the tensor-train format and the optimization process of the modified alternating least-squares method.

Each supercore generation is the process of solving a least-squares problem. The complexity of the least-squares solve for X is O(n_m R r^3 + n_m^2 R^2 r^2) [27], and the SVD compression requires O(n_m r^3), where R, r and n_m are the rank of the activation matrix H_1, the maximum rank of the cores G, and the maximum mode size of W_2, respectively. By using the truncated SVD, we can adaptively reduce the rank of each core to reduce the computational complexity.

Fig. 5 gives an example of the MALS sweeping algorithm in the diagrammatic notation of tensors (the diagrammatic notation of tensors is detailed in [20]). Firstly, we perform a random guess of the tensor cores G, each of which is represented by a node with edges.

An edge represents a dimension of the tensor core, so the leftmost and rightmost cores have two edges and the rest have three edges. This is exactly the tensor core definition G ∈ R^{r_{k-1} × n_k × r_k} with boundary condition r_0 = r_d = 1, as discussed in Section III-A. Then, adjacent cores are merged together to form a supercore. Such a supercore is optimized by a standard least-squares optimization. After that, the supercore is split into two cores by a truncated SVD, where small singular values are removed. Finally, the sweep continues from left to right and then sweeps back until the algorithm reaches the maximum number of sweeps or satisfies the minimum error requirement.

The number of dimensions d is chosen based on the weight dimensions of the fully-connected layers. Given the same rank, the larger the number of dimensions, the higher the compression that can be achieved. However, to maintain a reasonable accuracy, the number of dimensions is usually set to 4 to 6, each mode size is selected to be almost the same, and the number of dimensions d is then adjusted based on the performance.

B. Fine-tuning of LTNN

End-to-end learning of LTNN is desirable to further improve the accuracy of the layer-wise learned network. Backward propagation (BP) is widely used to train deep neural networks. For LTNN, the gradient of a tensorized layer can be computed as

\frac{\partial E}{\partial G_k} = \sum_{i} \frac{\partial E}{\partial H(i)} \frac{\partial H(i)}{\partial G_k}    (12)

where E is the loss function and H(i) has the same definition as in (9). The computational complexity is very high (O(d^2 r^4 n_m max(M, N))), which increases the training time and limits the number of epochs. Therefore, it is necessary to have a good initialization from our proposed layer-wise learning method to reduce the number of epochs required for BP.

C. Quantization of LTNN

To achieve a high compression rate without accuracy loss, we further show how to use fewer bits to represent the weights in the tensor-train data format. Instead of performing quantization on the weight W, we propose a non-uniform quantization on the tensor cores G as shown below.

Let X represent the vectorized tensor-train cores G and let X̂ be the representative levels. Given a probability density function (pdf) f_X(x), our objective is to find

min_{X̂} MSE = E[(X − X̂)^2]    (13)

where X̂ holds the quantized representative levels. This can be solved by an iterative optimization for the quantizer design. Firstly, we make a random guess of the representative levels x̂_q. Then we calculate the decision thresholds as t_q = (x̂_q + x̂_{q−1})/2, where q = 1, 2, 3, ..., M_q − 1 and M_q is the number of levels. The new representative values can then be calculated as

x̂_q = \frac{\int_{t_q}^{t_{q+1}} x f_X(x) dx}{\int_{t_q}^{t_{q+1}} f_X(x) dx}    (14)

We iteratively recalculate the decision thresholds and the representative values until convergence is reached for the optimal quantizer. Note that we can estimate the pdf f_X(x) by sampling the tensor-core weight values. Detailed results will be shown in the experiments.

D. Network Interpretation of LTNN

The tensor-train based neural network fits naturally into the multilayer neural network framework. We explain this from both the deep-features perspective and the stacked auto-encoder perspective. By representing the weights in the tensor-train data format, we effectively approximate the deep neural network architecture in a compressed manner. The tensor cores are not unique and can be orthogonalized by a left-to-right or right-to-left sweep. This sweep is achieved by performing a singular value decomposition (SVD) on the tensor cores:

G = U S V^T    (15)

where U and V are orthogonal and S is the singular value matrix. For a left-to-right sweep, we can keep U and merge S V^T into the next core. This process can be written as

W(i_1, i_2, ..., i_d) = G_1(i_1) G_2(i_2) ... G_d(i_d) = U_1 S_1 V_1^T G_2(i_2) ... G_d(i_d) = ... = U_1 U_2 ... U_{d−1} C    (16)

where U_1, U_2, ... are orthogonal cores obtained by the SVD operations and C is the final core of these weights. For such a neural network weight W, the input features pass through sequential orthogonal transformations U_1, U_2, ..., and then, by multiplying with C, the feature space is projected to a larger or smaller space.

From the stacked auto-encoder perspective, the learned tensor cores are equivalent to an auto-encoder with the reconstruction optimization

arg min_{W_L} ||f(H_L W_L^T) W_L − H_L||^2   subject to   W_L^T W_L = I    (17)

where W_L^0 is randomly generated and the activation function f(·) can be removed. The orthogonality constraint on W_L is also added. Under such reconstruction constraints, the training objective can be represented as

arg min_{W_1, ..., W_L} ||W_1 W_2 ... W_L W_f X − T||^2    (18a)

arg min_{W} ||W W_f X − T||^2    (18b)

where W_f represents the final decision layer, which is not determined by the auto-encoder. (18a) is equivalent to finding a tensor W with tensor-train decomposition cores W_1, W_2, ..., W_L, as shown in (18b). Under such an interpretation, we expect a similar behavior of the tensorized neural network to that of stacked auto-encoder based neural networks.
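Returning to the quantizer design in (13)-(14), the iteration is essentially a Lloyd-Max procedure. The sketch below runs it on samples drawn from the tensor-core weights, using the empirical samples in place of the analytic pdf f_X(x); the number of levels and the synthetic data are assumptions for illustration.

import numpy as np

def lloyd_max_quantizer(samples, num_levels=16, iters=50):
    """Non-uniform quantizer: alternate between decision thresholds (midpoints)
    and representative levels (conditional means), following (13)-(14)."""
    levels = np.quantile(samples, np.linspace(0.02, 0.98, num_levels))  # initial guess
    for _ in range(iters):
        thresholds = 0.5 * (levels[1:] + levels[:-1])    # t_q = (x_q + x_{q-1}) / 2
        bins = np.digitize(samples, thresholds)          # assign each sample to a cell
        for q in range(num_levels):                      # x_q = conditional mean of cell q
            cell = samples[bins == q]
            if cell.size:
                levels[q] = cell.mean()
    return levels, thresholds

# Stand-in for vectorized tensor-core weights (roughly Gaussian, as observed in Fig. 7).
samples = np.random.randn(100000) * 0.1
levels, thresholds = lloyd_max_quantizer(samples, num_levels=2**4)   # e.g. a 4-bit quantizer
quantized = levels[np.digitize(samples, thresholds)]
print("quantization MSE:", np.mean((samples - quantized) ** 2))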

TABLE II: Specification of benchmark datasets [29]
Name | Training Samples | Test Samples | Attributes | Classes
Iris | 120 | 30 | 3 | 4
Adult | 24580 | 6145 | 14 | 2
Credit | 552 | 138 | 14 | 2
Diabetes | 154 | 612 | 8 | 2
Glass | 171 | 43 | 10 | 7
Leukemia | 1426 | 5704 | 38 | 2
Liver | 276 | 70 | 16 | 2
Segment | 1848 | 462 | 19 | 12
Wine | 142 | 36 | 3 | 12
Mushroom | 6499 | 1625 | 22 | 3
Vowel | 422 | 106 | 10 | 11
Shuttle | 11600 | 2900 | 9 | 7
CIFAR-10 [34] | 50000 | 10000 | 32x32x3 | 10
MNIST [30] | 60000 | 10000 | 28x28 | 10
MSRC-12‡ [35] | 2916 | 2965 | 1892 | 12
MSR-Action3D‡ [36] | 341 | 309 | 672 | 20
‡ Extracted features from action sequences

V. PERFORMANCE EVALUATION AND ANALYSIS

A. Experiment Setup

The neural network design is performed in Matlab using the Tensor-train toolbox [10], [27] and MatConvNet [28]. All experiments are performed on a Dell server with 3.47 GHz Xeon cores and 48 GB RAM. Four GPUs (two Quadro 5000 and two GTX-1080 Ti) are also used to accelerate the backward propagation (BP) training process of the tensor-train layers [10]. We analyze shallow LTNN and deep LTNN on the UCI [29] and MNIST [30] datasets. To evaluate the model compression, we compare our shallow LTNN with an SVD-based node-pruning method [31] and a general neural network [24]. We also compare the auto-encoder based deep LTNN with various other works [10], [19], [32], [33]. The details of each dataset are summarized in Table II.

B. Shallow LTNN

As discussed in Section III, a shallow LTNN is a single-hidden-layer feedforward neural network. The first layer is a randomly generated tensor-train based input weight and the second layer is optimized by the modified alternating least-squares (MALS) method. In this experiment, we find that the tensor-train based neural network shows a fast testing process with a compressed model when the tensor rank is small. To evaluate this, we apply the proposed learning method in comparison with a general neural network [24] on the UCI datasets. Please note that the memory required for LTNN is \sum_{k=1}^{d} n_k r_{k-1} r_k compared to N = n_1 × n_2 × ... × n_d, and the computation can be sped up from O(NL) to O(d r^2 n_m max(N, L)), where n_m is the maximum mode size of the tensor-train.

Fig. 6: Visualization of the layer-wise learned weights of layer 1, obtained by reshaping each independent column into a square matrix, with (a) rank = 50 and (b) rank = 10 on the MNIST dataset. (The range of the weights is scaled to [0, 1] and mapped to a grey-scale map, where 0 is mapped to black and 1 to white.)

TABLE III: Performance comparison between the tensorized neural network (TNN), a general neural network and an SVD-pruned neural network on the UCI datasets
Dataset§ | No. of Hid. | Test-time (s) TNN‡ | Test-Acc TNN‡ | Test-time (s) General NN | Test-Acc General NN | Test-time (s) SVD | Test-Acc SVD | Model Cmp. TNN | Model Cmp. SVD | Acc. loss TNN | Acc. loss SVD | Speed-up TNN | Speed-up SVD
Wine | 512 | 1.79E-02 | 0.8649 | 2.43E-02 | 0.7027 | 2.80E-02 | 0.7297 | 1.043 | 1.111 | -0.162 | -0.027 | 1.358 | 0.868
Iris | 64 | 7.57E-04 | 0.968 | 2.20E-03 | 0.991 | 8.73E-04 | 0.935 | 1.869 | 1.103 | 0.023 | 0.056 | 2.906 | 2.520
Adult | 128 | 8.70E-03 | 0.788 | 1.04E-02 | 0.784 | 9.36E-03 | 0.789 | 1.766 | 1.113 | -0.004 | -0.005 | 1.195 | 1.111
Credit | 128 | 9.26E-04 | 0.778 | 2.20E-03 | 0.798 | 2.02E-03 | 0.743 | 3.137 | 1.113 | 0.020 | 0.055 | 2.376 | 1.087
Diabetes | 128 | 8.07E-04 | 0.71 | 1.80E-03 | 0.684 | 1.70E-03 | 0.496 | 1.320 | 1.113 | -0.026 | 0.188 | 2.230 | 1.061
Glass | 512 | 2.00E-03 | 0.886 | 9.90E-03 | 0.909 | 5.20E-03 | 0.801 | 1.816 | 1.111 | 0.023 | 0.108 | 4.950 | 1.904
Leukemia | 256 | 4.97E-04 | 0.889 | 1.50E-03 | 0.889 | 1.49E-03 | 0.778 | 1.580 | 1.113 | 0.000 | 0.111 | 3.018 | 1.008
Liver | 128 | 6.15E-04 | 0.714 | 2.00E-03 | 0.685 | 1.63E-03 | 0.7 | 1.837 | 1.113 | -0.029 | -0.015 | 3.252 | 1.228
Mushroom | 512 | 1.76E-02 | 0.9926 | 1.95E-02 | 0.9951 | 1.84E-02 | 0.9948 | 1.880 | 1.111 | 0.002 | 0.000 | 1.108 | 1.060
Segment | 128 | 5.72E-03 | 0.873 | 1.84E-02 | 0.886 | 1.52E-02 | 0.847 | 1.598 | 1.113 | 0.013 | 0.039 | 3.207 | 1.211
Shuttle | 1024 | 5.29E-02 | 0.995 | 5.41E-02 | 0.989 | 3.81E-02 | 0.986 | 1.468 | 1.111 | -0.006 | 0.003 | 1.023 | 1.419
Vowel | 256 | 9.50E-03 | 0.9626 | 3.04E-02 | 0.9159 | 1.76E-02 | 0.935 | 1.233 | 1.113 | -0.047 | -0.019 | 3.200 | 1.727
§ We randomly choose 80% of the total data for training and 20% for testing. ‡ The rank is initialized to 2 for all tensor cores; no BP fine-tuning is performed.

TABLE IV: Model compression under different numbers of hidden nodes and tensor ranks on the MNIST dataset by layer-wise training
No. Hid.† | 32 | 64 | 128 | 256 | 512 | 1024 | 2048
Compression | 5.50 | 4.76 | 3.74 | 3.23 | 3.01 | 4.00 | 8.18
Q. Compr. | 22 | 15.92 | 14.96 | 12.92 | 10.70 | 14.22 | 26.18

Rank‡ | 15 | 20 | 25 | 30 | 35 | 40 | 45
Compr. (2048) | 25.19 | 22.51 | 20.34 | 17.59 | 14.85 | 12.86 | 8.63
Q. Compr. | 100.76 | 80.04 | 72.32 | 62.54 | 59.40 | 41.15 | 27.62
Accuracy (%) | 90.42 | 90.56 | 91.41 | 91.67 | 93.47 | 93.32 | 93.86
Compr. (1024) | 11.11 | 9.75 | 6.87 | 6.06 | 5.74 | 4.39 | 3.89
Q. Compr. | 39.50 | 39 | 24.43 | 21.55 | 18.37 | 15.61 | 12.45
Accuracy (%) | 91.24 | 92.67 | 93.11 | 93.89 | 94.54 | 95.51 | 96.32
† All tensor ranks are initialized to 50 without fine-tuning. ‡ The number of hidden nodes is fixed to 2048 (1024) with 4 fully-connected layers.

Fig. 7: Histograms of the layer-1 and layer-2 tensor-core weights of the tensorized neural network with approximated Gaussian distributions.

Table III shows a detailed comparison of speed-up, model compression and accuracy between LTNN, a general neural network and an SVD-pruned neural network. It clearly shows that the proposed method accelerates the testing process compared to the general neural network. In addition, our proposed method only suffers around 2% accuracy loss, whereas the SVD-based method shows varied losses (up to 18.8%). Furthermore, by tuning the tensor rank we can achieve 3.14× compression for the Diabetes UCI dataset. Since we apply 10% node pruning by removing the smallest singular values, the model compression of the SVD method remains almost the same for the different benchmarks.

As discussed in Section IV-C, bit-width configuration is an effective method to reduce the model size. To achieve this goal, we apply non-uniform quantization to the tensor-core weights. As shown in Fig. 7, the probability density function of the tensor-core weights can be modeled as a Gaussian distribution. For such a known pdf, we can effectively find the optimal representative levels with minimized mean square error. Fig. 8 shows the trade-off between accuracy, bit-width and compression rate on the MNIST dataset with a shallow tensorized neural network.

C. Deep LTNN

For a deep tensorized neural network, we mainly investigate the auto-encoder based multilayer neural network on the MNIST dataset. We first discuss the filters learned by the proposed auto-encoder. Then the hyper-parameters of LTNN, such as the tensor-train ranks and the number of hidden nodes, are discussed with respect to the testing time, the neural network compression rate and the accuracy.

Fig. 6 shows the first-layer filter weights of the proposed auto-encoder.

TABLE V: CNN architecture parameters and compressed fully-connected layers
Layer | Type | No. of maps and neurons | Kernel | Stride | Pad | No. Param. | Compr. Param.
0 | Input | 3 maps of 32x32 neurons | — | — | — | — | —
1 | Convolutional | 32 maps of 32x32 neurons | 5x5 | 1 | 2 | 2432 | 2432
2 | Max Pooling | 32 maps of 16x16 neurons | 3x3 | 2 | [0 1 0 1] | — | —
3 | ReLU | 32 maps of 16x16 neurons | — | — | — | — | —
4 | Convolutional | 32 maps of 16x16 neurons | 5x5 | 1 | 2 | 25632 | 25632
5 | ReLU | 32 maps of 16x16 neurons | — | — | — | — | —
6 | Ave. Pooling | 32 maps of 8x8 neurons | 3x3 | 2 | [0 1 0 1] | — | —
7 | Convolutional | 32 maps of 8x8 neurons | 5x5 | 1 | 2 | 25632 | 25632
8 | Ave. Pooling | 32 maps of 8x8 neurons | 3x3 | 2 | [0 1 0 1] | — | —
9 | Reshape | 512 maps of 1x1 neurons | — | — | — | — | —
10 | Fully-Connected | 64 maps of 1x1 neurons | 1x1 | 1 | 0 | 32832 | 7360 (4.40×)
11 | ReLU | 64 maps of 1x1 neurons | — | — | — | — | —
12 | Fully-Connected | 512 maps of 1x1 neurons | 1x1 | 1 | 0 | 33280 | 10250 (3.747×)
13 | Fully-Connected | 10 maps of 1x1 neurons | 1x1 | 1 | 0 | 5130 | —

TABLE VI: Test-error comparison with 64× model compression for a single-hidden-layer neural network on the MNIST dataset
Method | Error rate
Random Edge Removal [32] | 15.03%
Low Rank Decomposition [19] | 28.99%
Distilled Knowledge in a Neural Network [33] | 6.32%
Hashing Trick Based Compression [37] | 2.79%
Tensorizing Neural Network [10] | 1.90%
LTNN | 2.21%
Quantized LTNN‡ | 1.59%
‡ Only the tensorized layers are quantized, with 9-bit precision.

Fig. 8: Compression rate and accuracy with increasing bit-width of the tensor-core weights.

The LTNN architecture for MNIST is W_1 (784 × 1024), W_2 (1024 × 1024) and W_3 (1024 × 10). We re-arrange each column into a square image and visualize it in one cell of the visualization panel; only the independent columns are visualized. Therefore, from Fig. 6, we can see that the larger the rank, the more independent columns and the more visualization cells. We can also find that in Fig. 6(a) the large tensor ranks learn more filters. The first three rows contain mainly low-frequency information, the middle rows contain some detailed information of the input images, and in the last few rows we can see more sparse cells representing high-frequency information. In comparison, Fig. 6(b) shows fewer independent filters due to the smaller tensor rank. However, we can still find similar filter patterns; it is a subset of the Fig. 6(a) filter results. Therefore, by reducing the ranks we can effectively tune the number of filters, which provides the freedom to find an optimal number of filters to save parameters.

Fig. 9 shows the testing accuracy and running time comparisons on the MNIST dataset. It shows a clear trend of accuracy improvement with an increasing number of hidden nodes. The running times of LTNN and the general NN are almost the same. This is due to the relatively large rank r = 50 and the computation cost of O(d r^2 n_m max(N, L)). Such a tensor-train based neural network achieves 4× compression within 1.5% accuracy loss for 1024 hidden nodes. Details on the model compression are shown in Table IV. From Table IV, we can observe that the compression rate is directly connected with the rank r, where the memory storage can be simplified as dnr^2 from \sum_{k=1}^{d} n_k r_{k-1} r_k, but is not directly linked to the number of hidden nodes. We also observe that by setting the tensor-core rank to 35, a 14.85× model compression can be achieved with an acceptable accuracy loss. The compression rate can be further improved to 59.4× using the quantized LTNN. Table IV also shows the clear effect of quantization on the compression rate. In general, bit-width quantization helps gain roughly 3× more compression of the neural network. Therefore, low-rank and quantized tensor cores are important and orthogonal methods to increase the compression rate.

Fig. 10 shows the increasing accuracy when we increase the rank of the tensor-train based weights. Here, we use two factorizations of the 28 × 28 input image, which we refer to as LTNN 1 (2 × 2 × 2 × 2 × 7 × 7) and LTNN 2 (4 × 4 × 7 × 7). We observe that changing the weight factorization affects the accuracy of the neural network only slightly, by around 1%. Furthermore, we find that the trend of the compression rate is the same for both factorization modes, but LTNN 1 can compress more parameters. We conclude that decomposing the weights into more tensor cores empirically improves the compression rate as well as the accuracy.

D. Fine-tuned LTNN

The proposed LTNN can also perform end-to-end learning to fine-tune the compressed model and improve the accuracy. With 1024 hidden nodes and a maximum rank of 15, we achieve 91.24% testing accuracy. We then perform a left-to-right sweep to remove small singular values so that the compression rate becomes 64×.

Fig. 9: Testing time and accuracy comparison between LTNN and a general neural network (Gen.) with a varying number of hidden nodes.

Fig. 10: Compression and accuracy comparison between two factorizations of the tensorized weights, LTNN 1 (2 × 2 × 2 × 2 × 7 × 7) and LTNN 2 (4 × 4 × 7 × 7), with varying ranks.

Fig. 11: Fine-tuning process of the proposed LTNN layer-wise training method with 64× compression for a single-hidden-layer neural network on the MNIST dataset.

This compression rate is set mainly for comparison with other works. After fixing the compression rate, the fine-tuning process is shown in Fig. 11. Here, the top-1 error refers to the prediction error after one guess and the top-5 error refers to the prediction error after five guesses. We find a steep accuracy improvement in the first 5 epochs, which flattens after 20 epochs. The final error rate is 2.21%. By adopting a non-uniform quantization of 9 bits on LTNN with the compression rate fixed at 64×, we achieve an even lower error rate (1.59%). We also compare this with other works, as summarized in Table VI.

To have a fair comparison with [4], we also adopt the same network, the LeNet-300-100 network [30], which is a four-layer fully-connected neural network (784 × 300, 300 × 100, 100 × 10). Under such a configuration, [4] achieves 40× compression with a 1.58% error rate using quantization (6-bit precision), pruning and Huffman coding techniques. Our proposed LTNN, at the same 40× compression rate, achieves a 1.63% error rate under single floating-point (32-bit) precision. By adopting 9-bit precision on the tensorized layer, we achieve a smaller error rate of 1.55% at the same compression rate.³ Please note that by using more advanced techniques such as weight sharing and Huffman coding [4], our proposed LTNN could achieve even more compression.

³ The improvement of accuracy is mainly due to the increased rank value, since both tensor-train and quantization techniques are applied to maintain the 64× compression rate.

Fig. 12: Forward and backward computation speed, measured as batch size divided by time (data/second), for varying tensor-train ranks.

We leave this for future study. We also perform an additional study of the computation speed for different tensor-train ranks. Although [10] reports the asymmetric training complexity O(d^2 r^4 n_m max(M, N)) and testing complexity O(d r^2 n_m max(N, L)), we find that a modern computation framework can greatly compensate for the speed difference using automatic differentiation. Fig. 12 shows that the difference between training speed and testing speed stays almost the same with increasing tensor-train ranks. Here, the computation speed refers to the number of images processed per second for a batch of 100 images of size 32 × 32 × 3 (batch size divided by time). We use a four-layer neural network for this experiment, with weight 1 (3072 × 262144), weight 2 (262144 × 4096) and weight 3 (4096 × 10). The first two weight matrices are in the tensor-train data format. Weight 1 adopts the input mode (4 × 4 × 4 × 4 × 4 × 3) and output mode (8 × 8 × 8 × 8 × 8 × 8), whereas weight 2 uses (8 × 8 × 8 × 8 × 8 × 8) and (4 × 4 × 4 × 4 × 4 × 4), respectively. The ranks are set the same for all the tensorized neural networks and are varied from 2 to 18, as shown in Fig. 12.

TABLE VII: Neural network compression algorithm comparisons on ImageNet2012
Neural Network | Top-1 Error | Top-5 Error | Model Size | Compression Rate
Fastfood 32 AD [39] | 42.78% | – | 131 MB | 2×
SVD [40] | 44.02% | 20.56% | 47.6 MB | 5×
Collins & Kohli [41] | 44.40% | – | 61 MB | 4×
Pruning+Quantization [4] | 31.17% | 10.91% | 17.86 MB | 31×
Tensorized NN [10] | 32.2% | 12.3% | 71.35 MB | 7.4×
Proposed LTNN‡ | 33.63% | 13.15% | 60 MB | 8.8×
Proposed LTNN-TF | 35.636% | 14.07% | 60 MB | 8.8×
Proposed Quantized LTNN | 33.63% | 13.95% | 14.73 MB | 35.84×
‡ The reported result is trained on MatConvNet with a 0.0001 learning rate on the convolutional layers and a 0.001 learning rate on the tensorized layers.

The timing is measured on a single Nvidia GTX-1080 Ti machine. We further adopt different backward propagation algorithms, namely plain backward propagation (BP) and adaptive moment estimation (Adam) [38]. We find that BP and Adam perform at almost the same speed for the same tensor-train ranks. We also observe that the tensor-train ranks greatly affect both the training speed and the testing speed. As such, a layer-wise training method that finds the minimum tensor ranks is greatly needed to optimize the performance. Therefore, we can conclude that using a tensor-train layer for the fully-connected layers provides a more flexible trade-off between network compression and accuracy, that end-to-end fine-tuning can further improve the accuracy without compromising the compression rate, and that adopting the layer-wise pre-training method can also improve the inference speed.

VI. APPLICATIONS

In this section, we further discuss two applications. One is object recognition using a deep convolutional neural network on the CIFAR-10 [34] and ImageNet2012 [42] datasets. The other is human action recognition on the MSRC-12 Kinect and MSR-Action3D datasets [35], [36]. The main objective here is to achieve state-of-the-art performance with significant neural network compression.

A. Object Recognition

Here, we discuss object recognition with a convolutional neural network (CNN) on the CIFAR-10 dataset. The CIFAR-10 dataset consists of 60,000 images of size 32 × 32 × 3 in 10 different classes; it contains 50,000 training images and 10,000 testing images. For the CNN architecture, we adopt a LeNet-like neural network [30], as shown in Table V. We use a two-layer fully-connected tensorized neural network to replace the original fully-connected layers. The 512 neuron outputs are treated as input features, and we build two tensorized layers of 512 × N and N × 10, where N represents the number of hidden nodes. For N = 512, our proposed LTNN has 12416 (7296 + 5120) parameters in the fully-connected layers, which corresponds to 5.738× and 1.752× compression for the fully-connected layers and the whole neural network, respectively. The testing accuracy is 75.62%, with a 3.82% loss compared to the original network. For N = 1024, our proposed LTNN performs compression of 4.045× and 1.752× for the fully-connected layers and the whole neural network, respectively. The testing accuracy is 77.67%, with a 1.77% loss compared to the original network. This is slightly better than [10] in terms of accuracy, which reports 1.7× compression of the whole network with 75.61% accuracy. Another work [19] achieves around 4× compression of the fully-connected layers with around 74% accuracy and a 2% accuracy loss. By adopting non-uniform tensor-core quantization (6-bit precision), we achieve 21.57× and 2.19× compression of the fully-connected layers and the total neural network, respectively, with 77.24% accuracy (2.2% accuracy loss). Therefore, our proposed LTNN can achieve a high compression rate with maintained accuracy.

We also perform a large-scale experiment on the ImageNet2012 dataset [42], which has 1.2 million training images and 50,000 validation images in 1000 classes. We use the deep VGG16 [14], [10] as the reference model, which achieves 30.9% top-1 error and 11.2% top-5 error. The VGG16 model consists of 3 fully-connected layers with weight matrices of sizes 25088 × 4096, 4096 × 4096 and 4096 × 1000. We replace the 25088 × 4096 layer with a tensor with input mode 8 × 7 × 7 × 8 × 8 and output mode 4 × 4 × 4 × 4 × 4. Another fully-connected layer of 1024 × 1000 is inserted for the final classification. We re-use the VGG16 convolution filters and train the newly replaced fully-connected layers. The two layers are pre-trained by the layer-wise learning. We fine-tune the whole neural network using the default learning rate in MatConvNet [28] with 32 images per batch. Within 15 epochs in the MatConvNet framework, the fine-tuned neural network achieves 33.63% and 13.15% error rates for top-1 and top-5 accuracy, respectively. We randomly select 1000 training images to perform the layer-wise training, which takes 15226 s (0.42 h) to optimize the tensor-core ranks. Another 4964 s (1.38 h) is consumed to perform one batch of learning. The proposed training method (LTNN) takes 15 epochs (90 hours) to fine-tune, compared to 20 epochs (160 hours) for randomly initialized tensor layers (TNN) using backward propagation. The whole neural network is compressed by 8.8×. If we consider neural network weight quantization, we can further improve the top-5 error rate to 13.95% and the compression rate to 35.84×, with 8-bit precision for the convolution filters and 7-bit precision for the fully-connected layers.

We further perform the training using the TensorFlow [43] framework to analyze the training process with more advanced optimizers. We perform the training on the tensorized layers only, as discussed in [10], and achieve a 14.07% top-5 error rate, which shows that the fine-tuning process helps improve the accuracy by around 1%. Compared with the previous work [10], our compression rate is higher than its 7.4×. Note that we set the tensor-train format exactly following the description in [10], with input mode 2 × 7 × 8 × 8 × 7 × 4 and output mode 4 × 4 × 4 × 4 × 4 × 4. The other major difference between the proposed method and [10] is the adoption of layer-wise training for the weight initialization. For the ImageNet12 dataset, both methods compress the VGG16 neural network. Since the authors of [10] did not publish the source code for the ImageNet12 experiment, we implemented the code and performed the comparisons in the TensorFlow framework [43]. The Adam optimizer is chosen to perform the optimization with beta1 = 0.9 and beta2 = 0.999, where beta1 and beta2 represent the exponential decay rates of the first and second moment estimates.
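As a side note on the factorizations used above, the short calculation below counts the parameters of the 25088 × 4096 layer when it is stored as a TT-matrix with the stated input mode 8 × 7 × 7 × 8 × 8 and output mode 4 × 4 × 4 × 4 × 4. The uniform TT-rank is an assumed value, since the paper determines the ranks during layer-wise training, and the replaced layer's output size differs from the original fc6 layer.

import numpy as np

def tt_matrix_params(in_modes, out_modes, rank):
    """Parameter count of a TT-matrix with a uniform internal rank.
    Core k has shape (r_{k-1}, out_modes[k], in_modes[k], r_k)."""
    d = len(in_modes)
    ranks = [1] + [rank] * (d - 1) + [1]
    return sum(ranks[k] * out_modes[k] * in_modes[k] * ranks[k + 1] for k in range(d))

in_modes = [8, 7, 7, 8, 8]     # 25088 = 8*7*7*8*8 (input factorization from the paper)
out_modes = [4, 4, 4, 4, 4]    # 1024 = 4^5 (output factorization from the paper)
rank = 16                      # assumed uniform TT-rank, not specified in the paper

dense = 25088 * 4096           # the original VGG16 fully-connected layer it replaces
tt = tt_matrix_params(in_modes, out_modes, rank)
print("TT-layer parameters:", tt)
print("vs. original 25088x4096 layer:", dense, "(%.0fx smaller)" % (dense / tt))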

Fig. 13: Error rate improvement (top-1 and top-5, for TNN and LTNN) versus training time using the Adam optimizer.

Fig. 14: Human action from the MSRC-12 Kinect dataset: a sequence of 8 frames of the Start System gesture [44].

The learning rate is chosen to be 0.001 with exponential decay. The trained model is stored at intervals of 1350 seconds at first, and then a relatively longer interval is used, since the accuracy improvement becomes very slow. The 110,000 s training process is shown in Fig. 13. We find that the layer-wise training accelerates the training process in the first few epochs, which shows that adopting layer-wise training helps accelerate convergence. At around 10,000 s, the accuracy improvement becomes steady. To achieve a 15% top-5 error rate, it still takes LTNN 46557 s (12.9 h) and TNN 57484 s (15.96 h), respectively. For a 14% top-5 error rate, it takes 91445 s (25.31 h) and 111028 s (28.37 h) for LTNN and TNN, respectively. There is no significant training speed-up, since modern optimizers such as Adam [38] can easily escape saddle points. Another reason could be the small tensor-core ranks (higher model compression) leading to a more difficult training process. However, we find that by adopting layer-wise training, the inference speed is 1.615× faster compared to randomly initialized tensor cores. The inference speed (batch size divided by inference time) is 246.15 data/s and 152.38 data/s for LTNN and TNN, respectively. For a fair comparison, we compare with the deep compression result [4] using pruning and quantization only, since the Huffman coding method can also be applied to our method due to its lossless compression property. We observe around a 2% accuracy drop for our proposed quantized LTNN, which is mainly due to the naive convolution-layer quantization method. However, our quantization method requires no recursive retraining, whereas deep compression [4] requires recursive retraining, which takes weeks to compress the neural network. A summary of recent works on ImageNet2012 is shown in Table VII.

B. Human Action Recognition

MSRC-12 is a relatively large dataset for action/gesture recognition from 3D skeleton data [35], [44] recorded with Kinect sensors. The dataset has 594 sequences containing the performances of 12 gestures by 30 subjects.

TABLE VIII: Gesture classes and the number of annotated instances for each class in the MSRC-12 Kinect dataset

Gesture         Number of Insts.    Gesture          Number of Insts.
Start System    508                 Duck             500
Push Right      522                 Goggles          508
Wind it up      649                 Shoot            511
Bow             507                 Throw            515
Had Enough      508                 Change Weapon    498
Beat Both       516                 Kick             502

The summary of the 12 gestures and the number of instances per class is given in Table VIII. In this experiment, we adopt the standard configuration of splitting half of the subjects for training and half for testing. We use the covariance of 3D joints descriptor as the feature extractor. As shown in Fig. 14, the body is represented by K points, where K = 20 for the MSRC-12 dataset. A body action spans T frames, and we let x_i, y_i and z_i denote the coordinates of the i-th joint at frame t. The joint locations of a frame are collected into the vector S = [x_1, x_2, ..., x_K, y_1, y_2, ..., y_K, z_1, z_2, ..., z_K]. Therefore, the covariance of the sequence is

C(S) = \frac{1}{T-1} \sum_{t=1}^{T} \big(S - \tilde{S}\big)\big(S - \tilde{S}\big)^{T}    (19)

where \tilde{S} is the sample mean. Since the covariance matrix is symmetric, we only use its upper triangular part. We also add temporal information of the sequence to the features, following the same feature extraction process as [44].

Fig. 15: Confusion matrix of human action recognition on the MSRC-12 Kinect dataset with a 50% random subject split repeated 25 times.

Based on the aforementioned feature extraction, the input feature size of one sequence is 1892.
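A minimal sketch of this base descriptor, assuming the joint data are arranged as a (T, K, 3) array, is given below; it implements Eq. (19) and keeps only the upper-triangular entries. The temporal levels added on top of the covariance descriptor, which bring the feature up to the 1892 dimensions used here, are omitted.

```python
import numpy as np

def covariance_descriptor(joints_xyz):
    """Covariance-of-3D-joints descriptor of Eq. (19).

    joints_xyz: (T, K, 3) array of K joint coordinates over T frames
    (K = 20 for the MSRC-12 skeleton).
    Returns the upper-triangular part of the 3K x 3K covariance matrix.
    """
    T, K, _ = joints_xyz.shape
    # Per-frame vector S_t = [x_1..x_K, y_1..y_K, z_1..z_K]
    S = np.concatenate([joints_xyz[:, :, 0],
                        joints_xyz[:, :, 1],
                        joints_xyz[:, :, 2]], axis=1)        # (T, 3K)
    S_tilde = S.mean(axis=0, keepdims=True)                  # sample mean
    C = (S - S_tilde).T @ (S - S_tilde) / (T - 1)            # (3K, 3K)
    iu = np.triu_indices(3 * K)                              # covariance is symmetric
    return C[iu]                                             # 3K(3K+1)/2 = 1830 values for K = 20

# Hypothetical usage on a random 40-frame sequence
features = covariance_descriptor(np.random.randn(40, 20, 3))
print(features.shape)                                        # (1830,)
```

The additional temporal information mentioned above accounts for the difference between these 1830 base values and the 1892-dimensional classifier input.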


TABLE IX: MSRC-12 human action recognition accuracy and comparisons

Method                                              Accuracy
Hierarchical Model on Action Recognition [45]       66.25%
Extended LC-KSVD [46]                               90.22%
Temporal Hierarchy Covariance Descriptors [44]      91.70%
Joint Trajectory-CNN [47]                           93.12%
Sliding Dictionary-Sparse Representation [48]       92.89%
Proposed LTNN                                       93.40%

TABLE X: MSR-Action3D human action recognition accuracy and comparisons

Method                                              Accuracy
Recurrent Neural Network [49]                       42.50%
Hidden Markov Model [50]                            78.97%
Random Occupancy Pattern [51]                       86.50%
Temporal Hierarchy Covariance Descriptors [44]      90.53%
Rate-Invariant SVM [52]                             89.00%
Proposed LTNN                                       88.95%

For human action recognition, we use a four-layer neural network architecture with weight matrices 1892 × 1024, 1024 × 1024 and 1024 × 10 and the Sigmoid activation function. We set the maximum tensor-train rank to 25 and decompose the input weight matrix into an 8-dimensional tensor with input modes [2 2 11 43] and output modes [2 2 2 128]. Compared with the uncompressed network, the compression rate is 8.342× without and 28.26× with non-uniform quantization. Fig. 15 shows the confusion matrix of the proposed LTNN method, where a darker block represents a higher prediction probability. For example, for class 1 ("Start System") the correct prediction rate is 82%. As the first row shows, the most frequent mis-classifications are to class 9 and class 11, each with 5% probability. The worst case is class 11, which is predicted correctly with only 60% probability. The average prediction accuracy over 25 repetitions is 91.41%. For a fair comparison with other works, as shown in Table IX, we report our best prediction accuracy of 93.4%. This clearly shows that the LTNN classifier can perform human action recognition at the state-of-the-art level.

In addition, we have also performed 3D-action recognition on the MSR-Action3D dataset [36]. This dataset consists of twenty types of segmented actions: high arm wave, horizontal arm wave, hammer, hand catch, forward punch, high throw, draw x, draw tick, draw circle, hand clap, two hand wave, side boxing, bend, forward kick, side kick, jogging, tennis swing, tennis serve, golf swing, and pick up and throw. Each action starts and ends with a neutral pose and is performed by 10 subjects, each performing the action two or three times. We use 544 sequences with recordings of the skeleton joint locations. We apply the same feature extractor as for MSRC-12 and use the proposed LTNN as the classifier, adopting a four-layer neural network with weight matrices 672 × 1024, 1024 × 1024 and 1024 × 20. The compression rate is 3.318×, and it can be further improved to 11.80× with 9-bit quantization. As shown in Table X, compared with other methods we achieve an accuracy of 88.95%, which is comparable to the state-of-the-art.
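The MSRC-12 evaluation protocol described above (a random 50% subject split repeated 25 times, reporting the average and best accuracy and a row-normalized confusion matrix as in Fig. 15) can be sketched as follows. The LogisticRegression classifier is only a runnable stand-in for the trained LTNN, and scikit-learn is assumed only for the confusion-matrix utility; the dataset sizes in the usage example mirror the MSRC-12 numbers quoted in the text.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression   # stand-in for the LTNN classifier
from sklearn.metrics import confusion_matrix

def evaluate_subject_split(features, labels, subjects,
                           n_repeats=25, n_classes=12, seed=0):
    """Random 50% subject split repeated n_repeats times (MSRC-12 protocol)."""
    rng = np.random.default_rng(seed)
    subj_ids = np.unique(subjects)
    accs = []
    cm_total = np.zeros((n_classes, n_classes))
    for _ in range(n_repeats):
        train_subj = rng.choice(subj_ids, size=len(subj_ids) // 2, replace=False)
        train = np.isin(subjects, train_subj)
        # Placeholder classifier; the paper trains the tensorized LTNN here.
        clf = LogisticRegression(max_iter=1000).fit(features[train], labels[train])
        pred = clf.predict(features[~train])
        accs.append(np.mean(pred == labels[~train]))
        cm_total += confusion_matrix(labels[~train], pred, labels=np.arange(n_classes))
    # Row-normalized confusion matrix: per-class prediction probabilities (cf. Fig. 15)
    row_sums = np.maximum(cm_total.sum(axis=1, keepdims=True), 1)
    return np.mean(accs), np.max(accs), cm_total / row_sums

# Hypothetical usage with random data standing in for the 1892-D covariance features
X = np.random.randn(594, 1892)
y = np.random.randint(0, 12, size=594)
subj = np.random.randint(0, 30, size=594)
mean_acc, best_acc, cm = evaluate_subject_split(X, y, subj)
```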

VII. CONCLUSION

This paper introduces a tensorized formulation for compressing neural networks during training. By reshaping neural network weight matrices into high-dimensional tensors with a low-rank decomposition, significant neural network compression can be achieved with maintained accuracy. A layer-wise training algorithm for the tensorized multilayer neural network is further introduced based on the modified alternating least-squares (MALS) method. The proposed LTNN algorithm provides state-of-the-art results on various benchmarks with a significant neural network compression rate, and the accuracy can be further improved by fine-tuning with backward propagation (BP). For the MNIST benchmark, LTNN shows a 64× compression rate without accuracy drop. For the CIFAR-10 benchmark, LTNN achieves a 21.57× compression rate for the fully-connected layers with a 2.2% accuracy drop. Experiments on Imagenet-12 show that 33.63% top-1 and 13.15% top-5 error rates can be achieved with 8.8× compression. By adopting non-uniform weight quantization, the compression rate is further improved to 35.84× with a 13.95% top-5 error rate.

REFERENCES

[1] G. E. Hinton, S. Osindero, and Y.-W. Teh, “A fast learning algorithm for deep belief nets,” Neural computation, vol. 18, no. 7, pp. 1527–1554, 2006. [2] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in International conference on artificial intelligence and statistics, 2010, pp. 249–256. [3] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, A. Senior, P. Tucker, K. Yang, Q. V. Le et al., “Large scale distributed deep networks,” in Advances in neural information processing systems, 2012, pp. 1223–1231. [4] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding,” arXiv preprint arXiv:1510.00149, 2015. [5] S. Han et al., “Ese: Efficient speech recognition engine with sparse lstm on fpga,” in ACM/SIGDA FPGA, 2017, pp. 75–84. [6] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, “Quantized neural networks: Training neural networks with low precision weights and activations,” arXiv preprint arXiv:1609.07061, 2016. [7] A. Davis and I. Arel, “Low-rank approximations for conditional feedforward computation in deep neural networks,” arXiv preprint arXiv:1312.4461, 2013. [8] I. Hubara, D. Soudry, and R. E. Yaniv, “Binarized neural networks,” arXiv preprint arXiv:1602.02505, 2016. [9] S. Han, J. Pool, S. Narang, H. Mao, S. Tang, E. Elsen, B. Catanzaro, J. Tran, and W. J. Dally, “Dsd: Regularizing deep neural networks with dense-sparse-dense training flow,” arXiv preprint arXiv:1607.04381, 2016. [10] A. Novikov, D. Podoprikhin, A. Osokin, and D. Vetrov, “Tensorizing Neural Networks,” ArXiv e-prints, Sep. 2015. [11] D. Erhan, Y. Bengio, A. Courville, P.-A. Manzagol, P. Vincent, and S. Bengio, “Why does unsupervised pre-training help deep learning?” Journal of Machine Learning Research, vol. 11, no. Feb, pp. 625–660, 2010. [12] G. Taylor, R. Burmeister, Z. Xu, B. Singh, A. Patel, and T. Goldstein, “Training neural networks without gradients: A scalable admm approach,” in International Conference on Machine Learning, 2016, pp. 2722–2731. [13] C. Plahl, T. N. Sainath, B. Ramabhadran, and D. Nahamoo, “Improved pre-training of deep belief networks using sparse encoding symmetric machines,” in Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on. IEEE, 2012, pp. 4165–4168. [14] K.
Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014. [15] Y. Bengio, P. Lamblin, D. Popovici, H. Larochelle et al., “Greedy layer-wise training of deep networks,” Advances in neural information processing systems, vol. 19, p. 153, 2007.


[16] W. Chen, J. Wilson, S. Tyree, K. Q. Weinberger, and Y. Chen, “Compressing convolutional neural networks in the frequency domain,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD '16, 2016, pp. 1475–1484. [17] H. Li, S. De, Z. Xu, C. Studer, H. Samet, and T. Goldstein, “Towards a deeper understanding of training quantized neural networks.” [18] Z. Liu, Y. Li, F. Ren, and H. Yu, “A binary convolutional encoder-decoder network for real-time natural scene text processing,” arXiv preprint arXiv:1612.03630, 2016. [19] M. Denil, B. Shakibi, L. Dinh, N. de Freitas et al., “Predicting parameters in deep learning,” in Advances in Neural Information Processing Systems, 2013, pp. 2148–2156. [20] S. Holtz, T. Rohwedder, and R. Schneider, “The alternating linear scheme for tensor optimization in the tensor train format,” SIAM Journal on Scientific Computing, vol. 34, no. 2, pp. A683–A713, 2012. [21] I. V. Oseledets, “Tensor-train decomposition,” SIAM Journal on Scientific Computing, vol. 33, no. 5, pp. 2295–2317, 2011. [22] M. T. Hagan, H. B. Demuth, M. H. Beale, and O. De Jesús, Neural network design. PWS publishing company Boston, 1996, vol. 20. [23] J. Tang, C. Deng, and G.-B. Huang, “Extreme learning machine for multilayer perceptron,” IEEE transactions on neural networks and learning systems, vol. 27, no. 4, pp. 809–821, 2016. [24] G. B. Huang, Q. Y. Zhu, and C.-K. Siew, “Extreme learning machine: theory and applications,” Neurocomputing, vol. 70, no. 1, pp. 489–501, 2006. [25] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015. [26] L. L. C. Kasun, Y. Yang, G.-B. Huang, and Z. Zhang, “Dimension reduction with extreme learning machine,” IEEE Transactions on Image Processing, vol. 25, no. 8, pp. 3906–3918, 2016. [27] I. V. Oseledets and S. Dolgov, “Solution of linear systems and matrix inversion in the tt-format,” SIAM Journal on Scientific Computing, vol. 34, no. 5, pp. A2718–A2739, 2012. [28] A. Vedaldi and K. Lenc, “Matconvnet: Convolutional neural networks for matlab,” in Proceedings of the 23rd ACM international conference on Multimedia. ACM, 2015, pp. 689–692. [29] M. Lichman, “UCI machine learning repository,” 2013. [Online]. Available: http://archive.ics.uci.edu/ml [30] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998. [31] J. Xue, J. Li, and Y. Gong, “Restructuring of deep neural network acoustic models with singular value decomposition.” in INTERSPEECH, 2013, pp. 2365–2369. [32] D. C. Cireşan, U. Meier, J. Masci, L. M. Gambardella, and J. Schmidhuber, “High-performance neural networks for visual object classification,” arXiv preprint arXiv:1102.0183, 2011. [33] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, 2015. [34] A. Krizhevsky, V. Nair, and G. Hinton, “The cifar-10 dataset,” 2014. [35] S. Fothergill, H. Mentis, P. Kohli, and S. Nowozin, “Instructing people for training gestural interactive systems,” in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 2012, pp. 1737–1746. [36] W. Li, Z. Zhang, and Z. Liu, “Action recognition based on a bag of 3d points,” in Computer Vision and Pattern Recognition Workshops (CVPRW). IEEE, 2010, pp. 9–14. [37] W. Chen, J. T. Wilson, S. Tyree, K. Q. Weinberger, and Y.
Chen, “Compressing neural networks with the hashing trick.” in ICML, 2015, pp. 2285–2294. [38] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014. [39] Z. Yang, M. Moczulski, M. Denil, N. de Freitas, A. Smola, L. Song, and Z. Wang, “Deep fried convnets,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1476–1483. [40] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus, “Exploiting linear structure within convolutional networks for efficient evaluation,” in Advances in Neural Information Processing Systems, 2014, pp. 1269–1277. [41] M. D. Collins and P. Kohli, “Memory bounded deep convolutional networks,” arXiv preprint arXiv:1412.1442, 2014. [42] J. Deng, A. Berg, S. Satheesh, H. Su, A. Khosla, and L. Fei-Fei, “Imagenet large scale visual recognition competition 2012 (ILSVRC2012),” 2012.

[43] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., “Tensorflow: A system for large-scale machine learning.” in OSDI, vol. 16, 2016, pp. 265–283. [44] M. E. Hussein, M. Torki, M. A. Gowayyed, and M. El-Saban, “Human action recognition using a temporal hierarchy of covariance descriptors on 3d joint locations.” in IJCAI, vol. 13, 2013, pp. 2466–2472. [45] S. Yang, C. Yuan, W. Hu, and X. Ding, “A hierarchical model based on latent dirichlet allocation for action recognition,” in ICPR. IEEE, 2014, pp. 2613–2618. [46] L. Zhou, W. Li, Y. Zhang, P. Ogunbona, D. T. Nguyen, and H. Zhang, “Discriminative key pose extraction using extended lc-ksvd for action recognition,” in Digital Image Computing: Techniques and Applications (DICTA), 2014 International Conference on. IEEE, 2014, pp. 1–8. [47] P. Wang, W. Li, C. Li, and Y. Hou, “Action recognition based on joint trajectory maps with convolutional neural networks,” arXiv preprint arXiv:1612.09401, 2016. [48] Y. Annadani, D. Rakshith, and S. Biswas, “Sliding dictionary based sparse representation for action recognition,” arXiv preprint arXiv:1611.00218, 2016. [49] J. Martens and I. Sutskever, “Learning recurrent neural networks with hessian-free optimization,” in Proceedings of the 28th International Conference on Machine Learning (ICML-11), 2011, pp. 1033–1040. [50] L. Xia, C.-C. Chen, and J. Aggarwal, “View invariant human action recognition using histograms of 3d joints,” in Computer Vision and Pattern Recognition Workshops (CVPRW). IEEE, 2012, pp. 20–27. [51] J. Wang, Z. Liu, J. Chorowski, Z. Chen, and Y. Wu, “Robust 3d action recognition with random occupancy patterns,” in Computer Vision–ECCV 2012. Springer, 2012, pp. 872–885. [52] B. B. Amor, J. Su, and A. Srivastava, “Action recognition using rate-invariant analysis of skeletal shape trajectories,” IEEE transactions on pattern analysis and machine intelligence, vol. 38, no. 1, pp. 1–13, 2016.

Hantao Huang (S'14) received the B.S. and Ph.D. degrees from the School of Electrical and Electronic Engineering, Nanyang Technological University (NTU), Singapore, in 2013 and 2018, respectively. Since 2018, he has been a staff engineer at MediaTek, Singapore, where he works on machine-learning algorithms, neural network compression and quantization for edge devices. His research interests include data analytics, machine-learning algorithms, and low-power system design. He is a student member of the IEEE.

Hao Yu (M'06-SM'14) obtained his B.S. degree from Fudan University (Shanghai, China) and his M.S. and Ph.D. degrees from the Electrical Engineering Department at UCLA, USA, with a major in integrated circuits and embedded computing. He was a senior research staff member at Berkeley Design Automation (BDA), and later with the School of Electrical and Electronic Engineering at Nanyang Technological University (NTU), Singapore. Dr. Yu has 208 peer-reviewed and refereed publications [conference (144) and journal (64)], 6 books, 8 book chapters, 1 best paper award in ACM Transactions on Design Automation of Electronic Systems (TODAES), 1 Springer PhD thesis award (as advisor), 3 best paper award nominations (DAC'06, ICCAD'06, ASP-DAC'12), 3 student paper competition finalists (SiRF'13, RFIC'13, IMS'15), 2 keynote talks, 1 inventor award from the Semiconductor Research Corporation (SRC), and 20 granted patents. His main research interests are smart energy-efficient data analytics, links and sensors, with multi-million government (PI of 1 NRF-CRP, 2 MOE-TIER2, etc.) and industry (Intel, Huawei, BGI, etc.) funding. His industry work at BDA was also recognized with an EDN magazine innovation award and multi-million venture capital funding. He is a senior member of the IEEE and a member of the ACM.