Neurocomputing (2018) 1–11
3D separable convolutional neural network for dynamic hand gesture recognition

Zhongxu Hu a, Youmin Hu a, Jie Liu c, Bo Wu a,∗, Dongmin Han b, Thomas Kurfess b

a School of Mechanical Science and Engineering, Huazhong University of Science and Technology, Wuhan, China
b George W. Woodruff School of Mechanical Engineering, Georgia Institute of Technology, Atlanta, USA
c School of Hydropower and Information Engineering, Huazhong University of Science and Technology, Wuhan, China
Article history: Received 26 December 2017; Revised 14 August 2018; Accepted 18 August 2018; Available online xxx. Communicated by Jun Yu.

Keywords: Hand gesture recognition; 3D separable CNN; Skip connection; Layer-wise learning rate
Abstract

Dynamic hand gesture recognition, as an essential part of Human–Computer Interaction and an important way to realize Augmented Reality, has been attracting the attention of many scholars while still presenting many challenges. Recently, aware of the deep convolutional neural network's excellent performance, many scholars have applied it to gesture recognition and obtained promising results. However, until now not enough attention has been paid to the number of parameters in the network and the amount of computation needed. In this paper, a 3D separable convolutional neural network is proposed for dynamic gesture recognition. This study aims to make the model less complex without compromising its high recognition accuracy, so that it can more easily be deployed to augmented reality glasses in the future. Through the application of skip connection and layer-wise learning rate, the undesired gradient dispersion caused by the separation operation is solved and the performance of the network is improved. The fusion of feature information is further promoted by a shuffle operation. In addition, a dynamic hand gesture library is built with HoloLens, which proves the feasibility of the proposed method.

© 2018 Elsevier B.V. All rights reserved.
1. Introduction

Reliable hand gesture recognition is an important way to realize Augmented Reality (AR). Interaction with existing AR glasses, such as HoloLens, is mainly done through the view cursor and the hand gesture "click". However, this approach is inflexible, since it does not exploit the diverse expressions of the hand's numerous gestures. Using a variety of dynamic gestures can greatly enhance the convenience of human–machine interaction, and vision-based gesture recognition rids users of the constraints of an input device, with advantages such as low adhesion, low invasiveness and a more intuitive experience. Gesture recognition, as an important part of human–computer interaction, has been studied by scholars for decades, yet many challenges remain [1–3]. These challenges exist for the following reasons. First, human gestures are flexible and diverse: even when the same person performs the same gesture twice, the two instances differ, and for vision-based recognition, differences in skin color, person, lighting and other factors lead to large variations in the collected images. The second is
∗ Corresponding author. E-mail addresses: [email protected], [email protected] (B. Wu).
the change of action speed: the speed of an action is constantly changing. The third is the identification of negative cases: at the beginning and end of an action, or in the transitions between stages of an action, the negative cases are very similar to the actions to be classified. Lastly, there is the recognition time: the expected feedback time is usually less than 100 ms [8], which requires the processing to be fast.

Hidden Markov Models [13,14], Conditional Random Fields [17,18] and Support Vector Machines [15,16] have all been widely used in dynamic gesture recognition, and many hand-crafted features have been applied to improve recognition accuracy. In recent years, depth sensors have been combined with RGB sensors to enrich the input information and improve the robustness of recognition [19–21]. These methods have their merits, but gesture recognition remains a challenging problem. Recently, deep neural networks have been successfully applied to many recognition problems, and their strong recognition ability has drawn wide attention. The convolutional neural network (CNN) skips the complex pre-processing of images, so it is widely used for image data. Many scholars have begun to apply deep neural networks to gesture recognition and have obtained good results [2]. When the input of dynamic gesture recognition is a sequence of images, there are currently two main methods: one is to combine a CNN with an RNN/LSTM [22–24];
the other is to use a 3D CNN [7,8]. The RNN/LSTM in the former method is fully connected, which results in more network parameters and an increased amount of computation; in contrast, a 3D CNN uses fewer neurons. In this paper, a 3D separable convolutional neural network is proposed for two reasons: it is effective and efficient for dynamic gesture recognition, and a compact network model will be easier to deploy on AR glasses in the future. Inspired by the work of Howard and others [6], the standard convolution process of a 3D CNN is separated to decrease the number of the network's parameters and reduce the amount of computation. The strategies of skip connection and layer-wise learning rates are used to solve the problem of gradient dispersion caused by the separation. We have also built a dynamic gesture dataset to verify the proposed method.

The paper is organized as follows. Section 2 discusses related work. Section 3 introduces the proposed methods. Section 4 explains the details of the method and the experimental comparison, and discusses the experimental results. Section 5 presents the conclusion and future work.

2. Related work

Since gesture recognition has been studied by many scholars, there are now many related methods; this paper mainly reviews work on neural networks and network compression. The recognition quality of conventional gesture recognition methods often depends on the features chosen by hand, and many hand-crafted spatio-temporal features [37] and multimodal features [38,41] have been applied to gesture and behavior recognition. The recent trend is to use deep neural networks to learn the features. A convolutional neural network was used to learn feature matching in [42]. Yu and Hong et al. used the Deep Autoencoder Model to extract features for human pose recognition [34,39,40]. Ji et al. were among the first researchers to apply the 3D convolutional neural network to human behavior recognition [4]; they used 3D convolution to extract features in the temporal and spatial dimensions of human behavior and added priori knowledge to the network to improve its performance. Tran et al. found that 3D convolutional networks are more suitable for extracting temporal and spatial features than 2D convolutional networks [25]. Pavlo et al. designed two 3D convolutional neural networks for gesture recognition. First, they used a dual-channel 3D CNN to extract spatio-temporal features for the final prediction [7]. Second, to further exploit information in the temporal dimension, a recurrent 3D convolutional neural network was used for dynamic gesture detection and classification, in which the 3D CNN extracts local temporal features, the RNN models the overall feature, and the softmax layer finally outputs the probability of the gesture falling into each category [8]. Diba et al. used a dual-stream CNN structure with optical flow and RGB images as inputs to make the network learn the motion features of the video, and classified those features with an SVM [26]. The idea of a dual-stream structure was also used by Ng et al. [22], Simonyan and Zisserman [23], Feichtenhofer et al. [24] and Ma et al. [27], all of whom used two input branches, RGB images and optical flow.
On the basis of the original RGB image, adding the optical flow graph injects certain priori knowledge into the network, so that the network concentrates on moving objects, thus improving the recognition performance. In this paper we adopt a similar idea, but instead of optical flow we use the frame difference because it is faster to compute. In contrast to methods that increase accuracy by increasing the complexity of the network's structure, we focus on making the network compact while maintaining high accuracy.
The existing network compression and acceleration methods mainly target 2D CNNs, and they can be roughly divided into two categories by what is compressed: the weights of the network and the structure of the network. They can also be divided into two categories with respect to computing speed: those that only reduce size and those that reduce size while increasing speed. Regarding weights, typical works are the Deep Compression model [28] and XNOR-Net [29]. The former reduces the weights by pruning, quantization and Huffman coding, but does not take the speed of calculation into account; the latter compresses the weights by binarization, which improves the calculation speed but compromises the network's precision. Regarding network structure, Iandola et al. proposed SqueezeNet [5], which reduces the number of weights by optimizing the network's structure without considering the amount of computation, and which leads to three design strategies: 1. replacing 3 × 3 convolution kernels with 1 × 1 kernels; 2. reducing the number of input channels; 3. placing down-sampling as close to the end of the network as possible. These three strategies also inspired the design of our network's structure. Hinton et al. proposed a Distilling algorithm similar to network transfer [30]; its basic idea is to let a small network be taught by a large network with good performance until the small network performs as well as the large one, thus achieving network compression, while network acceleration is left out. Google's recently proposed MobileNet is a compact network architecture aimed at mobile deployment [6]. Considering that the computing resources of mobile devices are limited, it borrows ideas from factorized convolution and divides the standard convolution into two parts, depth-wise convolution and point-wise convolution, which can theoretically increase the network's efficiency by 8 to 9 times. Face++ proposed ShuffleNet, which shuffles the input groups and increases the network's learning ability [9]. From the above discussion we can see that, in order to improve the accuracy of gesture recognition, many researchers have made their models very complex, which greatly increases the size of the network. Our goal here is to make the network compact, so a 3D CNN is used in this paper. To further compress the 3D CNN, this paper draws on ideas and methods from the compression of 2D CNNs, takes into account both the size of the network and the computational complexity, and finally proposes the 3D separable convolutional neural network.

3. Method

In this chapter, we discuss the structure and training details of the proposed model.

3.1. Pre-processing

Our research object is dynamically changing hand gestures, so the input of the model must be a sequence of n consecutive frames; in this paper n = 8. This choice is due to several considerations: first, this value is sufficient to express a dynamic gesture; second, it is not so large as to inflate the amount of calculation; and third, the value is preferably a power of 2. To improve the accuracy of the model's prediction, many researchers introduce priori knowledge, that is, traditional moving object detection methods are used to pre-process the raw images. These methods mainly include: 1. optical flow; 2. Motion History Image (MHI); 3. frame difference. We compare these methods in Fig. 1.
It can be seen from Fig. 1 that the optical flow method and the frame difference method can effectively filter out useless
Fig. 1. Preprocessing. The first row shows the raw 8-frame consecutive images; the other three rows are obtained by the optical flow method, the frame difference method, and MHI, respectively. Because all three methods operate on the difference between adjacent frames, the original input is processed into 7 frames. It can be seen that the frame difference method obtains a relatively clear contour of the hand movement and better eliminates the interference in the background.
Fig. 2. Comparison of 2D convolution and 3D convolution. (a) is a standard 2D convolution process with a 4-D convolution kernel, i.e. (N, H, W, C), where N is the number of kernels, H is the height of the kernel, W is the width, and C is the number of input channels; (b) is the standard 3D convolution process, whose kernel is a 5-D tensor, i.e. (N, D, H, W, C), with an additional temporal dimension D. It can be seen that if the total numbers of input and output feature maps are the same, the 3D convolution kernel needs relatively few parameters.
background noise and highlight the moving objects, whereas the MHI method retains part of the background. Because the sensor used is the RGB camera on AR glasses, the wearer's head inevitably moves, which moves the entire RGB sensor, and the frame difference method cannot filter out all the background noise. Such noise is small in a single frame-difference image, but the MHI method accumulates the motion of several consecutive frames, so the later MHI images contain more background noise. Comparing the optical flow method with the frame difference method, the latter extracts more of the key contour information, which helps the subsequent identification, and it is also much faster: for the sample in Fig. 1, the optical flow method takes 0.06 s, while the frame difference method takes only 0.004 s. In summary, the input of the model designed in this paper changes from 8 raw frames to 1 raw frame plus 7 frame-difference images. The purpose is to filter out useless background information so that the model concentrates on the moving objects.
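To make this step concrete, the following NumPy sketch converts an 8-frame clip into one raw frame plus seven difference images; the use of the absolute difference and a grayscale input are our assumptions about the exact formulation.

```python
# Sketch of the frame-difference pre-processing described above
# (absolute difference and grayscale input are assumptions).
import numpy as np

def preprocess_clip(frames):
    """frames: (8, H, W) array -> (8, H, W): first raw frame + 7 difference images."""
    frames = frames.astype(np.float32)
    diffs = np.abs(frames[1:] - frames[:-1])          # 7 adjacent-frame differences
    return np.concatenate([frames[:1], diffs], axis=0)

clip = np.random.randint(0, 256, size=(8, 120, 160)).astype(np.uint8)  # dummy clip
processed = preprocess_clip(clip)                     # shape (8, 120, 160)
```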
3.2. Network architecture

The purpose of this paper is to obtain a compact network structure, so there are two optimization goals: 1. reduce the number of the model's parameters; 2. reduce the amount of computation.
3.2.1. 3D CNN vs. 2D CNN

A set of sequential images can be treated as a whole, with the different frames set as different input channels, and fed into a 2D CNN, but the temporal features will then be lost. In a 3D CNN this is not the case, and the structure of a 3D CNN is also more compact than that of a 2D CNN. The computational cost of a standard 2D convolution process is:
$$D_K \cdot D_K \cdot M' \cdot N' \cdot D_F \cdot D_F \tag{1}$$

where $M'$ is the number of input channels, $N'$ is the number of output channels, $D_K \times D_K$ is the size of the kernel, and $D_F \times D_F$ is the size of the output feature map. The computational cost of a 3D convolution process is:

$$D_k \cdot D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F \cdot D_f \tag{2}$$

where $D_k$ is the temporal size of the kernel, $D_f$ is the number of output frames, $M$ is the number of input channels and $N$ is the number of output channels. The number of parameters of the 2D convolution is $D_K \cdot D_K \cdot M' \cdot N'$. Suppose the total number of feature maps of the input and output layers and the kernel size are fixed, i.e., $M'$, $N'$ and $D_K \times D_K$ are fixed, and the convolution becomes three-dimensional, so that the input and output layers are divided equally along the temporal dimension: the input layer satisfies $c \cdot M = M'$, where $c$ is the number of input frames, and the output layer satisfies $D_f \cdot N = N'$. The number of parameters of the 3D convolution is then $D_k \cdot D_K \cdot D_K \cdot M \cdot N$, where $D_k \le c$.
Fig. 3. 3D CNN structure. The network consists of a frame difference layer, 10 3D convolution layers, a 3D avg-pooling layer and a fully connected layer; finally, the probability of the object falling into each category is obtained using softmax. The activation function of the 3D convolution layers is Leaky ReLU with a leaky coefficient of 0.2.
Fig. 4. 3D separable convolution. The 3D separable convolution is divided into two parts: 3D depth-wise and 3D point-wise. In the 3D depth-wise part, the kernel size is 1 × D × H × W × C; each kernel is applied only to the corresponding channel of the corresponding input feature, and the convolution results are not combined, so the number of output channels is D × C. In the 3D point-wise part, the kernel size is N × 1 × 1 × 1 × C, and the calculation is the same as in standard 3D convolution.
The ratio of the number of parameters is then:

$$\frac{D_k \cdot D_K \cdot D_K \cdot M \cdot N}{D_K \cdot D_K \cdot M' \cdot N'} = \frac{D_k}{c \cdot D_f} \le 1 \tag{3}$$

and the ratio of the computational cost becomes:

$$\frac{D_k \cdot D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F \cdot D_f}{D_K \cdot D_K \cdot M' \cdot N' \cdot D_F \cdot D_F} = \frac{D_k}{c} \le 1 \tag{4}$$
It can be seen that both the number of parameters and the amount of computation are reduced, as shown in Fig. 2. Because the input of the network in this paper is a set of sequential images, a 3D CNN is more in line with our needs.

3.2.2. 3D CNN

We propose a three-dimensional convolutional neural network for dynamic gesture recognition, shown in Fig. 3. Its structure includes a frame difference layer for filtering background information, a deep 3D CNN for extracting spatio-temporal features, and a softmax layer for predicting the probability of the object belonging to each class. The network's structure has the following characteristics: 1. to reduce the number of weights, the convolution kernels are sized 3 × 3 × 3; 2. the down-sampling of the temporal dimension occurs as close to the end of the network as possible, so that most convolutional layers can have more feature maps; 3. down-sampling is performed by increasing the stride instead of pooling, so the network can learn the down-sampling process; 4. the number of channels is increased before the size of the feature maps is reduced, so the network can extract more feature information.

We now introduce the operation of the network in detail. We define a video clip as $C \in V^{w \times h \times c \times m}$, $C = \{C_0, \ldots, C_{m-1}\}$, where $m \ge 1$ is the number of frames, the subscript is the time index, $w \times h$ is the image size, and $c$ is the number of channels. The output of the frame difference layer is $L_0 = (C_0, D_0, \ldots, D_{m-2})$, where $D_i = Diff(C_i, C_{i+1})$ and $Diff()$ is the frame difference function. The $m-1$ difference images are obtained from the initial $m$-frame clip, and the first frame of the clip is then concatenated with them, so the input is still $m$ frames. In the 3D CNN layers, the calculation is as follows:
$$l_{ij}^{xyz} = f\left(\sum_m \sum_{p=0}^{P_i-1} \sum_{q=0}^{Q_i-1} \sum_{r=0}^{R_i-1} w_{ijm}^{pqr}\, l_{(i-1)m}^{(x+p)(y+q)(z+r)} + b_{ij}\right) \tag{5}$$
where $f(\cdot)$ is the activation function (Leaky ReLU in this paper; other activation functions can also be used [43,44]), $l_{ij}^{xyz}$ is the value at position $(x, y, z)$ in the $j$-th feature map of the $i$-th layer, $w_{ijm}^{pqr}$ is the value at $(p, q, r)$ of the convolution kernel connected to the $m$-th feature map of the previous layer, $R_i$ is the temporal extent of the 3D convolution kernel, $P_i$ and $Q_i$ are its height and width, and $b_{ij}$ is the bias. The formula of the 3D AvgPool layer is basically the same, except that the values of $w$ are fixed and there is no bias. It is worth noting that the main purpose of the 3D AvgPool layer is to reduce the number of parameters: a fully connected layer often has too many parameters, and directly reshaping the features of the final convolutional layer into a one-dimensional vector would result in a huge amount of computation. The 3D AvgPool layer therefore condenses the feature maps first, and the classification result is then obtained through the fully connected layer.
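For illustration, below is a minimal TensorFlow/Keras sketch of a network in the spirit of Fig. 3. It is shortened to four convolutional blocks, and the layer widths, strides, 112 × 112 input resolution and six output classes (five gestures plus the negative class) are our assumptions rather than the paper's exact configuration.

```python
# Hypothetical sketch of a Fig. 3-style 3D CNN (widths/strides/resolution assumed).
import tensorflow as tf
from tensorflow.keras import layers, models

def build_3d_cnn(num_classes=6, frames=8, height=112, width=112, channels=3):
    # Input: 1 raw frame + 7 frame-difference images stacked along time.
    inputs = layers.Input(shape=(frames, height, width, channels))
    x = inputs
    # Stack of 3x3x3 convolutions; down-sampling via stride, not pooling,
    # with temporal down-sampling pushed toward the end.
    for filters, strides in [(32, (1, 2, 2)), (64, (1, 2, 2)),
                             (128, (1, 2, 2)), (256, (2, 2, 2))]:
        x = layers.Conv3D(filters, 3, strides=strides, padding='same')(x)
        x = layers.LeakyReLU(alpha=0.2)(x)   # leaky coefficient 0.2, as in the paper
    # Global average pooling replaces a large flatten + dense stage.
    x = layers.GlobalAveragePooling3D()(x)
    outputs = layers.Dense(num_classes, activation='softmax')(x)
    return models.Model(inputs, outputs)

model = build_3d_cnn()
model.summary()
```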
Fig. 5. 3D depth-wise with shuffle operation. (a) is the standard 3D depth-wise process; (b) is the equivalent form; (c) is the 3D depth-wise process with the shuffle operation added.
Fig. 6. 3D separable CNN with skip connections. In contrast to the 3D CNN shown in Fig. 3, the 3D CNN layers other than the first layer are decomposed into 3D depth-wise and 3D point-wise processes, while the remaining layers are basically the same. To ensure that the gradient flows, four skip connections are added at the locations shown in the figure.
3.2.3. Architecture optimization

To further compress the network's structure, the most direct way is to reduce the values of the factors in Eq. (2). We can start by using a relatively small convolution kernel, i.e., a relatively small value of $D_k \cdot D_K \cdot D_K$; to ensure that spatio-temporal features can still be extracted, the kernel size is 3 × 3 × 3 in this paper. We could also reduce the number and size of the feature maps, i.e., use small values of $M$, $N$ and $D_F \cdot D_F \cdot D_f$, but to preserve the diversity of features this is not done. In the work of Howard and others [6,10,11], the computational cost is reduced by decomposing the standard 2D convolution into depth-wise and point-wise processes. We extend that idea to 3D convolution: the standard 3D convolution is decomposed into a 3D depth-wise and a 3D point-wise process. As shown in Fig. 4, in the 3D depth-wise phase each convolution kernel operates only on its corresponding input channel, and the convolution results are not combined; the results of adjacent frames are then concatenated. In the point-wise phase, a 1 × 1 × 1 convolution kernel is used. Thus the calculation process becomes:
$$d_{ij}^{xyz} = f\left(\sum_{p=0}^{P_i-1} \sum_{q=0}^{Q_i-1} w_{ih}^{pqk}\, l_{(i-1)h}^{(x+p)(y+q)(z+k)} + b_{ij}\right) \tag{6}$$

$$k = \left\lceil \frac{j}{m} \right\rceil, \qquad h = j - (k-1)\, m \tag{7}$$

$$l_{ij}^{xyz} = f\left(\sum_m w_{ijm}\, d_{im}^{xyz} + b_{ij}\right) \tag{8}$$
So the computational cost after the decomposition becomes:

$$D_k \cdot D_K \cdot D_K \cdot M \cdot D_F \cdot D_F \cdot D_f + D_k \cdot M \cdot N \cdot D_F \cdot D_F \cdot D_f \tag{9}$$

Compared with the standard 3D convolution, the ratio of computational cost is:

$$\frac{D_k \cdot D_K \cdot D_K \cdot M \cdot D_F \cdot D_F \cdot D_f + D_k \cdot M \cdot N \cdot D_F \cdot D_F \cdot D_f}{D_k \cdot D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F \cdot D_f} = \frac{1}{N} + \frac{1}{D_K \cdot D_K} \tag{10}$$
In this paper the 3 × 3 × 3 convolution kernel is used and $N \ge 32$, so the amount of computation is reduced by a factor of about 8 to 9 (the ratio in Eq. (10) approaches $1/(D_K \cdot D_K) = 1/9$ as $N$ grows). The number of parameters is reduced in the ratio:
$$\frac{D_k \cdot D_K \cdot D_K \cdot M + M \cdot N}{D_k \cdot D_K \cdot D_K \cdot M \cdot N} = \frac{1}{N} + \frac{1}{D_k \cdot D_K \cdot D_K} \tag{11}$$
so the number of parameters decreases by a factor of about 10 to 20 (for example, with $N = 32$, $1/32 + 1/27 \approx 0.068$, i.e., roughly 15 times). The decomposition thus greatly reduces both the number of parameters and the amount of computation; at the same time, the number of layers is doubled, so the nonlinearity of the network is enhanced, and deep networks usually perform better than shallow ones. However, in the 3D depth-wise phase each output channel is related only to its corresponding input channel, which leads to poor information flow and limits the expressive power. To solve this problem, we shuffle the input features before the depth-wise convolution, as shown in Fig. 5. The same idea is used in the work of Zhang et al. [9]; it makes the information in the feature maps evenly fused and better learned.
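One possible realization of such a block is sketched below in TensorFlow/Keras. It is a simplification: the depth-wise stage keeps the channel count rather than expanding it to D × C as in Fig. 4, and the group count of the shuffle is an assumption.

```python
# Hypothetical sketch of a 3D depth-wise + point-wise block with channel shuffle.
import tensorflow as tf
from tensorflow.keras import layers

def channel_shuffle(x, groups):
    # Reshape-transpose-reshape trick: interleave channels across groups.
    b, d, h, w, c = x.shape                       # static shapes except the batch
    x = tf.reshape(x, [-1, d, h, w, groups, c // groups])
    x = tf.transpose(x, [0, 1, 2, 3, 5, 4])       # swap group and sub-channel axes
    return tf.reshape(x, [-1, d, h, w, c])

def separable_block_3d(x, out_channels):
    c = x.shape[-1]
    x = channel_shuffle(x, groups=4)              # fuse information across channels
    # 3D depth-wise: groups == channels, one 3x3x3 kernel per input channel.
    x = layers.Conv3D(c, 3, padding='same', groups=c)(x)
    x = layers.LeakyReLU(alpha=0.2)(x)
    # 3D point-wise: 1x1x1 convolution recombines the channels.
    x = layers.Conv3D(out_channels, 1, padding='same')(x)
    return layers.LeakyReLU(alpha=0.2)(x)

inputs = layers.Input(shape=(8, 112, 112, 32))    # assumed input shape
outputs = separable_block_3d(inputs, out_channels=64)
```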
Fig. 7. The skip connection.

3.3. Training

We choose the negative log-likelihood as the cost function:
$$L(t, y) = -\sum_i t_i \log y_i = -\sum_i t_i \log f(x_i, \{w\}) \tag{12}$$
The goal of training is to optimize the weights $\{w\}$ to minimize the cost function $L$, where $x_i$ is the input, $y_i = f(x_i, \{w\})$ is the predicted probability for each class, and $t_i$ is the one-hot label. The network can then be trained using stochastic gradient descent or one of its optimized variants (Fig. 6). Because we adopt the 3D separable convolution, the number of layers is double that of the standard 3D CNN. Although the enhanced non-linearity gives the network a better classification effect, it brings the problem of gradient dispersion: the gradient of the shallow layers becomes very low while the gradient of the deep layers is relatively high, as shown in Fig. 8. When the learning rate is large, the results computed in the shallow layers change drastically; when it is small, the deep layers are not effectively updated. In both cases, the network fails to converge. To solve the problem of gradient dispersion, we first use the skip connection method, as shown in Fig. 7, where $x_l$ represents the input of a layer and $F(x_l, \{w_i\})$ represents the output after several layers. A successful case of this method is the Residual network [31]. Since back propagation follows the chain rule, the gradient decays every time it passes through a layer; with skip connections, the gradient in the deep layers can skip several layers and pass directly to the shallow layers, thus counteracting gradient dispersion. Moreover, a residual network is a combination of many parallel sub-networks: the whole residual network is effectively a multi-layer voting system, which improves the accuracy of the network's output. The skip connection method requires two layers of the same size, but only a small number of layers meet this requirement in this paper. Therefore, this method alleviates the problem of gradient dispersion but does not completely solve it, as shown in Fig. 8: the gradient of the deep layers is significantly improved after the skip connections are added, which means that skip connections effectively promote the flow of the gradient, but the gradient is still small on the whole. We therefore also use the layer-wise learning rate method, in which different layers have different learning rates: the learning rate of the shallow layers is set relatively large and that of the deep layers low, so that every layer of the network can be properly trained. To avoid having too many hyper-parameters to adjust, the network is divided into several parts, with each part sharing one learning rate. In the end, we use both skip connection and layer-wise learning rate, which not only makes the network converge but also improves the accuracy of its predictions.
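A minimal sketch of the layer-wise learning rate idea follows: gradients are scaled per layer group before the update. The rates echo Table 1, but the name-based grouping rule is our assumption.

```python
# Hypothetical layer-wise learning rate via per-group gradient scaling.
# Group names and matching-by-variable-name are assumptions; the rates
# follow the spirit of Table 1 (larger for shallow, smaller for deep layers).
import tensorflow as tf

optimizer = tf.keras.optimizers.SGD(learning_rate=1.0)  # base rate folded into scales
group_lr = {'group1': 0.02, 'group2': 0.01, 'group3': 0.005, 'group4': 0.0005}

def lr_for(variable):
    for group, lr in group_lr.items():
        if group in variable.name:     # assumes layers are named by group
            return lr
    return group_lr['group4']          # default: the smallest (deepest) rate

def train_step(model, x, y, loss_fn):
    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(x, training=True))
    grads = tape.gradient(loss, model.trainable_variables)
    scaled = [g * lr_for(v) for g, v in zip(grads, model.trainable_variables)]
    optimizer.apply_gradients(zip(scaled, model.trainable_variables))
    return loss
```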
4. Experiment

4.1. Dataset

The goal of this paper is to recognize dynamic gestures for AR glasses such as HoloLens. To our knowledge, since commercial AR glasses have been released only recently, there is no publicly available hand gesture dataset for them. We therefore built our own gesture dataset, called the AMI Hand Gesture Dataset, which we will make public. For convenience, five kinds of dynamic hand gestures were designed: up, down, right, left and open, as shown in Fig. 10. The set was captured with the RGB camera of a HoloLens from 10 people. To improve the accuracy of the model's prediction in practice, the dataset includes images with both simple and complex backgrounds. In total, 110 K training samples and 10 K test samples were produced, each consisting of 8 consecutive images. The ratio of the number of negative samples to the number of the five kinds of gesture samples is about 30:1, because the dataset was collected by simulating the normal usage scenario of the user: after the positive samples are selected, the remaining samples are taken as negative, so this ratio is consistent with the actual scenario. The numbers of positive and negative samples thus differ greatly, in line with the real-world situation; however, this imbalance is detrimental to the network's training, so a resampling technique is used during training to expand the five kinds of gesture samples 30 times. To get an overall picture of the gesture dataset, we use t-SNE [32] to visualize 1.2 K samples selected from the set, as shown in Fig. 9; other visualization methods, such as LDFA [35] and PCA [36], are also available. It can be seen that the different types of gestures (including the negative samples) cluster in different blocks of the space, but the blocks intersect, so this remains a complex nonlinear problem and a deep network model is needed.
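The visualization can be reproduced along the following lines with scikit-learn's t-SNE; the flattening of each clip into a single vector and the dummy data are our assumptions.

```python
# Sketch of the t-SNE visualization (clip flattening and dummy data assumed).
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Stand-in for the 1.2 K selected clips and their labels 0-5
# (five gestures plus the negative class). Replace with the real dataset.
samples = np.random.rand(1200, 8, 60, 80).astype(np.float32)
labels = np.random.randint(0, 6, size=1200)

X = samples.reshape(len(samples), -1)        # flatten each clip into one vector
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

plt.scatter(emb[:, 0], emb[:, 1], c=labels, cmap='tab10', s=5)
plt.title('t-SNE of AMI Hand Gesture samples')
plt.show()
```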
Fig. 8. Gradient distribution of each layer. The figure shows the average gradient distribution of each layer before and after adding the skip connections, for the first 100 iterations. The horizontal axis is the iteration number and the vertical axis is the logarithm of the gradient.
Fig. 9. t-SNE visualization of the AMI Hand Gesture Dataset. We proportionally selected a total of 1.2 K samples from every category of the gesture set and visualized them with t-SNE. As shown above, each category of samples clusters in a different block of the space, but the blocks intersect, so a complex nonlinear model is needed.
4.2. Evaluation criterion

The goal of this paper is to design a network with both high accuracy and a compact structure, so it is evaluated from three aspects: accuracy, model size and computational cost. Five criteria of accuracy are used: confusion matrix, average accuracy, Precision, Recall and F1. The accuracy of predicting each category of hand gesture can be read from the confusion matrix. The average accuracy is an overall evaluation of the network, which gives an overall understanding of its performance; however, because the negative samples heavily outnumber the positive ones, the average accuracy alone is not very informative. Since we are more concerned with the prediction of the five positive gestures, the evaluation of this minority class is done using Precision, Recall and F1. These three criteria are normally used for binary classification, but they apply here when the five gestures are considered as a single positive class. During the experiments we found that each of the five gestures was either predicted correctly or predicted as negative, so it is feasible to treat the five gestures as a whole.
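For concreteness, the following sketch computes these criteria with scikit-learn, binarizing the labels so that the five gestures form a single positive class as described above; the toy labels are illustrative only.

```python
# Sketch: confusion matrix and binary precision/recall/F1, gestures as one class.
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# y_true, y_pred: integer labels, 0 = negative, 1..5 = the five gestures.
y_true = np.array([0, 0, 3, 1, 5, 0, 2])   # toy example values
y_pred = np.array([0, 0, 3, 0, 5, 0, 2])

cm = confusion_matrix(y_true, y_pred)       # per-category accuracy readout

# Binarize: any gesture counts as positive, 'negative' stays negative.
t_bin = (y_true > 0).astype(int)
p_bin = (y_pred > 0).astype(int)
print(precision_score(t_bin, p_bin), recall_score(t_bin, p_bin), f1_score(t_bin, p_bin))
```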
Through these five criteria, the accuracy of the model can be fully evaluated while the influence of class imbalance is avoided.

4.3. 3D CNN vs. modified Inception V3

To demonstrate the validity of 3D CNN for sequential data, this section compares the 3D CNN designed in Section 3.2.2 with the classic Inception V3 network [12], a 2D CNN. To adapt Inception V3 to the task proposed in this paper, a small number of changes were made to it, mainly the following:
Please cite this article as: Z. Hu et al., 3D separable convolutional neural network for dynamic hand gesture recognition, Neurocomputing (2018), https://doi.org/10.1016/j.neucom.2018.08.042
JID: NEUCOM 8
ARTICLE IN PRESS
[m5G;September 7, 2018;20:31]
Z. Hu et al. / Neurocomputing 000 (2018) 1–11
Fig. 10. AMI Hand Gesture Dataset. As shown on the left of the figure, we use the HoloLens' RGB sensor to collect five kinds of gestures. The sequence images of the five gestures, shown on the right, are up, down, left, right and open. To enhance the model's performance in real-world environments, the image data have different backgrounds.

Table 1
Learning rates. To solve the gradient dispersion problem of the 3D Separable CNN, the layer-wise learning rate strategy is adopted, with a different learning rate for each group of layers. Since the gradients change after adding the skip connections, the rates are adjusted accordingly. The specific values are listed below.

Model | Group1 (conv3d_1:conv3d_3 pw) | Group2 (conv3d_4 dw:conv3d_5 pw) | Group3 (conv3d_6 dw:conv3d_7 pw) | Group4 (conv3d_8 dw:fc)
3D Separable CNN | 0.04 | 0.02 | 0.01 | 0.0005
3D Separable CNN with skip connection | 0.02 | 0.01 | 0.005 | 0.0005
3D Separable CNN with skip connection and shuffle | 0.02 | 0.01 | 0.005 | 0.0005
Table 2
Comparison of our models to the state-of-the-art methods under different evaluation criteria.

Model | Accuracy | Precision | Recall | F1 | Model Size (MB) | Billion Mult-Adds | Computation time (s)
Modified Inception V3 | 0.9122 | 0.6260 | 0.9112 | 0.7422 | 96.9 | 4.2 | 0.0236
3D CNN | 0.9745 | 0.8585 | 0.9770 | 0.9139 | 53.9 | 9.8 | 0.0224
3D Separable CNN | 0.9632 | 0.8129 | 0.9540 | 0.8778 | 6.17 | 1.3 | 0.0183
3D Separable CNN with skip connection | 0.9863 | 0.9591 | 0.9415 | 0.9502 | 6.17 | 1.3 | 0.0183
3D Separable CNN with skip connection and shuffle | 0.9883 | 0.9652 | 0.9494 | 0.9572 | 6.17 | 1.3 | 0.0183
3D CNN of Pavlo et al. [7] | 0.9481 | 0.7377 | 0.9928 | 0.8464 | 5.15 | 0.5 | 0.0117
Recurrent 3D CNN [8] | 0.8883 | 0.5564 | 0.9606 | 0.7046 | 281 | 36.8 | 0.0445
C3D [45] | 0.9667 | 0.8172 | 0.9790 | 0.8908 | 261 | 24.1 | 0.0440
Two-stream CNN [24] | 0.9607 | 0.7970 | 0.9625 | 0.8719 | 369 | 1.3 | 0.0154
1. The convolutional padding type is changed from the original 'VALID' to 'SAME'; 2. the base module drops its last pooling layer; 3. the kernel size of the third layer in the Aux_Logits branch is changed from 5 × 5 to 6 × 6.

Using the resampled gesture set, the 3D CNN and the modified Inception V3 were trained. The optimizer is RMSProp, the batch size is 16, and the number of iterations is 32,000 (5 epochs). The learning rate of the 3D CNN is 0.0005, and that of the modified Inception V3 is 0.001. For a better convergence rate, the learning rates of both networks decay exponentially with a decay step of 4000 and a decay rate of 0.9. The test results are shown in Fig. 11 and Table 2. We can see that the accuracy of the 3D CNN is considerably better than that of the modified Inception V3. As can be seen from Table 2, the model size of the 3D CNN is smaller, but its amount of computation is larger because the 3D CNN designed in this paper has more feature maps. In summary, the structure of the 3D CNN designed in this paper is not as sophisticated as Inception V3, yet its accuracy is higher and it has fewer parameters, which means that the 3D CNN is more suitable for dealing with sequential images.
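The training configuration just described maps onto the following Keras setup; this is a sketch, and everything beyond the stated optimizer, learning rates and decay schedule is an assumption.

```python
# Sketch of the stated training setup: RMSProp, lr 0.0005, exponential decay.
import tensorflow as tf

schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=5e-4,   # 3D CNN rate; 1e-3 for the modified Inception V3
    decay_steps=4000,
    decay_rate=0.9)
optimizer = tf.keras.optimizers.RMSprop(learning_rate=schedule)

# model.compile(optimizer=optimizer,
#               loss='categorical_crossentropy',   # the negative log-likelihood of Eq. (12)
#               metrics=['accuracy'])
# model.fit(dataset, batch_size=16, ...)
```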
4.4. 3D separable CNN vs. 3D CNN

In this section, we compare four models: 3D CNN, 3D Separable CNN, 3D Separable CNN with skip connection, and 3D Separable CNN with skip connection and shuffle. Apart from the learning rate, the hyper-parameters of these four models are the same as in the previous section. As mentioned in Section 3.3, the 3D Separable CNN makes the network deeper, which causes its gradients to be small and the network difficult to train, so the layer-wise learning rate is used. To avoid having too many hyper-parameters, the network's layers are divided into four groups, each with its own learning rate, decayed exponentially with a decay step of 4000 and a decay rate of 0.9. Since skip connections are beneficial to gradient propagation, the learning rates are reduced when they are added. The four groups are shown in Table 1.

It can be seen from Fig. 11 that the four 3D CNNs are superior to the modified Inception V3, and although the 3D Separable CNN performs slightly worse than the 3D CNN, it still has a high accuracy. After adding skip connections, the accuracy of the model improves, which is mainly reflected in the high accuracy of classifying negative cases. In practice, we are less tolerant of false positives than of false negatives, so the model with skip connections is better. The shuffle operation also slightly improves the accuracy. In addition to the confusion matrix, the results under the other criteria are shown in Table 2. The average accuracy of these five models is very high, because there are many negative samples: even if a model predicted all test samples as negative, its accuracy would be high. Therefore, the other criteria must also be used.
Fig. 11. Confusion matrices. The confusion matrices of the five models are shown; the per-category accuracy of each model can be read from its matrix.
The reduction in the number of parameters of the 3D Separable CNN slightly undermines its performance compared to the 3D CNN. However, once the skip connections are added, all of its performance indexes become clearly better, because the network becomes a composite model in some sense, and thus the accuracy improves. Judging from both Fig. 11 and the Recall, the 3D CNN predicts positive samples better, which also leads to relatively more erroneous judgements and a lower Precision. Combining the results across the various criteria, we can see that the last two models are the best. Both have similar
performance, but the model with the shuffle operation is slightly superior.

4.5. Comparison with others

To further verify the effectiveness of the proposed method, four state-of-the-art methods are selected: the 3D CNN of Pavlo et al. [7], the Recurrent 3D CNN [8], C3D [45] and the Two-Stream CNN [24]. These four models are also discussed in the related work. The first two models target dynamic gesture recognition; C3D aims to learn spatio-temporal features and is also used in the Recurrent 3D CNN; the Two-Stream CNN is the classic action recognition model, but it is based on 2D CNN. Because the size of our input images differs from the original settings of these models, we changed them slightly, as with the modified Inception V3, to make them applicable to the recognition problem of this paper. The changes mainly include: the size of the last convolution kernel in the 3D CNN of Pavlo et al. is changed from 3 × 5 × 3 to 5 × 5 × 3, and the stride of the pooling layer is changed from 2 × 2 × 2 to 1 × 2 × 2; the input of the Recurrent 3D CNN is divided into 3 clips of 4 frames each. The four models were trained with the same hyper-parameters, and the results are shown in Table 2. It can be seen that the method proposed in this paper is still optimal in terms of accuracy. Under the other evaluation criteria, the four state-of-the-art methods achieve a high Recall at the expense of Precision. In terms of model optimization, except for the 3D CNN of Pavlo et al., the amount of calculation and the model size of the other three models are relatively high. The 3D CNN of Pavlo et al. is optimal in terms of calculation, model size and computation time; however, the model is relatively simple, so its accuracy is not very high. Although our models are not optimal on every indicator, this paper mainly proposes a 3D separable convolution method that can be used to optimize a three-dimensional convolutional neural network, and its feasibility is proved by experiments. The proposed 3D CNN model deliberately avoids overly sophisticated design, since it mainly serves to verify our method. In the future, the 3D CNN of Pavlo et al. could be optimized by the method proposed in this paper; we believe it can be further compressed while maintaining or even improving its accuracy.

5. Conclusion

In this paper, a 3D Separable CNN is proposed for dynamic gesture recognition. The goal is to design a compact model with low computational complexity. To achieve this goal, the standard 3D convolution process is decomposed into two parts: depth-wise and point-wise. However, the decomposition deepens the network, which leads to gradient dispersion and makes the network difficult to train. To solve this problem, two methods are used: skip connection and layer-wise learning rate. Skip connection alleviates the problem of gradient dispersion but does not completely solve it, so the layer-wise learning rate is needed for the network to be trainable. It is worth noting that skip connection also improves the accuracy of the network. At the same time, to alleviate the poor information flow between channels in the depth-wise phase, the shuffle operation is added to fuse the information of different channels evenly. The final compressed model achieves good performance.
Since our ultimate goal is to run dynamic gesture recognition on AR glasses, and to our knowledge there is no suitable public gesture dataset yet, we specified five types of gestures and built a dynamic gesture dataset with HoloLens' RGB camera, with which we verified the proposed method. We find that 3D CNN is better than 2D CNN when dealing with sequential data, and uses fewer parameters. Using the proposed separation method, we can further reduce the number of parameters and the amount of computation, at the cost of a slight performance drop and the emergence of gradient dispersion. These problems are then solved using the skip connection method, the layer-wise learning rate method and the shuffle operation. Finally, the 3D separable CNN with skip connection and shuffle operation achieves the best performance in this study. For now, the proposed method has only been validated in TensorFlow [33]. In future work, we will further optimize the model and deploy it on AR glasses.

Acknowledgments

This work was supported by the National Key R&D Program of China (grant number 2017YFD0400405).

References

[1] S.S. Rautaray, A. Agrawal, Vision based hand gesture recognition for human computer interaction: a survey, Artif. Intell. Rev. 43 (1) (2015) 1–54.
[2] P.K. Pisharady, M. Saerbeck, Recent methods and databases in vision-based hand gesture recognition: a review, Comput. Vis. Image Underst. 141 (2015) 152–165.
[3] H. Hasan, S. Abdul-Kareem, Human–computer interaction using vision-based hand gesture recognition systems: a survey, Neural Comput. Appl. 25 (2) (2014).
[4] S. Ji, W. Xu, M. Yang, et al., 3D convolutional neural networks for human action recognition, IEEE Trans. Pattern Anal. Mach. Intell. 35 (1) (2013) 221–231.
[5] F.N. Iandola, S. Han, M.W. Moskewicz, et al., SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and