
Social Touch Gesture Recognition using Deep Neural Networks

Saad Albawi, Fatih Yetkin, Kerem Altun

Abstract — In recent years, social touch gesture recognition has been considered an important study for the touch modality, which can lead to more efficient and realistic human-robot interaction (HRI). In this paper, we performed touch gesture recognition on a dataset provided by the Recognition of Social Touch Gestures Challenge 2015. The dataset, dubbed the Corpus of Social Touch (CoST), had already recorded numerous subjects performing different social gestures, in which each touch was performed on a mannequin arm. A deep convolutional neural network (CNN) is selected to implement the social touch recognition system from the raw sensor inputs. A leave-one-subject-out cross-validation method is used to evaluate system performance. The result is competitive with state-of-the-art performance. Moreover, our proposed approach can recognize the gesture in almost real time (after acquiring a minimum number of frames).

Index Terms — Social touch, gesture classification, neural network, deep learning, convolutional neural network.


1 INTRODUCTION

One of the basic interpersonal methods of communicating emotions is through social touch. Social touch classification is an active line of research with great potential for further improvement, and it can be beneficial for human-robot interaction. One of the most demanding, yet simply stated, questions in this area is to identify the type (or class) of touch when a human touches a robot's artificial skin. A human can easily distinguish and understand social touch; in the human-robot interaction domain, however, we first need to develop an interface to record it. There have been many attempts to build devices that classify human social touch and record it in an available dataset, such as [Chang et al., 2010]. However, we will concentrate on [Jung et al., 2015], [Jung et al., 2014b], and [Wingerden et al., 2014], which proposed a setup that measures the pressure of touches, both to record data and to cover more social touch gesture classes. This setup used a kind of artificial skin to record the pressure applied to it. The goal of [Jung et al., 2015], [Jung et al., 2014b], and [Wingerden et al., 2014] was to identify the touch class out of 14 predefined classes.

• Saad Albawi is a PhD candidate at Istanbul Kemerburgaz University, Istanbul, Turkey. E-mail: [email protected]
• Dr. Fatih Yetkin is with the Department of Computer Engineering, Faculty of Engineering and Architecture, Istanbul Kemerburgaz University, Istanbul, Turkey. E-mail: [email protected]
• Dr. Kerem Altun is with the Department of Mechanical Engineering, Faculty of Engineering, Işık University, Istanbul, Turkey. E-mail: [email protected]


These social gestures were grab, hit, massage, pat, pinch, poke, press, rub, scratch, slap, stroke, squeeze, tap, and tickle.

However, it would not be possible to introduce a fully real-time system; even humans need to wait some amount of time (e.g., on the order of milliseconds) to understand the class of a social touch. Therefore, our attempt in this paper is to classify social touch in (almost) real time; consequently, we also ask how much data (how many frames) is needed, on average, to classify the social gesture. It should be emphasized that our method can lead to more realistic and almost real-time human-robot interaction. The second issue is to avoid preprocessing, which is usually case-dependent and, as discussed earlier, prevents real-time performance. In this paper, we propose a model for the social touch classification problem that avoids preprocessing; in other words, we ask how social touches can be classified only from raw input samples. However, the sensor data used without preprocessing is so large that a very powerful approach is required to classify the gesture classes efficiently. To handle this huge amount of data, we use one of the most powerful tools available, which has become very popular in the literature [LeCun, Bengio and Hinton, 2015]. Deep learning refers to artificial neural networks that currently surpass classical methods in performance in different fields, especially pattern recognition. For example, in [Karpathy et al., 2014], using deep learning, Google was able to achieve more than 80% correct classification rate (CCR) over 487 classes of videos, where the input dimension is 170x170x3xN.

This paper introduces a form of convolutional neural network that classifies social gestures in an end-to-end architecture with almost no preprocessing. The proposed method can compete with state-of-the-art results in terms of accuracy. Moreover, it is able to predict the class of the social gesture in almost real time: our method can start to classify after a minimum number of frames is received, and it can classify even if the data sample is given in the middle of the gesture. Our results show that, with the leave-one-subject-out cross-validation method, we achieve 59.2% accuracy on unseen subjects.

The next section of the paper gives a brief introduction to the CoST dataset and convolutional neural networks. The Methodology section describes the architecture of our proposed deep neural network. The performance of our method is discussed in the Results section. Finally, the Conclusion section summarizes the main findings and proposes future research.

2 BACKGROUND

This section presents the background material related to this paper in order to make it as self-contained as possible.

2.1 CoST Dataset

The CoST dataset [Jung et al., 2014b] provides recorded social touch gestures from different subjects. The data frames are collected with a pressure sensor grid installed on the mannequin arm in an 8x8 layout covering the artificial skin. A single gesture instance collected from a subject consists of an 8x8xN matrix (where N is the number of frames, or, as it is called in this paper, the frame length). The experiment was performed on 31 different subjects (24 males and 7 females). There are a total of 14 social gestures: grab, hit, massage, pat, pinch, poke, press, rub, scratch, slap, stroke, squeeze, tap, and tickle. The subjects were also asked to perform each gesture in three variations (gentle, normal, and rough), with six repetitions of each variation. Therefore, each subject performed 252 gestures overall, and CoST contains a total of 7805 gestures from 31 subjects (a few gestures are missing from the dataset).

Jung et al. (2014a) classified the gestures using two methods to create a baseline: a Bayesian classifier and a support vector machine. For the Bayesian method, the results ranged from 24% to 75% (mean = 54%, SD = 12%), while for the support vector machine, the results ranged from 32% to 75% (mean = 53%, SD = 11%). There have been other attempts to achieve higher performance on gesture recognition using the CoST dataset. Van Wingerden et al. applied a feedforward neural network to the classification of the CoST dataset [Wingerden et al., 2014]. They divided the data into 35% training and 65% testing, and used the training data to select the best parameters for the neural networks. Their architecture has one hidden layer containing 64 neurons. The essential features were selected based on energy histogram and dynamic features, in addition to the basic features used in [Wingerden et al., 2014]. The best performance they could achieve using all the features was 64.6%. However, using leave-one-subject-out cross-validation, they could only obtain 54%, which was almost the same performance as in [Jung et al., 2014b]. Furthermore, Altuglu and Altun (2015) used 65% of the CoST dataset for training and 35% for testing [Altuglu and Altun, 2015]. They achieved classification rates between 26% and 95%, with a mean classification rate of 55.6% for the 14 social touches, using sequential forward floating search for feature selection. The features employed in the model included pressure data, an image feature, a Hurst exponent, Hjorth parameters, and autoregressive model coefficient features taken from [Jung et al., 2014b], for a total of 42 features. Altuglu and Altun (2015) also tested their method on a second dataset, the Human-Animal Affective Robot Touch (HAART) dataset, which consists of 7 different touch gestures (no touch, constant, pat, rub, scratch, stroke, tickle), achieving classification rates of 60% to 70% based on 133 selected features.
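To make the leave-one-subject-out protocol used throughout these studies concrete, the sketch below shows how such an evaluation could be wired up with scikit-learn's LeaveOneGroupOut. It is a minimal illustration only: the random pressure sequences, the mean-frame feature, and the SVC classifier are placeholder assumptions, not the features or models used in the works cited above.

```python
# Hedged sketch of leave-one-subject-out cross-validation on CoST-like data.
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_gestures = 100  # placeholder; CoST itself contains 7805 gestures
# Each gesture is an 8x8xN pressure sequence with a variable frame length N.
gestures = [rng.random((8, 8, int(rng.integers(20, 60)))) for _ in range(n_gestures)]
labels = rng.integers(0, 14, n_gestures)    # 14 gesture classes
subjects = rng.integers(0, 31, n_gestures)  # 31 subjects

# One simple fixed-length feature: the mean pressure frame, flattened to 64 values.
X = np.stack([g.mean(axis=2).ravel() for g in gestures])

# Train on all subjects but one, test on the held-out subject, repeat.
scores = []
for train_idx, test_idx in LeaveOneGroupOut().split(X, labels, groups=subjects):
    clf = SVC().fit(X[train_idx], labels[train_idx])
    scores.append(clf.score(X[test_idx], labels[test_idx]))
print(f"mean accuracy over held-out subjects: {np.mean(scores):.3f}")
```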

2.2 Convolutional Neural Network

A convolutional neural network (CNN) is a type of artificial neural network that can have multiple layers, including convolutional layers, non-linearity layers, and pooling layers. This section provides a very brief introduction to CNNs (see [LeCun et al., 1998], [Schmidhuber, 2015] and [LeCun, Bengio and Hinton, 2015] for more details).

2.2.1 Convolutional Layer

In the convolutional layer, multiple filters slide over the given input. At each position, the summation of an element-by-element multiplication of the filter and the receptive field of the input is calculated as the output of this layer (see Figure 1). This weighted summation is placed as one element of the next layer. As can be seen in Figure 1, the filter matrix (in the middle) is multiplied by the focus area (in the left matrix), which is shown by the green region with the red cell as its center. It should be noted that this operation is not matrix multiplication; it is a so-called element-by-element multiplication. The result of this multiplication is stored at the position corresponding to the center of the focus area in the next layer. We then slide the focus area and fill in the other elements of the convolution result. Each convolutional operation is specified by a stride and zero-padding. The stride, a positive integer, determines the sliding step; for example, a stride of 1 means that we slide the filter one element to the right each time and then calculate the output. Zero-padding adds zero rows and columns to the original input matrix; its main purpose is to let the data at the edges of the input matrix contribute as much as the data at the center. Without zero-padding, the output of the convolution is smaller in size than the input, so the network size shrinks through having multiple layers of convolutions; consequently, one may be limited in the number of convolutional layers in a network. Zero-padding prevents this shrinking and allows an arbitrary number of convolutional layers in the network architecture.

Fig. 1. The convolution layer slides the filter over the given input. The output is the summation of the element-by-element multiplication of the filter and the receptive field. (image from [developer.apple.com])
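As an illustration of the operation just described, the following is a minimal NumPy sketch of a single-channel convolution (strictly, cross-correlation, as CNN libraries conventionally implement it) with configurable stride and zero-padding; the function name and its arguments are ours, not from the paper.

```python
# Minimal sketch of the convolutional layer described above.
import numpy as np

def conv2d(x, kernel, stride=1, zero_padding=0):
    """Slide `kernel` over `x`; at each position, multiply element by
    element with the receptive field and sum the result."""
    if zero_padding > 0:
        # Add zero rows and columns around the input so edge data
        # contributes as much as central data.
        x = np.pad(x, zero_padding, mode="constant")
    kh, kw = kernel.shape
    out_h = (x.shape[0] - kh) // stride + 1
    out_w = (x.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)  # element-wise product, then sum
    return out

# With a 3x3 filter, stride 1, and zero-padding 1, an 8x8 input (e.g. one
# CoST pressure frame) keeps its 8x8 size -- the shrink-prevention effect
# of zero-padding described above.
frame = np.random.rand(8, 8)
kernel = np.random.rand(3, 3)
print(conv2d(frame, kernel, stride=1, zero_padding=1).shape)  # (8, 8)
```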

2.2.2 Non-linearity

The non-linearity can be used to adjust or cut off the generated output. There are many nonlinear functions that can be used in a convolutional neural network. However, the rectified linear unit (ReLU), shown in Figure 2, is one of the most common non-linearities applied in fields such as image processing. The ReLU can be represented as

$$\mathrm{ReLU}(x) = \begin{cases} 0 & \text{if } x < 0 \\ x & \text{if } x \geq 0 \end{cases}$$
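As a quick illustration (our own sketch, not code from the paper), the ReLU above reduces to a single element-wise maximum in NumPy:

```python
# ReLU: negative activations are clipped to zero, positive ones pass through.
import numpy as np

def relu(x):
    return np.maximum(0, x)

print(relu(np.array([-2.0, -0.5, 0.0, 1.5])))  # [0.  0.  0.  1.5]
```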