Available online at www.sciencedirect.com

ScienceDirect

Procedia CIRP 72 (2018) 3–8

www.elsevier.com/locate/procedia
51st CIRP Conference on Manufacturing Systems
Deep Learning-based Multimodal Control Interface for Human-Robot Collaboration

Hongyi Liu*, Tongtong Fang, Tianyu Zhou, Yuquan Wang, Lihui Wang

KTH Royal Institute of Technology, Brinellvägen 68, 11428 Stockholm, Sweden
* Corresponding author. Tel.: +4687907824; E-mail address: [email protected]
Abstract

In human-robot collaborative manufacturing, an industrial robot is required to dynamically change its pre-programmed tasks and collaborate with human operators at the same workstation. However, a traditional industrial robot is controlled by pre-programmed control codes, which cannot support the emerging needs of human-robot collaboration. In response, this research explored a deep learning-based multimodal robot control interface for human-robot collaboration. Three methods were integrated into the multimodal interface: voice recognition, hand motion recognition, and body posture recognition. Deep learning was adopted as the algorithm for classification and recognition, and human-robot collaboration specific datasets were collected to support the deep learning algorithms. The results presented at the end of the paper show the potential of adopting deep learning in human-robot collaboration systems.

© 2018 The Authors. Published by Elsevier B.V.
Peer-review under responsibility of the scientific committee of the 51st CIRP Conference on Manufacturing Systems.
doi:10.1016/j.procir.2018.03.224

Keywords: Human-robot collaboration; Deep learning; Robot control

1. Introduction

Traditional industrial robot controllers utilise a model-based control method, which is supported by a programming-based user interface [1]. Therefore, to control an industrial robot, operators need to write brand-specific robot control codes. With the control codes written, the industrial robot is controlled and robot tasks are designed. However, with the emerging potential of Industry 4.0 methodologies, highly customised products require a flexible programming interface between human operators and robots. In particular, the emergence of human-robot collaboration (HRC) in manufacturing and its applications requires an industrial robot to dynamically change its pre-programmed tasks and collaborate with human operators at the same workstation. In an HRC workstation, the industrial robot is required to support and collaborate with the human worker as much as possible [2]. The traditional control code-based robot programming method clearly cannot support the emerging needs of HRC. In HRC applications, a sensor-based, flexible, and integrated solution is needed.

Recently, the emergence of deep learning algorithms has greatly re-shaped industry and society [3,4]. In many industrial recognition and strategy tasks, deep learning was proved to outperform the level of human experts [3,5]. Compared with traditional machine learning technologies, deep learning does not require careful engineering and expert-level domain knowledge to design a feature extractor, which is an important step to transform the raw data into a pattern that a machine learning classifier could use as input [4]. For HRC applications, deep learning algorithms could potentially increase the flexibility and the applicability of an HRC system.

In this paper, the authors intend to build a multimodal robot control interface for HRC. As shown in Fig. 1, the multimodal robot control interface utilises deep learning algorithms, sensor inputs, and open source robot control interfaces. Firstly, three information sources are collected as the HRC dataset: human body postures captured by cameras, hand motions captured by Leap Motion sensors, and voice commands recorded by a microphone. Secondly, the deep learning algorithms are trained to understand and classify the above dataset. Thirdly, the classified results are further analysed and fused.
At last, the robot control commands are generated and fed into the robot control interface, where the industrial robot is controlled accordingly.
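The paper does not detail how the three recognised modalities are fused into a single robot command. Purely as an illustration of the fusion stage in Fig. 1, a minimal sketch of such a rule-based mapping (with hypothetical class names, not the authors' method) could look like the following.

# Purely illustrative fusion sketch: class names and priorities are hypothetical,
# as the paper does not specify the fusion logic.
def fuse_commands(voice_cmd, hand_motion, body_posture):
    """Map the three classifier outputs to a single robot command."""
    if body_posture == 'hand_waving':          # operator requests attention
        return 'stop'
    if hand_motion == 'wave_left_to_right':    # defined hand motion from Section 3.2
        return 'home'
    if voice_cmd in ('up', 'down', 'left', 'right', 'go'):
        return voice_cmd                       # direct motion command
    return 'idle'

# Example: fuse_commands('left', None, 'standing') returns 'left'.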
[Fig. 1 overview: for each modality (body posture, hand motion, voice command) the pipeline comprises dataset collecting, deep learning model training, and real-time recognition, followed by control information fusion, the robot control interface, and the robot controller.]
Fig. 1. Deep learning-based multimodal control for human-robot collaboration

2. Related work

Deep learning is a branch of machine learning that improves the performance of algorithms by learning deep representations of a given dataset. The common structure of a deep learning algorithm consists of one input layer, several hidden layers that extract deep features from the input, and one output layer for inference. Among the different layers in the network, the output of the previous layer is the input of the current layer. Representations of the data are learned in each layer and form abstracted concepts in a hierarchical manner, from simple to complex characteristics.

There have been three historical waves of deep learning development, which can be dated back to the 1940s [6]. The birth period was the implementation of the perceptron [7], a single-neuron machine inspired by a biological mechanism. The second historical wave was led by the successful application of the back-propagation algorithm [7,8] to minimise the cost functions in the neural network training process; back-propagation enabled multi-layer neural network structures. Recently, with the advancement of computational capacity, the cost of training complex neural networks has been greatly reduced, which paved the way for the third historical wave. The third wave was triggered by a breakthrough, the deep belief network [9], proposed by Hinton and implemented with a greedy layer-wise pre-training strategy. The viability of the strategy was proven in other neural networks [10] as well. Thereafter, deep learning has been widely applied in several application areas, such as computer vision, speech recognition, and natural language processing.

Deep learning outperforms other traditional machine learning algorithms. Traditional machine learning algorithms are limited in processing data in its raw form [4], while deep learning can maintain an internal representation of the data during the training process. Usually, traditional machine learning algorithms need domain-specific expertise to extract features; however, those extracted representations may still lose part of the data's hidden patterns. Within deep learning there are several different algorithms, each suitable for specific tasks. Some popular algorithms are the Convolutional Neural Network (CNN) for image or audio processing, long short-term memory (LSTM) networks for sequence-to-sequence modelling, and the Autoencoder for feature learning. Deep learning and its related algorithms, for instance CNN and LSTM, provide a cost-effective approach to understanding human activities. With standard open libraries, it is also possible to interface the developed deep learning-based multimodal recogniser with an industrial robot.

3. Multimodal control methods

In this section, the methods towards the multimodal control interface are introduced. The authors explored voice recognition, hand motion recognition, and human body posture recognition for HRC applications.

3.1. Voice recognition
The first approach focuses on utilising audio data recognition to control a robot. The authors selected five voice commands from a standard speech commands dataset [11]. The five commands are 'down', 'go', 'left', 'right' and 'up', as these five words fit the HRC application best. There are different methods to realise voice classification. In this paper, the authors explored a CNN-based deep learning method. CNN has been proven to be an effective algorithm in 2-D image classification and recognition tasks [5,12]. However, audio data is represented as a 1-D time series, as shown in Fig. 2 (a). To apply a CNN to voice recognition, the authors transformed the audio files into 40 × 32 2-D images, as shown in Fig. 2 (b). The implementation details and architecture of the CNN are discussed in Section 4.

3.2. Hand motion recognition

The second approach for the multimodal robot control interface is hand motion recognition. By understanding the data captured by a Leap Motion sensor, the robot is controlled accordingly. Leap Motion is a non-contact sensor produced by Leap Motion Inc. [13]. It is capable of capturing hand motions by accurately locating the hands and fingers within an interaction box, as shown in Fig. 3 (a). A Leap Motion sensor connected to a laptop is shown in Fig. 3 (b).
The data output from the Leap Motion sensor provides an abstracted real-time representation of human hands, including a timestamp, the finger positions, and the hand position. The definitions of the interaction box and the coordinate system of the Leap Motion sensor are shown in Fig. 3 (a).
Fig. 3. (a) The coordinate and interaction box definition of Leap Motion sensor; (b) Working in the lab with a Leap Motion sensor
Fig. 2. (a) Audio data with waveform representation; (b) The same audio data with 2-D image representation
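The paper does not state which transform produces the 2-D representation in Fig. 2 (b). A log-mel spectrogram is one common choice that matches the 40 × 32 input size used later in Table 1; the sketch below is an assumption in that spirit (librosa, 1 s clips sampled at 16 kHz, 40 mel bands, hop length 512).

# Hypothetical preprocessing sketch: turn a 1 s voice command into a 40 x 32
# log-mel "image" similar to the representation in Fig. 2 (b). The transform
# and all parameters are assumptions, not taken from the paper.
import numpy as np
import librosa

def audio_to_image(wav_path, sr=16000, n_mels=40, hop_length=512, n_frames=32):
    y, _ = librosa.load(wav_path, sr=sr, duration=1.0)     # load a 1 s clip
    y = np.pad(y, (0, max(0, sr - len(y))))                # pad short clips to 1 s
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels,
                                         hop_length=hop_length)
    img = librosa.power_to_db(mel, ref=np.max)             # log scale
    img = img[:, :n_frames]                                # crop to 40 x 32
    img = (img - img.min()) / (img.max() - img.min() + 1e-8)  # normalise to [0, 1]
    return img[..., np.newaxis]                            # shape (40, 32, 1)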
There is no standard dataset available for the Leap Motion sensor in the hand motion recognition field. To apply the Leap Motion sensor in an HRC scenario, it is therefore reasonable to build a Leap Motion hand motion dataset. The authors started with a binary classification task, where the expected outcome of the algorithm is to distinguish a defined motion from other random motions. The authors generated and recorded a list of hand-waving motions from left to right, as shown in Fig. 4 (a); this dataset is labelled as positive. A random motions dataset is also collected and labelled as negative, as shown in Fig. 4 (b). By training with the labelled dataset, the algorithm is expected to identify the hand-waving motion.
Fig. 4. Screenshots of the hand motion captured by Leap Motion sensor. The hand skeleton model is the detected hand. The colourful lines are the moving traces: (a) Defined hand motion from left to right; (b) Random hand motions
As introduced in the earlier section, the authors recorded, transformed, and labelled the dataset collected from the Leap Motion sensor. The dataset is further randomly split into training and test datasets with a ratio of 3:1. The combined dataset contains 8520 rows and 88 columns. Each row represents one observation, with all attributes in the columns. The attributes are the x, y, z coordinates of the detected hand keypoints (87 values in total), such as the thumb metacarpal, thumb proximal, thumb intermediate, thumb distal, and thumb tip. Based on the prepared dataset, the authors tested several neural networks and chose the best architecture to extract deep features from the data. The first model is a multilayer perceptron (MLP) with three fully-connected layers. The second model is a three-layer CNN. The third model is a long short-term memory (LSTM) network. In this paper, the authors only address the test with the MLP. The MLP architecture and implementation details are shown in Section 4.
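As a minimal sketch of the data preparation just described, the snippet below assumes the recorded observations are exported to a CSV file with 87 coordinate columns followed by one binary label column; the file name and column layout are assumptions.

# Sketch of preparing the recorded Leap Motion data for training.
# The CSV layout and file name are assumed, not stated in the paper.
import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.read_csv('leap_motion_dataset.csv')   # assumed: 8520 rows x 88 columns
X = data.iloc[:, :-1].values                    # 87 coordinate attributes
y = data.iloc[:, -1].values                     # 1 = hand waving, 0 = random motion
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)      # 3:1 split as in the paper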
3.3. Body posture recognition

In this section, a deep learning-based body posture recognition method for HRC is introduced. With the collected HRC-specific dataset, human body postures are trained and recognised from the video stream, and the robot is controlled according to the recognised human body posture. There are two different types of human body movements: (i) static body posture, which can be captured with a single image; and (ii) continuous body motion, which needs to be captured by a video stream (many images). In this paper, the authors only consider the static body posture, which can be captured by a single image.

There are many different open-source human body posture datasets available, for instance, the MPII dataset and the VGG human pose estimation datasets [14,15]. However, HRC is a specific application of human body posture recognition with many specific postures and environments. Therefore, an HRC-specific human body posture dataset is needed. As shown in Fig. 5, the first collected dataset is a binary classification dataset that consists of two different postures (standing and hand waving); each of the postures includes 36 images. Fig. 6 shows the second dataset, which is also a binary classification dataset consisting of two different postures. In the second dataset, the difference between the two postures is much smaller than in the first dataset: it involves a standing posture and a similar posture with only small differences. The second dataset also provides more images (187 in total).

The choice of machine learning algorithm is a crucial factor for the recognition result. It has been proven that some of the two-step traditional feature-based machine learning algorithms cannot handle the complex background and small human posture differences. After experiments, this research utilises a CNN as the machine learning algorithm to train and recognise the human postures. The authors designed a data cleaning pre-processing step before training that transforms the input images into 200 × 200 greyscale images. With the pre-processed images prepared, the images are labelled and randomly shuffled. The prepared images are further split into an 80% training dataset and a 20% test dataset. Therefore, the designed CNN receives 200 × 200 greyscale images as input, and the output of the CNN is two classes (true or false). Details of the designed CNN are shown in the following section.

Fig. 5. Customised HRC dataset 1

Fig. 6. Customised HRC dataset 2
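A minimal sketch of the pre-processing step described above, assuming the posture images are stored in one folder per class; folder names, image format, and the use of OpenCV are assumptions.

# Sketch of the data cleaning step: resize to 200 x 200 greyscale, label,
# shuffle, and split 80%/20%. Paths and class names are hypothetical.
import glob
import numpy as np
import cv2
from sklearn.model_selection import train_test_split

def load_posture_dataset(class_dirs={'standing': 0, 'hand_waving': 1}, size=200):
    images, labels = [], []
    for folder, label in class_dirs.items():
        for path in glob.glob(f'{folder}/*.jpg'):
            img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)    # greyscale
            if img is None:
                continue
            img = cv2.resize(img, (size, size)) / 255.0     # 200 x 200, scaled to [0, 1]
            images.append(img[..., np.newaxis])
            labels.append(label)
    X, y = np.array(images), np.array(labels)
    return train_test_split(X, y, test_size=0.2, shuffle=True)   # 80%/20% split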
4. Implementation and results

In this section, the three methods described in the previous subsections are implemented and tested by the authors. The detailed deep learning network architectures and test results are explained.

4.1. Voice recognition

One of the difficulties of neural network training is to decide on an appropriate model architecture, as complex tasks require complex models. After trial and error, the authors decided on a suitable model for voice recognition, considering both model complexity and computational efficiency. The model architecture is shown in Table 1.

Table 1. Architecture of CNN for voice recognition

Layer (type)
Input Layer (40 × 32 × 1)
4 Conv (5 × 5) + ReLU
Batch Normalisation
MaxPooling (2 × 2)
Dropout
8 Conv (3 × 3) + ReLU
Batch Normalisation
MaxPooling (2 × 2)
Dropout
Flatten
Fully-Connected Layer 128 + ReLU
Fully-Connected Layer 64 + ReLU
Softmax 10
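For reference, the architecture in Table 1 can be expressed as a Keras model roughly as follows. Layer sizes follow the table; padding, dropout rates, the loss, and the optimiser are not reported in the paper and are assumptions.

# A minimal Keras sketch of the voice-recognition CNN in Table 1.
# Filter counts, kernel sizes, and dense widths follow the table;
# padding, dropout rates, loss, and optimiser are assumptions.
from tensorflow.keras import layers, models

def build_voice_cnn(input_shape=(40, 32, 1), num_classes=10):
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(4, (5, 5), activation='relu', padding='same'),
        layers.BatchNormalization(),
        layers.MaxPooling2D((2, 2)),
        layers.Dropout(0.25),                        # rate assumed
        layers.Conv2D(8, (3, 3), activation='relu', padding='same'),
        layers.BatchNormalization(),
        layers.MaxPooling2D((2, 2)),
        layers.Dropout(0.25),                        # rate assumed
        layers.Flatten(),
        layers.Dense(128, activation='relu'),
        layers.Dense(64, activation='relu'),
        layers.Dense(num_classes, activation='softmax'),
    ])
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    return model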
To tune hyper-parameters (e.g. dropout, learning decay), the authors first applied a training-validation-test (70%-15%-15%)
set split. The validation set was used for model selection. For the final test of the model with the selected parameters, the authors applied a simple 80%-20% training-test split. Fig. 7 shows the accuracy on the training set and the test set. After training for 40 epochs, the average accuracy on the test set converged to 92%-93%.
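Putting the splits and the Table 1 network together, a hypothetical end-to-end training call might look as follows. It assumes the build_voice_cnn helper sketched after Table 1 is in scope, that the pre-processed 40 × 32 × 1 command images and integer labels have been saved to the (hypothetical) files below, and that integer labels are paired with a sparse cross-entropy loss.

# Hypothetical usage of the Table 1 network with the splits described above.
# File names, batch size, and random states are placeholders.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.load('voice_images.npy')   # shape (N, 40, 32, 1), assumed pre-processed clips
y = np.load('voice_labels.npy')   # integer class labels, assumed

# 70%-15%-15% training-validation-test split used for hyper-parameter tuning.
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.15, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.15 / 0.85, random_state=0)

model = build_voice_cnn(num_classes=len(np.unique(y)))
model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=40, batch_size=32)
print(model.evaluate(X_test, y_test))
# For the final model, the paper reports retraining with a simple 80%-20% split.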
Fig. 7. Accuracy on training set (orange) and test set (blue) over 40 epochs

4.2. Hand motion recognition

Table 2 shows the architecture of the selected MLP model. The model is designed with three fully-connected (dense) layers. A dropout of 0.5 is added after the first two layers to avoid overfitting.

Table 2. Architecture of MLP network for hand motion recognition

Layer (type)            Output Shape    Param #
dense_24 (Dense)        (None, 32)      2816
dropout_19 (Dropout)    (None, 32)      0
dense_25 (Dense)        (None, 32)      1056
dropout_20 (Dropout)    (None, 32)      0
dense_26 (Dense)        (None, 2)       66
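The Table 2 network can be reproduced in Keras roughly as follows. The layer widths follow the parameter counts in the table (87 input features, two 32-unit hidden layers, 2 output classes) and the dropout rate of 0.5 comes from the text; the activations, loss, and optimiser are assumptions.

# A minimal Keras sketch of the hand-motion MLP in Table 2.
# Layer widths follow the table; ReLU/softmax activations, loss, and
# optimiser are assumptions.
from tensorflow.keras import layers, models

def build_hand_motion_mlp(num_features=87):
    model = models.Sequential([
        layers.Input(shape=(num_features,)),
        layers.Dense(32, activation='relu'),
        layers.Dropout(0.5),
        layers.Dense(32, activation='relu'),
        layers.Dropout(0.5),
        layers.Dense(2, activation='softmax'),
    ])
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    return model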
Fig. 8 shows the accuracy and loss of the MLP during the training process. The training accuracy stabilises at around 0.98 and the training loss at around 0.20. The evaluation on the test dataset shows a test accuracy of 0.96 and a loss of 0.12. There is no significant gap between training and test accuracy; therefore, the trained model has good generalisation ability.
Fig. 8. (a) MLP model accuracy on training dataset; (b) MLP model loss on training dataset

4.3. Body posture recognition

The architecture of the selected CNN is shown in Table 3. As shown in Fig. 5 and Fig. 6, the authors tested the CNN on the two collected datasets. Fig. 9 (a)(b)(c) shows the results of the CNN for human body posture recognition on dataset 1. After 20 epochs, the CNN achieved an accuracy of 0.96 on the training dataset, with a training loss of 0.11. The accuracy on the test dataset converged to 0.93. It is clear that the different postures are well classified without obvious overfitting.

Table 3. Architecture of CNN for body posture recognition

Layer (type)                      Output Shape           Param #
conv2d_1 (Conv2D)                 (None, 198, 198, 32)   320
conv2d_2 (Conv2D)                 (None, 196, 196, 64)   18496
max_pooling2d_1 (MaxPooling2D)    (None, 98, 98, 64)     0
dropout_1 (Dropout)               (None, 98, 98, 64)     0
flatten_1 (Flatten)               (None, 614656)         0
dense_1 (Dense)                   (None, 128)            78676096
dropout_2 (Dropout)               (None, 128)            0
dense_2 (Dense)                   (None, 2)              258

To further push the limit, dataset 2 was collected to challenge the capability of the designed CNN. As shown in Fig. 6, in dataset 2 one body posture is standing and the other is standing with a very small finger gesture. The difference is so small that it is difficult to distinguish even for human eyes. As shown in Fig. 9 (d)(e)(f), the CNN achieved an accuracy of 0.71 on the training dataset, and the accuracy on the test dataset is 0.70. The authors conclude that this model is not 'well-learned'. Moreover, if the training process continues for more epochs, the model starts to overfit after around 30 epochs. It is clear that the proposed CNN model cannot perfectly distinguish the two human postures; the maximum accuracy is around 0.7 for this dataset.

Fig. 9. (a) Model accuracy on the training set of dataset 1; (b) Model accuracy on the validation set of dataset 1; (c) Model loss on the training set of dataset 1; (d) Model accuracy on the training set of dataset 2; (e) Model accuracy on the validation set of dataset 2; (f) Model loss on the training set of dataset 2
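For reference, the Table 3 network corresponds to a Keras model along the following lines. Layer types and output shapes follow the table; activations, dropout rates, the loss, and the optimiser are assumptions.

# A minimal Keras sketch of the body-posture CNN in Table 3.
# Layer types and shapes follow the table (200 x 200 x 1 greyscale input);
# activations, dropout rates, loss, and optimiser are assumptions.
from tensorflow.keras import layers, models

def build_posture_cnn(input_shape=(200, 200, 1)):
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, (3, 3), activation='relu'),   # -> 198 x 198 x 32
        layers.Conv2D(64, (3, 3), activation='relu'),   # -> 196 x 196 x 64
        layers.MaxPooling2D((2, 2)),                    # -> 98 x 98 x 64
        layers.Dropout(0.25),                           # rate assumed
        layers.Flatten(),                               # -> 614656 features
        layers.Dense(128, activation='relu'),
        layers.Dropout(0.5),                            # rate assumed
        layers.Dense(2, activation='softmax'),
    ])
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    return model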
4.4. Discussions
In this study, the authors reached the expected goal of building a robot control interface by utilising deep learning algorithms. The performance on the test datasets demonstrates the capability and efficiency of the deep learning algorithms. Compared with the traditional code-based pre-programmed approach, the deep learning-based method is sensor-based, viable, and flexible. As a promising approach, a deep learning-based system will potentially benefit HRC applications.

5. Conclusions and Future work

In this study, the authors built a deep learning-based multimodal robot control interface for HRC. Three methods were integrated into the multimodal interface: voice recognition, hand motion recognition, and body posture recognition. Deep learning was adopted as the algorithm for classification and recognition, and HRC-specific datasets were collected for the deep learning algorithms. The results presented in this paper show the potential of adopting deep learning in HRC systems. As future work, the current batch-processing interface needs to be transformed into a real-time robot control interface. In addition, a larger variety of hand and body postures could be added to the datasets to increase the applicability of the interface. Moreover, research is needed to find the most suitable deep learning model for a given method; for instance, an LSTM network would be desirable for hand motion recognition and is expected to outperform the current MLP model. Finally, the authors will compare the performance of the implemented deep learning algorithms with other traditional machine learning algorithms.

References

[1] Brogårdh T. Present and future robot control development - An industrial perspective. Annu Rev Control 2007;31:69–79. doi:10.1016/j.arcontrol.2007.01.002.
[2] Liu H, Wang L. Gesture recognition for human-robot collaboration: A review. Int J Ind Ergon 2016:1–13. doi:10.1016/j.ergon.2017.02.004.
[3] Silver D, Schrittwieser J, Simonyan K, Antonoglou I, Huang A, Guez A, et al. Mastering the game of Go without human knowledge. Nature 2017;550:354.
[4] LeCun Y, Bengio Y, Hinton G. Deep learning. Nature 2015;521:436–44.
[5] Lawrence S, Giles CL, Tsoi AC, Back AD. Face recognition: A convolutional neural-network approach. IEEE Trans Neural Networks 1997;8:98–113.
[6] Goodfellow I, Bengio Y, Courville A. Deep learning. MIT Press; 2016.
[7] Rosenblatt F. The perceptron: A probabilistic model for information storage and organization in the brain. Psychol Rev 1958;65:386.
[8] Williams D, Hinton G. Learning representations by back-propagating errors. Nature 1986;323:533–8.
[9] Hinton GE, Osindero S, Teh Y-W. A fast learning algorithm for deep belief nets. Neural Comput 2006;18:1527–54.
[10] Bengio Y, Lamblin P, Popovici D, Larochelle H. Greedy layer-wise training of deep networks. Adv Neural Inf Process Syst, 2007, p. 153–60.
[11] Warden P. Speech Commands: A public dataset for single-word speech recognition. Dataset available from http://download.tensorflow.org/data/speech_commands_v0.01.tar.gz; 2017.
[12] Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. Adv Neural Inf Process Syst, 2012, p. 1097–105.
[13] Potter LE, Araullo J, Carter L. The Leap Motion controller: a view on sign language. Proc 25th Aust Comput-Hum Interact Conf Augment Appl Innov Collab, ACM; 2013, p. 175–8.
[14] Charles J, Pfister T, Magee D, Hogg D, Zisserman A. Personalizing human video pose estimation. Proc IEEE Conf Comput Vis Pattern Recognit, 2016, p. 3063–72.
[15] Andriluka M, Pishchulin L, Gehler P, Schiele B. 2D human pose estimation: New benchmark and state of the art analysis. Proc IEEE Conf Comput Vis Pattern Recognit, 2014, p. 3686–93.