International Journal of Advanced Robotic Systems

ARTICLE

Hand Gesture Modeling and Recognition for Human and Robot Interactive Assembly Using Hidden Markov Models

Regular Paper

Fei Chen 1,2, Qiubo Zhong 3,4,*, Ferdinando Cannella 1, Kosuke Sekiyama 2 and Toshio Fukuda 2

1 Advanced Robotics Department, Istituto Italiano di Tecnologia, Genova, Italy
2 Department of Micro-nano System Engineering, Nagoya University, Japan
3 School of Electronic and Information Engineering, Ningbo University of Technology, Ningbo, China
4 State Key Laboratory of Robotics and System, Harbin Institute of Technology, Harbin, China
* Corresponding author. E-mail: [email protected]

Received 02 October 2014; Accepted 09 December 2014
DOI: 10.5772/60044

© 2015 The Author(s). Licensee InTech. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Gesture recognition is essential for human and robot collaboration. Within an industrial hybrid assembly cell, the performance of such a system significantly affects the safety of human workers. This work presents an approach to recognizing hand gestures accurately during an assembly task performed in collaboration with a robot co-worker. We have designed and developed a sensor system for measuring natural human-robot interactions. The position and rotation information of a human worker's hands and fingertips are tracked in 3D space while completing a task. A modified chain-code method is proposed to describe the motion trajectory of the measured hands and fingertips. The Hidden Markov Model (HMM) method is adopted to recognize patterns via data streams and identify workers' gesture patterns and assembly intentions. The effectiveness of the proposed system is verified by experimental results. The outcome demonstrates that the proposed system is able to automatically segment the data streams and recognize the gesture patterns thus represented with a reasonable accuracy ratio.

Keywords: Hybrid Assembly System, Human-robot Collaboration, Artificial Cognition, Hidden Markov Model

1. Introduction

There is exceptional demand within the manufacturing industry to meet high-mix, low-volume requirements driven by changing consumer-market demands. These demands are accompanied by an ever-increasing number of product variants and smaller lot sizes [1, 2]. Fully robotic manufacturing cells have already been designed and adopted for this purpose [3]. However, a fully robotic manufacturing process cannot attain sufficient flexibility with a highly variable product line, while it must also remain cost-effective in order to support this demand. By taking advantage of a human's adaptability and flexibility, we can exploit the concept of a hybrid assembly system for medium-sized manufacturing processes. Hybrid assembly creates a modern assembly mode whereby the robot works as a co-worker to collaborate with the human and share the same working space and time [1, 4, 5]. Within this scenario, the advantages of human-robot collaboration are exploited through an


optimized task-scheduling system, while the shortcomings are avoided. The realization of a hybrid system can also have a positive impact on society [6]. Previous work has shown that effective collaboration between human and robot can achieve cost-effective performance and reduce the total assembly time (i.e., makespan) and the cost of production [6]. The importance of hybrid assembly systems has become increasingly apparent both in industry and academia as regards improving production efficiency in manufacturing processes [7-10]. The main objective of such a system is to make use of an individual's intelligence, expertise and flexibility. In this way, the robotic system is also able to take advantage of the human operator's sense, sensibility and resourcefulness to complete a required task. On the other hand, the human can utilize the high precision, strength and repeatability of the robotic system, and thus reduce fatigue and the risk of injury as well as increase overall work safety [1, 11].

A hybrid assembly cell (HAC), where human and robot work collaboratively within a limited space, is commonly agreed to be one realization of hybrid assembly systems [1, 12]. Compared with human-robot collaboration in an open environment, a cell-based assembly can concentrate much more on multi-functional modular manufacturing with small-volume requirements. This task-oriented assembly can be quickly deployed and allocated between human and robot co-workers [6]. However, it remains difficult to model such cooperation quantitatively, and few effective models describing the interaction between humans and robots can be found in recent research papers. Yet when addressing industrial issues it is very important to evaluate cooperation performance quantitatively, in terms of the time span or errors which directly affect the profits that a company can make. We study the mathematical model for human and robot cooperation by building a stochastic Petri-net system, as in [13]. It is an event-driven system, whereby the robot detects certain "trigger" events in order to carry out corresponding reactions. Similar work can be found in [14], where a HAC called the "multi-modal assembly-support system" (MASS) is developed. MASS is equipped with physical support and information support, guaranteeing human workers' safety as well as the assembly task-flow. However, MASS mainly focuses on the safety rules category within the hybrid assembly and disregards any collaboration between the human and the robot co-worker. The recent research in [11] addresses a safety-control strategy for a robot co-worker by monitoring the position of the robot end-effector. Speed control for the end-effector is categorized into different stages within the working space. However, this configuration method profoundly restricts the performance of the robot and lowers the level of collaboration. Another platform, called "joint-action for humans and industrial robots" (JAHIR, within the CoTeSys project), is introduced


in [15]. JAHIR focuses on monitoring the status of human workers and the assembly work-flow. Two cameras mounted on top of the working area are used to determine the 3D position of the human operator's hands. Tracking of the hands is achieved based on a 3D occupancy map generated by the cameras. This configuration of the sensor system can only obtain the raw position of the hands, and therefore it cannot achieve accurate pattern recognition. The method is also time-consuming for online 3D occupancy-map generation, which makes it difficult for the robot co-worker to respond quickly.

Within a HAC system, human action-pattern recognition and intention estimation are the key issues that must be addressed [16]. The assembly tasks assigned to the human and the robot are defined in advance based on a selection of optimal rules [6, 13]. According to this task-flow, the human worker can easily perceive and understand what his partner is doing, while the robot co-worker is limited and unable to perceive and react accordingly. Vision-based sensor systems are already widely applied for non-contact environment awareness in human-robot interaction (HRI) [17, 18]. However, the recognition of static and dynamic gestures within dynamic environments is difficult: it is essential to isolate the objects from complex and dynamic scenes with cluttered backgrounds. Consequently, pattern recognition on RGB image data combined with depth information (RGB-D data) has been introduced in recent years [19, 20]. Aligned with this research, techniques for addressing static and dynamic gestures have also been presented: single-frame data-based human gesture recognition [21], object recognition [22] and 3D environment reconstruction [23] have been reported. There are already a number of commercial RGB-D cameras available on the market, at prices ranging from 200 USD to 50,000 USD and capture speeds from 0.033 seconds to three seconds per frame; these constraints have restricted the wider usage of RGB-D cameras in industry. Among these cameras, the Microsoft Kinect costs less than 200 USD with a frame rate of 30 FPS [24]. One drawback of the Kinect is its inaccurate measurement output: subtle movements of the human hand and fingertips are difficult to measure. In [25], the authors developed algorithms to detect the human palm and fingertips, and in [26] the authors developed an algorithm based on flocking in order to interact with computers more naturally. In [27], the pose of a human was computed from the fusion of data from a Gypsy-gyro suit based on accelerometers and UWB sensors, for assembly and disassembly tasks in collaboration between humans and industrial robots. In the present research, we assume that the illumination within the assembly cell is always stable. When a human worker is collaborating with a robot, his hands are tracked only when they are completely exposed to the Kinect and the light source. Supporting sensor data are sometimes provided for recognizing such actions.

The HMM is widely applied to pattern recognition, including speech recognition [28], handwriting recognition [29, 30], human behaviour recognition [31] and trajectory learning [32]. It can also be applied to human-action pattern recognition via sensor-data streams. Related works

demonstrate pattern recognition [33] and prediction [34, 35]. One of the challenges in using HMMs for online applications is dealing with data segmentation on data streams and recognizing patterns from short segments.

The original contribution of this work is the design of an intelligent human-robot collaborative hybrid assembly (iHRCHA) cell, in which a sensor system serves as a natural interface between a human and a robot co-worker. Via this interface, the robot co-worker is able to identify a human worker's hand gestures accurately and rapidly. The mechanism utilizes an RGB-D camera and a supportive glove to produce the position data and rotation information of the human's palm and fingertips in 3D space. An algorithm is developed to obtain accurate information from the raw data streams of the sensor system. The HMM is combined with this interface for online hand-gesture recognition using a segmentation technique. This system is cheap and can be deployed quickly. It demonstrates an improvement in the current setup of HACs by providing a robot co-worker with the capability of carrying out collaborative tasks rapidly, effectively and safely.

The remainder of this paper is organized as follows: Section II introduces the basic assumptions, problems and challenges in iHRCHA. Section III describes the experimental setup, including the sensor system and the algorithm used to process the sensor information and obtain the featured data. A trajectory descriptor is introduced to encode the human's palm and fingertip movement trajectories. Section IV describes a task scenario which assesses the presented system. Section V and Section VI present the experimental results, discussion and conclusion, respectively.

2. An intelligent human and robot collaborative hybrid assembly cell

2.1 Basic problem

In iHRCHA cells, human workers and robot co-workers must work closely with each other. In [1], the author describes several typical human and robot coordination models:

1. The human alternately collaborates with a robot co-worker in performing a task. In this case, the human and the robot perform the assembly sequentially (Fig. 1-(a)). The human and the robot do not share the working time, but they do share the working space.

2. The human collaborates with the robot co-worker in performing a task. In this case, the human and the robot collaboratively perform the assembly task simultaneously (Fig. 1-(b)). The human and the robot share both the working time and the working space.

Figure 1. Gantt diagrams for (a) "human alternates with robot" and (b) "human/robot tasks shift"

When the human worker is performing the assembly task, the robot monitors, detects and estimates his actions and intentions. Being aware of the predefined scheduled assembly tasks, the robot co-worker can assist the human and carry out its own assembly task. The human cannot send explicit or implicit commands using a traditional human-machine interface, such as a keyboard or a mouse; he has to concentrate and focus on more urgent tasks. As a consequence, it is reasonable to design a human-robot interface in a natural way, using gestures and language. Languages are the most natural way in which humans communicate. Despite the last decade of development, however, natural language processing technology is still incapable of wide adoption for such applications due to the complicated processing involved. Human gestures, alternatively, are believed to be the most convincing natural interface while still retaining the rich information of human-robot communication.

Human gestures usually refer to those gestures represented via the body, arms or hands of a human. In iHRCHA, human and robot co-workers mainly perform electronic manufacturing assembly tasks on a bench. It is unwise to monitor human body gestures because these do not change very often or obviously during the assembly task. It is also not practical to monitor human arm gestures because it is difficult to capture sufficient information to describe them. Hand gestures are appropriate to describe a human worker's working status during manual assembly. The posture of a human worker's hands varies according to the requirements of the current assembly task. Therefore, it is of interest to examine the relationship between a human worker's hand gestures and the actual assembly actions. Human intentions, as associated with assembly work, are defined as the assembly action that a given human worker is performing or intending to perform.

2.2 Hand gesture analysis in iHRCHA

In electronic manufacturing systems, assembly requires a human to perform accurate and quick operations, and in current manufacturing systems human operators carry out most of these operations by hand. Therefore, we look to identify human assembly actions by monitoring hand gestures. Previous visual object-recognition technologies cannot provide an effective method for tracking and identifying complex palm and fingertip gestures in complex environments. In this research, a human-robot interface based on human hand gestures is designed, and an effective algorithm for palm and fingertip recognition is proposed by analysing the RGB-D data of the hands. Although it is possible to analyse palm rotation information based on such RGB-D data, it is still not reliably accurate. Therefore, we design a simple glove for the human with a three-axis acceleration


sensor and a gyro-sensor attached for accurate collection of palm movement and rotation information. There are two types of hand gestures in iHRCHA (Fig. 2). One is the static gesture, which takes place instantaneously, and the other is the dynamic gesture, which takes place over time. The former can be described using a single frame of data from the data streams, while the latter is described using sequential frames of data.
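To make these two representations concrete, here is a minimal illustrative sketch (ours, not the authors' code; the class and field names are assumptions) of how a per-sample hand frame and the two gesture types could be organized, given the palm position, fingertip positions and glove readings collected at each 20 ms sampling step:

```python
# Illustrative data layout for the two gesture types (a sketch; class and
# field names are ours, not the paper's). A static gesture is described by a
# single frame; a dynamic gesture by the ordered sequence of frames.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class HandFrame:                                      # one 20 ms sample
    palm_xyz: Tuple[float, float, float]              # palm position from Kinect
    fingertips_xyz: List[Tuple[float, float, float]]  # up to five fingertips
    acceleration: Tuple[float, float, float]          # (ax, ay, az) from the glove
    angular_velocity: Tuple[float, float]             # (wx, wy) from the gyro

StaticGesture = HandFrame                             # a single frame of data
DynamicGesture = List[HandFrame]                      # sequential frames (trajectory)
```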

Figure 2. Human hand gesture using OpenSim [36]


For a static gesture, a single frame of data includes not only the 3D position of the palm and fingertips but also the rotation information. For a dynamic gesture, besides the static gesture data for each frame, it also contains the hand movement trajectory.

In this research, the sensor system for the human-robot interface is built from two sets of sensors: a vision system using Kinect, and a data glove using a three-axis acceleration sensor and a gyro. The information acquired by this sensor system is represented in two respects: the position of the human hands and fingertips, and the rotation information of the hands (including the 3D acceleration and angular velocity). Therefore, the action of a human's hand can be uniquely described by the combination of these two types of information.

2.3 Challenges for pattern recognition via data streams

In [37, 38], the authors list some particularly challenging issues for pattern recognition on data streams. Our iHRCHA is designed based on the following considerations:

1. Cost-effectiveness: The sensor should be cheap for practical deployment while providing promising results.

2. Naturalness: The interface system should not require the human to wear additional heavy devices or cause the human physical or psychological stress [39]. Therefore, a conventional, complicated data glove is not acceptable [40].

3. Interaction space: Traditionally, such systems require the human to stand within a fixed environment without moving. Moreover, the noisy background associated with a HAC can greatly affect the output of the system, even before considering the retrieval of 3D position information.

4. Outlier point detection on data streams: Data stream-based pattern recognition requires that the system can automatically detect outlier points and segment the data streams. Therefore, the cue study for detecting the segment containing a potential human action is important [41].

5. Responsiveness: When a human performs an action, the robotic system should respond in near real-time; 45 ms [42] is thought to be the threshold value for a real-time response, as achieved by a human. After we determine the start point, the length of the data segment should not be too long, so that the robot can analyse that data segment and respond in real-time.

3. Experimental setup and methodology

3.1 Experimental environment setup

A concept of human-robot coordinated assembly is shown in Fig. 3, and the control diagram for the whole system is shown in Fig. 4.

The 3D region where the human and robot collaboration takes place is called the "hazard zone". A Kinect camera is used to observe this hazard zone. Detection and tracking are only triggered when the human worker moves his hands into the "human/robot coordinated working area", as in Fig. 3-(a). There are two types of cameras mounted in a Kinect: a normal camera and an infrared camera. By combining these two cameras, one can obtain the depth image of an area of interest, as shown in Fig. 3-(a). Based on the depth image, Algorithm 1 is developed to calculate the position of the human hands and fingers in 2D. In Fig. 3-(b), we can see that Algorithm 1 can effectively detect the human operator's palm and fingertips when he performs different actions. We are mostly interested in the fingertip positions of the thumb, index finger and middle finger. However, in the algorithm, we try to detect all five fingertips within each 20 ms sampling interval. Sometimes it becomes more difficult to detect the third finger and the little finger when the human worker is grabbing or holding, compared with the moving case. However, we still use the five-fingertip data when training the HMMs. Moreover, we do not consider occlusion by the robotic manipulator.

The robot is supposed to stay in the standby zone and monitor the human worker. When a cooperation decision is made by the central control module, the robot carries out the task; afterwards, it goes back to standby status again.

A human gesture is represented in the two ways described in Sect. II: (1) in a static frame, and (2) via sequential frames. The former contains the information about the human hands, while the latter contains the information about the movement trajectory.

Algorithm 1. Position of the palm- and fingertip-detection algorithm
1. Based on the binary image, find the polygon of the hand
2. Based on the polygon, find the convex hull (Sklansky algorithm)
3. Find the convex hull vertices where θ < 160° as fingertip candidates
4. Find the center point of the region (shown in Fig. 3-(b)) as the palm position
5. Vertex points above the center point are fingertips
6. Use the Lucas-Kanade tracking algorithm to track the points of interest in 2D
7. Map the depth data of each detected point into height data
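As an illustration of Algorithm 1, the sketch below implements the same steps with OpenCV (our choice of library; the paper does not name an implementation, and the function name and thresholds are ours). It assumes a binary hand mask and an aligned depth map from the Kinect.

```python
# Sketch of Algorithm 1 with OpenCV 4 (illustrative, not the authors' code).
import cv2
import numpy as np

def detect_palm_and_fingertips(mask, depth, angle_thresh_deg=160.0):
    # 1. Polygon of the hand: largest external contour of the binary image.
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None, []
    hand = max(contours, key=cv2.contourArea)
    poly = cv2.approxPolyDP(hand, 0.01 * cv2.arcLength(hand, True), True)

    # 2. Convex hull of the polygon (OpenCV implements the Sklansky algorithm).
    hull = cv2.convexHull(poly, returnPoints=True).reshape(-1, 2)

    # 4. Palm position: centroid of the hand region.
    m = cv2.moments(hand)
    palm = np.array([m["m10"] / m["m00"], m["m01"] / m["m00"]])

    # 3 + 5. Fingertips: hull vertices with a sharp angle (theta < 160 deg)
    # that lie above the palm center (image y grows downwards).
    fingertips = []
    n = len(hull)
    for i in range(n):
        p_prev, p, p_next = hull[i - 1], hull[i], hull[(i + 1) % n]
        v1, v2 = p_prev - p, p_next - p
        cosang = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-9)
        angle = np.degrees(np.arccos(np.clip(cosang, -1.0, 1.0)))
        if angle < angle_thresh_deg and p[1] < palm[1]:
            fingertips.append(p)

    # 7. Map each detected 2D point to its depth ("height") value.
    points = [(int(x), int(y), float(depth[int(y), int(x)])) for x, y in fingertips]
    return palm, points
```

Step 6 (tracking the detected points between frames) could then be handled by cv2.calcOpticalFlowPyrLK on consecutive images.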

(Figure 3 labels: hand information descriptor for static frame; unfold palm, start to catch, start to hold; Kinect, Human, Robot, Assembly area.)

Figure 3. Concept of human-robot coordinated assembly. (a) Sensor system configuration. (b) Processing of human hand movement and rotation information.

Figure 4. Control chart of human-robot coordinated assembly. If we treat it as a robotic system, the human worker can be viewed as a disturbance which causes the robot co-worker to change its work flow. A "human intention estimation module" is used to recognize the human worker's actions based on the information from the sensor system.

The accelerometer and gyro are integrated in a chip on the working glove. This is a non-intrusive sensing method which never interferes with assembly actions. The measurement range of the three-axis acceleration sensor is from −3.6g to 3.6g (g: acceleration of gravity).

Here a denotes the movement acceleration, (ax, ay, az) denotes the three-axis acceleration (all acquired from the data glove), and (ωx, ωy) denotes the angular velocity. m is the mass of the hand and l is the distance between the mass point and the rotational axis. ml² is the moment of inertia (which is treated as a constant value in this study), and so we have the following equation:

C2 ∝ ωx² + ωy²    (12)

Trajectory descriptors using the modified chain code (r = 10) for the four trajectories in Fig. 7:
A  {44444473400002507000718870027107000030710000730030007070}
B  {434205000000000775030500030000737000030370}
C  {47220456003700878707515841003070005370}
D  {44463330510378230741100788782100036050000010}
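The exact symbol assignment of the modified chain code is not fully recoverable from the text above, so the following sketch (ours) only illustrates the general idea: a code is emitted at each sample, with 0 assumed to mean "no significant movement within radius r" and 1-8 assumed to index the eight 45-degree direction sectors. The function name and the example trajectory are illustrative.

```python
# Illustrative modified chain-code encoder (a sketch under assumed conventions,
# not the paper's exact scheme).
import numpy as np

def chain_code(points, r=10.0):
    codes = []
    anchor = np.asarray(points[0], dtype=float)
    for p in points[1:]:
        p = np.asarray(p, dtype=float)
        d = p - anchor
        if np.hypot(d[0], d[1]) < r:
            codes.append(0)                      # assumed: 0 = no significant move
            continue
        angle = np.arctan2(d[1], d[0]) % (2 * np.pi)
        codes.append(int(round(angle / (np.pi / 4))) % 8 + 1)  # assumed: 1..8 directions
        anchor = p                               # re-anchor after a significant move
    return "".join(str(c) for c in codes)

# Example: encode a short hypothetical palm trajectory sampled every 20 ms.
print(chain_code([(0, 0), (4, 0), (12, 1), (24, 2), (25, 14), (26, 27)], r=10))
```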

Figure 8. Data streams from the sensor system (y-axis) for the "move and grab" action (A1) and the "move and hold" action (A2) versus the length of the samples (x-axis) with 100 sampling steps. The former 50 and the latter 50 samples contain the "move and grab" action and the "move and hold" action twice, respectively. (a) Acceleration of the palm along the x-axis, y-axis and z-axis, respectively. (b) Rotation information of the palm around the x-axis and the y-axis, respectively. (c) Number of detected fingertips. (d) Area of the detected palm in a 2D panel. (e) and (f) The two criteria, palm movement acceleration and rotational energy, used for sample segmentation.

4.4 Segment start point decision

The segment should be neither too long nor too short. If the length is set too long, the robot cannot react to the human worker quickly. If the length is set too short, the data segment cannot represent any meaningful information. In this study, we suppose that all meaningful human actions are completed within 1 s, which corresponds to 50 time intervals in our experimental setup.

5. Experiment and discussion

5.1 Segment start point and length

When a human worker collaborates with a robot on assembly tasks, the robot must be aware of the task sequence that has been predefined. The only issue here is that the human worker cannot guarantee stable performance in the time domain. Accordingly, the robot must constantly monitor the human in order to know what to do next. In this case, the robot must detect some "trigger" action so that it can estimate which task the human is performing, or else is going to perform, by checking the task sequence defined in the database. We assume that grasping ("move and grab") and handling ("move and hold") are the two most typical "trigger" actions. In this section, these two actions of a human worker's left hand are recognized by the HMM model. We use the movement acceleration C1 and the rotational

energy C2 by comparing them with a certain threshold

value in order to detect the start point of a meaningful action. In this experiment, we observe the action samples (Fig. 8 - (e) and (f)) over time (each sample interval is 20 ms). From this observation, the segment length for both actions is chosen as 10 samples, forming a 0.2 s data segment (Table 4).
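A minimal sketch of this start-point detection and segmentation step is given below (ours, not the authors' code); it assumes C1 and C2 have already been computed per 20 ms sample and uses the thresholds and segment length listed in Table 4. The function name is illustrative.

```python
# Illustrative segmentation of the data streams (a sketch under the stated
# assumptions): a fixed-length segment starts where C1 or C2 exceeds its threshold.
import numpy as np

def extract_segments(c1, c2, features, thr_c1=1.7, thr_c2=5.4, seg_len=10):
    """Return feature segments starting at detected outlier (start) points."""
    c1, c2 = np.asarray(c1), np.asarray(c2)
    segments, t = [], 0
    while t + seg_len <= len(features):
        if c1[t] > thr_c1 or c2[t] > thr_c2:      # outlier point -> segment start
            segments.append(np.asarray(features[t:t + seg_len]))
            t += seg_len                          # skip past the extracted segment
        else:
            t += 1
    return segments
```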

Table 4. Start point detection and segment length decision

Motion type      Threshold for start point detection      Segment length
move and grab    1.7 (C1) / 5.4 (C2)                      10
move and hold    1.7 (C1) / 5.3 (C2)                      10

5.2 Feature vector for HMM

"Move and grab" action: In this experiment, we assume that this action usually occurs without any rotation about the 3D axes; therefore, the feature vector for training is the data vector from the data glove and the Kinect sensor. We define f_h = {ax, ay, az, xh, yh, zh}, where a denotes the acceleration and (xh, yh, zh) denotes the hand position in 3D.

"Move and hold" action: In this experiment, we assume that this action usually occurs without any coordinate shift along the 3D axes; therefore, the feature vector for training is just the data vector from the data glove. We define f_h = {ax, ay, az, ωx, ωy}, where a denotes the acceleration and ω denotes the angular velocity.

We sample the data segments (Fig. 8 - (a), (b), (c), (d)) for training based on the start-point detection and segment-length decision method described in the previous subsection.

5.3 HMM for the identification of the two actions in Fig. 7

We acquire the sample segments using the segmentation method introduced in the previous section and use the Baum-Welch algorithm to train an HMM on these samples for each action respectively. 50 sets of experiments are performed for the "move and grab" action and another 50 for the "move and hold" action. In each case, 30 sets are used as the training data and 20 sets are used to validate the HMM being trained. The created HMM is denoted λ = (A, B, Π), as shown in Fig. 9. Each action lasts for less than 0.2 s (the sample segment length is 10 at most). After training HMMs with more than three hidden states, we find that there is no obvious physical meaning for some of the hidden states. Therefore, we choose a three-hidden-state HMM to model a given action. On the other hand, because the length of each sample is short, we should avoid over-fitting during the training process. Three hidden states are thus chosen as the most appropriate for modelling this hand-movement action.
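To illustrate this training and recognition step, the sketch below uses the third-party hmmlearn package (our choice; the paper does not name an implementation, and Gaussian emissions are an assumption for the continuous feature vectors). One three-hidden-state model is fitted per action with Baum-Welch, and a new 10-sample segment is assigned to whichever model gives the higher log-likelihood.

```python
# Illustrative HMM training and classification with hmmlearn (a sketch, not the
# authors' implementation).
import numpy as np
from hmmlearn.hmm import GaussianHMM

def train_action_model(segments, n_states=3):
    # segments: list of (segment_length x feature_dim) arrays, e.g. 30 training sets
    X = np.vstack(segments)
    lengths = [len(s) for s in segments]
    model = GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=50)
    model.fit(X, lengths)                 # Baum-Welch estimation of lambda = (A, B, Pi)
    return model

def classify(segment, models):
    # models: e.g. {"move and grab": hmm_grab, "move and hold": hmm_hold}
    scores = {name: m.score(segment) for name, m in models.items()}  # log-likelihoods
    return max(scores, key=scores.get), scores
```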

Figure 9. HMM construction with three hidden states for the "move and grab" and "move and hold" actions. Each hidden state represents a single frame of a gesture.

Each of the hidden states actually represents a frame of the "move and grab" or "move and hold" action. Fig. 9 shows the representative actions within "move and grab" and "move and hold", as described in Fig. 8.

Table 5 shows the performance of the trained HMMs in identifying the "move and grab" and "move and hold" actions. The log likelihood (Equation 9) is the main criterion for evaluating the possibility that a given data segment was generated by a given HMM. We define an experiment with P positive instances and N negative instances for a given condition. According to the experimental setup, both P and N are equal to 20. We use the true-positive (TP) rate to evaluate the sensitivity (Equation 13) and the true-negative (TN) rate to evaluate the specificity (Equation 14) of the trained classifier: TP gives the number of successful identifications when the sample data contain the pattern to be identified, while TN gives the number of correct rejections when the sample data do not contain that pattern. Moreover, the identification accuracy rate γ_accuracy is calculated according to Equation 15 [44].


Table 5. Hidden states and recognition performance for the patterns "move and grab" and "move and hold"

Gesture pattern   Parameter   Chain code   C1    C2    Palm area   Fingertip number   TP rate   TN rate   Accuracy rate
Move and grab     S1          2            1.7   5.5   3660        2                  0.85      0.85      0.85
                  S2          2            2.1   5.6   2990        1
                  S3          3            2.1   5.4   2770        1
Move and hold     S1          1            2.1   5.5   2990        2                  0.85      0.90      0.875
                  S2          3            1.9   5.5   3880        3
                  S3          1            2.1   5.5   2880        1

TP_rate = TP / P    (13)

TN_rate = TN / N    (14)

γ_accuracy = (TP + TN) / (P + N)    (15)

where P denotes the number of samples containing the movement pattern to be detected and N denotes the number of samples which do not contain the movement pattern to be detected.
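As a worked example of Equations 13-15 (our sketch; the counts TP = 17 and TN = 18 are inferred from the reported rates rather than stated in the paper), the "move and hold" row of Table 5 follows from P = N = 20:

```python
def recognition_rates(tp, tn, p, n):
    """Equations 13-15: sensitivity, specificity and identification accuracy."""
    tp_rate = tp / p                   # Eq. 13
    tn_rate = tn / n                   # Eq. 14
    accuracy = (tp + tn) / (p + n)     # Eq. 15
    return tp_rate, tn_rate, accuracy

# "Move and hold" with P = N = 20 validation samples per class:
print(recognition_rates(17, 18, 20, 20))   # -> (0.85, 0.9, 0.875)
```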

It is noted that the classifier generated by this method has a high TP rate as well as a relatively high TN rate, yielding an accuracy rate greater than 85%. This suggests that the robot co-worker makes the right decision most of the time. It should also be pointed out that, because the pattern recognition is based on a 0.2 s-long segment, the robot is capable of responding to the human worker's actions in near real-time. This feature is significant, not only for carrying out effective and efficient assembly tasks by a human and a robot, but also for safety.

From the illustrated example, when recognizing the gesture patterns "move and grab" and "move and hold", our approach yields a high true-positive rate and a high accuracy rate. It demonstrates that our interface performs better than previous, ordinary vision-based HRC interfaces, since we consider more information in describing the gesture of the human's hand and fingertips while improving the processing algorithms. This system can easily be extended to recognize more action patterns. Future research will focus on human intention estimation by testing various stream-pattern recognition methods. One direction would be to consider the reconstruction of the data in subsequent frames based on the currently recognized action patterns.

6. Conclusion

The design, development and evaluation of a novel natural human-robot interface for human gesture recognition within HRC have been presented. An RGB-D camera is used to detect the human operator's palm and fingertip positions, and an accelerometer and gyro-sensor attached to a glove are used to detect the rotation and movement of the hand. This offers a novel research element for pattern recognition via data streams in a hybrid manufacturing environment. Based on this technology, an HMM-based method is adopted to identify human workers' gestures automatically. This shows that the proposed system can automatically segment the data streams and recognize the action patterns thus represented with an acceptable accuracy ratio.

The main contribution of this paper lies in the design of an iHRCHA cell within which a sensor system serves as a natural interface between a human and a robot co-worker. With the help of this system, several challenges in realizing human-robot collaboration in HRC are addressed. In particular, this system is affordable while not violating naturalness, it can effectively monitor hazardous areas where collaboration occurs, and it can automatically detect the outlier point, segment the data streams and reduce the time needed to process the data. Therefore, the robot co-worker can respond to an action within around 0.2 s. This short response time also reduces the potential danger of a robot co-worker hurting the human worker.

7. Acknowledgements

This material is based upon work funded by the Natural Science Foundation of China under Grant No. 61203360, Zhejiang Provincial Natural Science Foundation of China under Grant Nos. LQ12F03001, LQ12D01001 and LY12F01002, Ningbo City Natural Science Foundation of China under Grant Nos. 2012A610009 and 2012A610043, the State Key Laboratory of Robotics and Systems (HIT) Foundation of China under Grant No. SKLRS-2012-MS-06, and the China Post-doctoral Science Foundation under Grant No. 2013M531022.

8. References

[1] J. Krüger, T. K. Lien, and A. Verl. Cooperation of human and machines in assembly lines. CIRP Annals - Manufacturing Technology, 58(2):628–646, 2009.
[2] A. Bannat, T. Bautze, M. Beetz, J. Blume, K. Diepold, C. Ertelt, F. Geiger, T. Gmeiner, T. Gyger, A. Knoll, et al. Artificial cognition in production systems.


Automation Science and Engineering, IEEE Transactions on, 8(1):148–174, 2011.
[3] R. Haraguchi, Y. Domae, K. Shiratsuchi, Y. Kitaaki, H. Okuda, A. Noda, K. Sumi, T. Matsuno, S. Kaneko, and T. Fukuda. Development of production robot system that can assemble products with cable and connector. Journal of Robotics and Mechatronics, 23(6):939, 2011.
[4] K. Dautenhahn. Methodology and themes of human-robot interaction: a growing research field. International Journal of Advanced Robotic Systems, 4(1), 2007.
[5] S. A. Green, M. Billinghurst, X. Chen, and J. G. Chase. Human robot collaboration: an augmented reality approach - a literature review and analysis. In ASME 2007 International Design Engineering Technical Conferences and Computers and Information in Engineering Conference, pages 117–126. American Society of Mechanical Engineers, 2007.
[6] F. Chen, K. Sekiyama, H. Sasaki, J. Huang, B. Sun, and T. Fukuda. An assembly strategy scheduling method for human and robot coordinated cell manufacturing. International Journal of Intelligent Computing and Cybernetics, 4:487–510, 2011.
[7] J. Krüger, V. Katschinski, D. Surdilovic, and G. Schreck. PISA: next generation of flexible assembly systems - from initial ideas to industrial prototypes. In Robotics (ISR), 2010 41st International Symposium on and 2010 6th German Conference on Robotics (ROBOTIK), pages 1–6. VDE, 2010.
[8] New Energy and Industrial Technology Development Organization, "Project for Strategic Development of Advanced Robotics Elemental Technologies", http://www.nedo.go.jp/english, accessed on 01 Apr 2006.
[9] National Science Foundation, "National Robotics Initiative", http://www.nsf.gov/pubs/2014/nsf14500, accessed on 21 Jan 2014.
[10] European Robotics Technology Platform, "The Strategic Research Agenda for Robotics", http://www.roboticsplatform.eu/sra, accessed on 01 Jul 2009.
[11] M. Morioka, S. Adachi, S. Sakakibara, J. T. C. Tan, R. Kato, and T. Arai. Cooperation between a high-power robot and a human by functional safety. Journal of Robotics and Mechatronics, 23(6):926, 2011.
[12] S. Takata and T. Hirano. Human and robot allocation method for hybrid assembly systems. CIRP Annals - Manufacturing Technology, 2011.
[13] F. Chen, K. Sekiyama, F. Cannella, and T. Fukuda. Optimal subtask allocation for human and robot collaboration within hybrid assembly system. Automation Science and Engineering, IEEE Transactions on, 11(4):1065–1075, Oct 2014.
[14] F. Duan, J. T. C. Tan, J. G. Tong, R. Kato, and T. Arai. Application of the assembly skill transfer system in


an actual cellular manufacturing system. Automation Science and Engineering, IEEE Transactions on, (99):1–1, 2012.
[15] F. Wallhoff, J. Blume, A. Bannat, W. Rösel, C. Lenz, and A. Knoll. A skill-based approach towards hybrid assembly. Advanced Engineering Informatics, 24(3):329–339, 2010.
[16] T. Salter, K. Dautenhahn, and R. Boekhorst. Learning about natural human-robot interaction styles. Robotics and Autonomous Systems, 54(2):127–134, 2006.
[17] A. A. Chaaraoui, P. Climent-Pérez, and F. Flórez-Revuelta. A review on vision techniques applied to human behaviour analysis for ambient-assisted living. Expert Systems with Applications, 2012.
[18] C. L. Bethel and R. R. Murphy. Review of human studies methods in HRI and recommendations. International Journal of Social Robotics, 2(4):347–359, 2010.
[19] P. Henry, M. Krainin, E. Herbst, X. Ren, and D. Fox. RGB-D mapping: using depth cameras for dense 3D modeling of indoor environments. In the 12th International Symposium on Experimental Robotics (ISER), volume 20, pages 22–25, 2010.
[20] K. Lai, L. Bo, X. Ren, and D. Fox. Sparse distance learning for object recognition combining RGB and depth information. In Robotics and Automation (ICRA), 2011 IEEE International Conference on, pages 4007–4013. IEEE, 2011.
[21] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake. Real-time human pose recognition in parts from single depth images. In CVPR, volume 2, page 7, 2011.
[22] K. Lai, L. Bo, X. Ren, and D. Fox. A large-scale hierarchical multi-view RGB-D object dataset. In Robotics and Automation (ICRA), 2011 IEEE International Conference on, pages 1817–1824. IEEE, 2011.
[23] R. B. Rusu and S. Cousins. 3D is here: Point Cloud Library (PCL). In Robotics and Automation (ICRA), 2011 IEEE International Conference on, pages 1–4. IEEE, 2011.
[24] B. Y. L. Li, A. S. Mian, W. Liu, and A. Krishna. Using Kinect for face recognition under varying poses, expressions, illumination and disguise. In Applications of Computer Vision (WACV), 2013 IEEE Workshop on, pages 186–192. IEEE, 2013.
[25] P. Gil Vázquez, C. Mateo Agulló, and F. Torres Medina. 3D visual sensing of human hand for remote operation of a robotic hand. International Journal of Advanced Robotic Systems, 11:26, 2014.
[26] Z. Chen, L. Zheng, Y. Chen, and Y. Zhang. 2D hand tracking based on flocking with obstacle avoidance. International Journal of Advanced Robotic Systems, 11:22, 2014.

[27] J. A. Corrales Ramón, G. J. García Gómez, F. Torres Medina, and V. Perdereau. Cooperative tasks between humans and robots in industrial environments. International Journal of Advanced Robotic Systems, 9:1–10, 2012.
[28] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.
[29] J. Hu, M. K. Brown, and W. Turin. HMM based online handwriting recognition. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 18(10):1039–1045, 1996.
[30] B. L. Van, S. Garcia-Salicetti, and B. Dorizzi. On using the Viterbi path along with HMM likelihood information for online signature verification. Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on, 37(5):1237–1247, 2007.
[31] R. R. Brooks, J. M. Schwier, and C. Griffin. Behavior detection using confidence intervals of hidden Markov models. Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on, 39(6):1484–1492, 2009.
[32] A. Vakanski, I. Mantegh, A. Irish, and F. Janabi-Sharifi. Trajectory learning for robot programming by demonstration using hidden Markov model and dynamic time warping. Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on, 42(4):1039–1052, 2012.
[33] J. Yamato, J. Ohya, and K. Ishii. Recognizing human action in time-sequential images using hidden Markov model. In Computer Vision and Pattern Recognition, 1992. Proceedings CVPR'92, 1992 IEEE Computer Society Conference on, pages 379–385. IEEE, 1992.
[34] Y. Yamada, T. Morizono, Y. Umetani, and H. Konosu. Warning: to err is human [human-friendly robot dependability]. Robotics & Automation Magazine, IEEE, 11(2):34–45, 2004.

[35] C. Lenz, A. Sotzek, T. Roder, H. Radrich, A. Knoll, M. Huber, and S. Glasauer. Human workflow analysis using 3D occupancy grid hand tracking in a human-robot collaboration scenario. In Intelligent Robots and Systems (IROS), 2011 IEEE/RSJ International Conference on, pages 3375–3380. IEEE, 2011.
[36] S. L. Delp, F. C. Anderson, A. S. Arnold, P. Loan, A. Habib, C. T. John, E. Guendelman, and D. G. Thelen. OpenSim: open-source software to create and analyze dynamic simulations of movement. Biomedical Engineering, IEEE Transactions on, 54(11):1940–1950, 2007.
[37] J. P. Wachs, M. Kölsch, H. Stern, and Y. Edan. Vision-based hand-gesture applications. Communications of the ACM, 54(2):60–71, 2011.
[38] M. M. Gaber, A. Zaslavsky, and S. Krishnaswamy. Mining data streams: a review. ACM SIGMOD Record, 34(2):18–26, 2005.
[39] J. Triesch and C. Von Der Malsburg. Robotic gesture recognition by cue combination. Proceedings of the Informatik, 98:21–25, 1998.
[40] D. J. Sturman and D. Zeltzer. A survey of glove-based input. Computer Graphics and Applications, IEEE, 14(1):30–39, 1994.
[41] T. Baudel and M. Beaudouin-Lafon. Charade: remote control of objects using free-hand gestures. Communications of the ACM, 36(7):28–35, 1993.
[42] T. B. Sheridan and W. R. Ferrell. Remote manipulative control with transmission delay. Human Factors in Electronics, IEEE Transactions on, (1):25–29, 1963.
[43] E. Bribiesca. A new chain code. Pattern Recognition, 32(2):235–251, 1999.
[44] R. Kohavi and F. Provost. Glossary of terms. Machine Learning, 30(2-3):271–274, 1998.
