Gesture-based Language for Diver-Robot Underwater Interaction

D. Chiarella, M. Bibuli, G. Bruzzone, M. Caccia, A. Ranieri, E. Zereik
National Research Council - Institute of Studies on Intelligent Systems for Automation, CNR-ISSIA, Via De Marini 6, 16149 Genova, Italy
Email: marco.bibuli | chiarella @ge.issia.cnr.it

L. Marconi, P. Cutugno
National Research Council - Institute of Computational Linguistics, CNR-ILC, Via De Marini 6, 16149 Genova, Italy
Email: [email protected]

Abstract—The underwater environment is characterized by harsh conditions and is difficult to monitor. The CADDY project deals with the development of a companion robot devoted to supporting and monitoring human operations and activities during the dive. In this scenario the communication and correct reception of messages between the diver and the robot are essential for the success of the dive goals. However, the underwater environment poses a set of technical constraints that severely limit the communication possibilities. For these reasons the proposed solution is to develop a communication language based on the consolidated and standardized diver gestures commonly employed during professional and recreational dives, leading to the definition of a CADDY language, called CADDIAN, and of a communication protocol. This article focuses on the creation of the language, providing its alphabet, syntax and semantics; the gesture recognition part, which is still in progress, will be described in future work.

I. INTRODUCTION

Professional and recreational divers usually operate in environments characterized by harsh conditions that are difficult to monitor; in such a context any sudden event forcing the diver to deal with an emergency, such as a technical problem or a wrong action, may jeopardize the diving campaign or even lead to worse consequences involving the safety of the diver himself. In order to face such situations, standard procedures suggest pairing up divers and following well-defined rules to reduce the chance of accidents. However, during extreme dives these procedures may not be sufficient to avoid dangerous occurrences. With the aim of improving the safety level during dives, the EU-funded CADDY project was conceived with the idea of transferring robotic technology into diving procedures. The CADDY project mainly deals with the development of a companion/buddy robot devoted to supporting the human operations and activities during the dive, as well as to monitoring the status of the diver so as to prevent harmful occurrences. With the aim of providing the diver with a reliable and helpful robotic support vehicle, one of the major issues is the development of a communication and interaction methodology allowing the diver and the robot to actively cooperate for the fulfilment of the required tasks during the dives.

The communication and correct reception of messages between the diver and the underwater robot are essential for the success of the dive goals. However, the underwater environment poses a set of technical constraints that severely limit the communication possibilities. The strong attenuation of electromagnetic waves makes WiFi/radio communications unreliable already at a depth of 0.5 m, while optical communication has a limited range due to water reverberation and to scattering caused by suspended sediments [1]. The most reliable solution for underwater communication is given by acoustic technology, which has two main drawbacks: the high price of the devices and a very low data transmission rate [2], [3]. For these reasons, and aiming to provide the diver with the most "natural" underwater communication method, the solution proposed within the framework of the CADDY project is to develop a communication language based on the consolidated and standardized diver gestures that are commonly employed during professional and recreational dives, thus leading to the definition of a CADDY language called CADDIAN. Research on gesture-based communication between humans and robots, also in relation to harsh and uncertain environments such as the underwater one, represents one of the beyond-state-of-the-art challenges in robotics. Indeed, at the present state of the art, a wide number of works have been developed in order to endow robots with robust perceiving capabilities, so as to make them highly reactive to the external world and to occurring events. To improve the coexistence between humans and technology, above all within the robotic field, one of the main goals is to obtain a more "natural" interaction while also guaranteeing a suitable robustness of the system. To this aim, gesture recognition seems to be one of the most promising techniques, since it comes very naturally to humans. Many works have been presented in the literature addressing the problem of exploiting hand gesture recognition algorithms within different contexts, such as robotics or computer science: many systems have been described that allow users to interact with a computer by controlling the mouse pointer through hand signs. These works are almost entirely developed for applications 'in air', meaning that the environment is much simpler and less harsh than an underwater context.

For example, in [4] the importance of gesture-based human-computer interaction is stressed and the most employable techniques are reviewed; a real-time tracking algorithm based on adaptive skin detection and motion analysis is implemented in [5]. However, in that work the authors strongly constrain the task by assuming that the hand is the only moving object; moreover, their system is badly affected by camera noise, which heavily conditions the tracking phase. Another gesture recognition system, based on 3D information from stereoscopic cameras, is described in [6]. This system, based on a learning algorithm, turns out to be heavy since it employs an online iterative training procedure to refine the classifier using additional training images. A different work is presented in [7], where gesture recognition is exploited to help children, in particular those with disabilities, to learn through a new interactive paradigm, focusing in detail on a music-related context. Also in this work, machine learning techniques are exploited to discern between two gestures, one impulsive and the other continuous. Finally, the work in [8] deals with a system aimed at hand gesture recognition for the interaction between the user and a videogame. It relies on convex hull and convexity defect computation; hence it seems prone to robustness issues and to problems in distinguishing gestures when many possible poses are available. Another advantage of dealing with 'in air' applications is the possibility of integrating cameras with further sensors such as ToF (Time of Flight) and IR (Infra-Red) systems, which are not employable in water. Examples of user-computer interaction within a gaming context are presented in [9], which facilitates the integration of full-body control with virtual reality applications and videogames using OpenNI-compliant depth sensors, and in [10], [11] and [12], works that employ data from both RGB and ToF cameras to realize hand gesture recognition for some specific simple games. Developments in the virtual reality and computer games research branch have also given impetus to other sectors, which have borrowed from them the idea of employing IR and ToF systems, as demonstrated by many works in the literature dealing with different applications, such as [13], [14], [15], [16] and [17]; a good review of ToF and IR sensors, as well as of gesture recognition methods, is given in [18]. More specifically, within a robotic context, many papers have been presented and many different vision techniques have been employed: for example in [19] hand motion detection and reconstruction are obtained through a monocular camera, exploiting an articulated model with hand kinematics constraints; each finger is modeled as a planar robot arm with 3 joints and 3 links. In this work the main goal is the tracking and reconstruction of the human hand motion. The idea is interesting, above all for the articulated hand model employed and the derived constraints on the hand motion, which reduce the problem complexity; however, no indication of the processing time is provided. As a matter of fact, this system seems to be computationally heavy, given all the techniques (silhouette extraction, pose estimation both for the global palm posture and for each finger joint, reprojection and model fitting) exploited to detect the hand motion.
Finally, dealing with motion, it is not clear whether the system is able to correctly discriminate different static hand postures and with which accuracy. Multiple cameras can be employed to track the full hand motion, either from 3D points reconstructed using a stereoscopic system as in [20] or by building up the whole 3D hand model as in [21]. Again, the work in [20] aims to track the hand motion through the reconstruction of the 3D trajectories followed by the hand. The described system employs a large set of hand models (articulated hand, skin deformation, smooth surface) and has to handle a huge amount of information. Moreover, many processing steps are required (stereo correspondences and feature matching, 3D reconstruction based on the Expectation-Maximization Iterative Closest Point and Surface algorithm, hand tracking), so it also appears too computationally heavy, even though the authors do not provide any details on performance. Even for 'in air' applications, there are many computer vision problems that are relevant to hand recognition systems, essentially in relation to the robustness and repeatability of the recognition procedure. They can be addressed by exploiting a wide number of different techniques such as geometric classifiers, Principal Component Analysis (PCA), silhouette recognition, feature extraction, Haar classifiers, learning algorithms and so on. To this aim, the work presented in [22] applies a hand gesture recognition algorithm to the control system of an intelligent wheelchair for people with physical accessibility problems. The proposed system is based on classical Haar-like features and the AdaBoost learning algorithm, thus needing to be suitably trained. This work does not present actual hand posture recognition: only one hand pose (closed fist) is considered and only the (relative) spatial position of the hand is used to retrieve the command required by the user. In fact, depending on the position of the user's hand with respect to the monitor mounted on board the wheelchair, the system understands 'turn left/right', 'go forward/backward' and 'stop' commands. No discussion of system processing time or of robustness to changing illumination conditions or to the changing background due to the wheelchair motion is reported. In [23] a two-level approach is proposed, implementing posture recognition with Haar-like features and the AdaBoost learning algorithm at the lower level and then proposing (but not yet implementing) a context-free grammar at the higher level for the linguistic gesture recognition. The system provided real-time performance but only for the lower level which, being based on a classifier, needs the usual large training dataset. Moreover, very few hand postures have been included in the system database and they are very different from each other, thus simplifying the recognition task. Finally, the system has not been tested in a relevant environment, since a very uniform background has been employed. Many other techniques can be applied to the gesture recognition problem, such as hidden conditional random fields for the recognition of dynamic human arm and head gestures in [24], support vector machines for real-time hand detection in [25], and artificial neural networks in the very recent [26]. In the first one, head and arm gesture recognition is addressed, exploiting a 3D cylindrical body model (made up of head, torso, arms and forearms) estimated from each image frame by a stereo camera system. The system performance in terms of accuracy seems interesting but there are some issues: first of all, the human figure is in the foreground.
Furthermore, as in most model-based methods, the required computational resources seem high (a situation made even worse by the stereo processing). Actually, the authors neither claim real-time performance nor provide processing times relative to the performed tests. Finally, the system needs a lot of iterations for the training phase. In [25] bag-of-features (using the SIFT keypoint detector) and a support vector machine are combined to recognize hand postures. Hand detection and tracking are achieved through skin detection and contour comparison, aided by a face subtraction step that adds robustness to the system (by removing a big skin-colored blob from the input image). Furthermore, a grammar is defined to generate commands to control and interact with a computer application or a videogame. The proposed approach is interesting and achieves real-time performance and high accuracy, including a certain flexibility with respect to in-plane rotations thanks to the employment of the SIFT feature extractor. However, besides requiring a very heavy training step for the SVM classifier, both the SIFT keypoint extraction and the bag-of-features clustering (and the resulting vocabulary building) are demanding in terms of computational load and require heavy offline pre-processing. Apart from the presence of the face in the image (solved by the face subtraction step), the background appearing in the experiments is not very challenging; moreover, issues such as camera image quality, the size of the training dataset and the number of clusters in the bag-of-features badly affect the accuracy and robustness of the system. The work in [26] evaluates the possibility of employing an artificial neural network for the recognition of different types of gestures (hand and arm, body, head and face). The work is very general and mostly focused on the exploitation of the neural network. Such an approach needs a heavy training phase and the suggested procedure turns out to be a supervised learning technique; hence, additional initialization efforts are required. Experimental results are completely missing and no discussion of illumination, background, processing time or accuracy is reported. Finally, a survey of techniques involving hidden Markov models, particle filtering and condensation, finite-state machines, optical flow, skin color and connectionist models is provided in [27]. Such a survey, besides reviewing a high number of different techniques, suggests combining different approaches to increase reliability in gesture recognition tasks. However, it further underlines problems suffered by statistical methods such as hidden Markov models: they are computationally expensive, require a very large amount of training data and their performance is strongly affected and limited by the characteristics of such training datasets. The great variety of the works referenced above testifies to how important and widespread the development of a robust and effective natural man-machine interaction system would be, and to how far the solution to this problem still is from being found. Moreover, all the works presented in the literature deal with applications 'in air', where the operational conditions are much simpler (even if the environment is complex in these cases): the underwater environment poses many further problems such as visibility, illumination, cloudy water, bubbles occluding the captured scene, and constraints on the range to be kept between the diver and the robot. All these issues underline the innovation and utility of the system proposed in this paper.
This said, it is clear that the objectives that the CADDY project pursues entail an amount of research work that cannot be easily summarized in this paper without losing its many facets: for these reasons, this article describes the initial part of it, namely the creation of the language in terms of alphabet, syntax, semantics and the communication protocol that the diver must follow to communicate with the AUV. In the last part of the present work a translation table between CADDIAN and its semantics is provided; after that, an initial list of gestures and an example of mapping gestures to syntax and syntax to semantics are given.

II. THE CADDY LANGUAGE

A. A human-robot interaction language based on gestures

The creation of the human-robot interaction language (hereinafter referred to as H) was based on the use of sign language; however, for better readability, signs have been mapped to easily writable symbols such as the letters of the Latin alphabet. This bijective mapping function translates from the domain of signs to our alphabet and vice versa, as depicted in Figure 1. The gesture or sequence of gestures of language H and the corresponding characters or sequences of characters of the alphabet Σ are also mapped by a semantic function that translates them into commands/messages. The encoding and decoding of gestures are assigned to a classifier: the cardinality of the alphabet, and therefore of the gesture set, depends on the classification capability. The more dimensions the classifier can discern, the more gestures/symbols of the alphabet we can have; however, the gestures should be feasible in the underwater environment and should be as intuitive as possible, to cope both with the learning phase of the language and with divers' acceptance. In the construction of the language, the issued statements have been defined sequentially, thus allowing the recipient to synchronize with the issuer, and with boundaries to ensure efficient interpretation. The language has been defined so that the interpreter may rely on a good level of redundancy to prevent misunderstanding of the issuer's commands. In addition, the elements of the syntax, i.e. the gestures, are mapped onto the elements of "natural" syntax in order to be easily learnt.

B. A language strictly context dependent

CADDIAN is a language for communication between the diver and the robot, so the list of identified messages/commands is strictly context dependent. The first step has been the definition of a list of commands/messages to be included in the CADDY language: during this activity the environment and the possible tasks that could be entrusted to CADDY were examined. Currently the number of messages is around fifty-one. The commands/messages are divided into five groups: Problems (9), Movement (at least 13), Setting variables (9), Feedback (3), Works/tasks (at least 13).
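To make the bijection of Figure 1 concrete, the following minimal sketch keeps the gesture-to-symbol table in one direction only and derives the inverse automatically, so the one-to-one property is checked by construction. The gesture labels and the subset of Σ used here are illustrative placeholders, not the actual CADDY tables.

```python
# Minimal sketch of the bijective mapping between gesture labels (as produced
# by a classifier) and symbols of the alphabet Sigma.
# The gesture names and the symbol subset below are illustrative placeholders.

GESTURE_TO_SYMBOL = {
    "hand_to_ear": "E",        # hypothetical label for the "ear problem" gesture
    "flat_hand_down": "L",     # hypothetical label for the "level" gesture
    "thumb_up": "ok",
    "thumb_down": "no",
    "start_delimiter": "A",    # start-of-message gesture
    "end_delimiter": "\u2200", # end-of-message gesture, written as the symbol '∀'
}

# The inverse direction is derived automatically, which guarantees bijectivity
# as long as no symbol is repeated on the right-hand side.
SYMBOL_TO_GESTURE = {s: g for g, s in GESTURE_TO_SYMBOL.items()}
assert len(SYMBOL_TO_GESTURE) == len(GESTURE_TO_SYMBOL), "mapping must be one-to-one"

def encode(gestures):
    """Translate a recognised gesture sequence into a string over Sigma."""
    return [GESTURE_TO_SYMBOL[g] for g in gestures]

def decode(symbols):
    """Translate a symbol sequence back into the gesture domain."""
    return [SYMBOL_TO_GESTURE[s] for s in symbols]

if __name__ == "__main__":
    seq = ["start_delimiter", "thumb_up", "end_delimiter"]
    symbols = encode(seq)
    print(symbols)                 # ['A', 'ok', '∀']
    assert decode(symbols) == seq  # round trip, as required by a bijection
```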

Fig. 1. Example of written representation of gestures

Fig. 2. Connections among gestures, syntax and semantics

TABLE I. LIST OF COMMANDS

Problems:
  I have an ear problem
  I'm out of breath
  I'm out of air [air almost over]
  Something is wrong [diver]
  I depleted air
  Something is wrong [environment]
  I'm cold
  I have a cramp
  I have vertigo

Movement:
  Take me to the boat
  Take me to the point of interest
  You lead (I follow you)
  I lead (you follow me)
  Go X Y, X ∈ Direction, Y ∈ N
  Return to/come X, X ∈ Places

Interrupt:
  Stop [interruption of action]
  Let's go [continue previous action]
  Abort mission
  General evacuation

Setting variables:
  Slow down/Accelerate
  Set point of interest (henceforth any action may refer to a point of interest)
  Level Off (CADDY cannot fall below this level, no matter what the diver says: the robot interrupts any action if the action forces it to break this rule)
  Keep this level (any action is carried out at this level)
  Free level ("Keep this level" command does not apply any more)
  Give me air (switch on the on-board oxygen cylinder)
  No more air (switch off the on-board oxygen cylinder)
  Give me light (switch on the on-board lights)
  No more light (switch off the on-board lights)

Feedback:
  Ok (answer to repetition of the list of gestures)
  No (answer to repetition of the list of gestures)
  I don't understand (repeat please)

Works:
  Wait n minutes, n ∈ N
  Tessellation X * Y area, X, Y ∈ N
  Tessellation of point of interest/boat/here
  Photograph of X * Y area, X, Y ∈ N
  Photograph of point of interest/boat/here
  Tell me what you're doing
  Carry a tool for me
  Stop carrying the tool for me [release]
  Do this task or list of tasks n times, n ∈ N

Direction = {ahead, back, left, right, up, down}
Places = {point of interest, boat, here}

C. Communication protocol and error handling

After defining the commands, a communication protocol with error handling has been proposed: in fact, given the harsh environment, very careful control of communication errors is mandatory for the communication between the underwater vehicle and the diver. The communication protocol ensures strict cooperation between the diver and the robot: the diver is able to understand whether the task/mission required of the vehicle has been completed, as well as its progress.

To develop the communication protocol, and therefore its language, we took into consideration that the issued statements have to be sequential, thus allowing synchronization between human and robot, and have to be delimited by boundaries to ensure efficient interpretation. We suppose that CADDY has three light emitters (red, green, orange) or something similar with the same attributes (i.e. three distinguishable statuses). Let them be:

  Green = finished work, waiting for orders (IDLE state)
  Orange = work in progress (BUSY state)
  Red = FAILURE/ERROR state

The protocol development takes into account three possible scenarios:

1) CADDY does not understand a gesture of the command: the classifier identifies the gesture as not belonging to the alphabet. An error signal is emitted and the gesture has to be repeated by the diver: the sequence/command is not aborted. In this scenario, the error could provide information on the effectiveness of the classifier (i.e. was it a false negative?) and on the state of health of the diver (a diver who is feeling unwell has a higher probability of missing a gesture).

2) CADDY understands the gesture (i.e. the gesture belongs to our alphabet), but the message is not semantically valid. An error signal is emitted (different from the one of the first case): the sequence of gestures is aborted and must be repeated. In this scenario, the error could provide information on the effectiveness of the classifier (i.e. misclassification) and on the health status of the diver (a diver who is feeling unwell has a higher probability of issuing a wrong gesture).

3) CADDY understands the gesture (i.e. the gesture belongs to our alphabet), and the message is semantically valid (it has a meaning consistent with our environment), but it is not what the diver wanted (i.e. the diver changed his mind or made a wrong gesture).

To treat the third case we have to introduce the definition of "mission" as a series of arbitrary commands. The diver teaches CADDY the mission to be performed and CADDY, before starting it, repeats what it understood and waits for confirmation from the diver. More specifically, once the mission is taught, CADDY repeats every command (syntactically and semantically valid) and waits for confirmation of correct reception: in case of error, the diver does not confirm the command and repeats the series of gestures that form it. For optimization and disaster-recovery reasons every mistake and every action of the robot are logged.

In the definition of the language, two aspects of interrogability of the robot were also taken into account:

1) The diver must understand whether the task/mission entrusted to CADDY has been terminated.
2) The diver must be able to query CADDY on the progress of a lengthy mission.

In the first case, CADDY simply turns on a green light and remains stationary. In the second case we suppose that, when CADDY is in operation and executing a task, if the diver approaches within a predefined range of X meters (e.g. 5 meters), it stops, remaining however in the BUSY state. In this situation it can:

• Be questioned on the progress of the work (see the "CHECK" command).
• Return to the IDLE state, erasing the current mission (see the "ABORT MISSION" command).
• Return to the IDLE state and report an emergency (see subsection II-F, Problems, and the semantic value of ă = clear mission).
• Return to the assigned mission if the diver moves away to a distance greater than X meters.
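The error-handling behaviour described in this subsection can be summarised in a small state-machine sketch. The Python fragment below only illustrates the three scenarios and the light-coded states; the alphabet subset, the semantic check and all class and method names are our own assumptions, not part of the CADDY software.

```python
# Sketch of the diver-robot communication protocol described above: three
# vehicle states signalled by lights (IDLE/green, BUSY/orange, FAILURE/red)
# and the three error scenarios.  Alphabet contents, the semantic check and
# all names are illustrative assumptions.
from enum import Enum

class State(Enum):
    IDLE = "green"       # finished work, waiting for orders
    BUSY = "orange"      # work in progress
    FAILURE = "red"      # failure / error state

ALPHABET = {"A", "Y", "T", "M", "B", "ok", "no", "\u2200"}   # placeholder subset of Sigma

class Protocol:
    def __init__(self, is_semantically_valid):
        self.state = State.IDLE
        self.buffer = []                       # gestures of the command being issued
        self.log = []                          # every error and action is logged
        self.is_semantically_valid = is_semantically_valid

    def on_gesture(self, symbol):
        # Scenario 1: gesture not recognised as part of the alphabet ->
        # error signal, the diver repeats the gesture, the sequence is NOT aborted.
        if symbol not in ALPHABET:
            self.log.append(("unknown_gesture", symbol))
            return "error: repeat last gesture"
        self.buffer.append(symbol)
        if symbol != "\u2200":                 # wait for the end-of-command delimiter
            return "ok: waiting for next gesture"
        command, self.buffer = self.buffer, []
        # Scenario 2: gestures recognised but the message is not semantically
        # valid -> a different error signal, the whole sequence is aborted.
        if not self.is_semantically_valid(command):
            self.log.append(("invalid_command", command))
            return "error: repeat the whole command"
        # Scenario 3 is handled one level up: the robot repeats the command
        # back and starts it only after the diver confirms with "ok".
        self.log.append(("accepted", command))
        self.state = State.BUSY                # a real implementation would also drive the lights
        return "confirm? " + " ".join(command)

if __name__ == "__main__":
    p = Protocol(is_semantically_valid=lambda cmd: cmd[0] == "A")  # toy semantic check
    for g in ["A", "Y", "T", "M", "B", "\u2200"]:
        print(p.on_gesture(g))
```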

D. Language definition

The developed diver-robot language is defined as a set of strings of finite length constructed over a finite alphabet Σ and consequently can be defined as a formal language. A formal language can be described by a formal grammar [28], [29]. A formal grammar is a quadruple <Σ, N, P, S> consisting of:

- a finite set Σ of terminal symbols (disjoint from N), the alphabet of the language, that are assembled to make up the sentences in the language;
- a finite set N of non-terminal symbols (or syntactic categories, or variables), each of which represents some collection of subphrases of the sentences;
- a finite set P of productions (or rules) that describe how each non-terminal is defined in terms of terminal symbols and non-terminals. The choice of non-terminals determines the phrases of the language to which we ascribe meaning. Each production has the form A → β, where A is a non-terminal and β is a string of symbols from the infinite set of strings over (Σ ∪ N);
- a distinguished non-terminal S, the start symbol, that specifies the principal category being defined, for example sentence, program or mission.

We can formally define the language LG generated by grammar G as the set of strings composed of terminal symbols that can be derived from the start symbol S:

LG = { w | w ∈ Σ* and S →* w }   (1)

The creation of an alphabet, or better the choice of its symbols, is irrelevant as long as the symbols that compose it can be mapped to the available gestures and, in turn, it is possible to give these gestures an unambiguous meaning (unambiguous semantics). In this article the signs of the alphabet Σ are the letters of the Latin alphabet mixed with some words, Greek letters, math symbols and natural numbers (also used as subscripts), defined as follows:

Σ = {A, B, C, D, ..., Z, ?, const, limit, check, ..., 1, 2, ...}   (2)

The grammar that we have built is a context-free grammar by definition, because on the left-hand side of the productions we find only one non-terminal symbol and no terminal symbols [30], [31]. In addition, with the current productions the resulting language is an infinite language, given that the first production (i.e. S) uses recursion: this can also be seen from the dependency graph of the non-terminal symbols (the graph contains a cycle).

E. Syntax

Syntax has been given through BNF productions as follows:

S ::= A α S | ∀
α ::= agent m-action object place | ă feedback p-action problem | set-variable | feedback | interrupt | work | ∅ |
agent ::= I | Y | W
m-action ::= T | C | D | F | G direction num
direction ::= forward | back | left | right | up | down
object ::= agent | Λ
place ::= B | P | H | Λ
problem ::= E | C1 | B3 | Pg | A1 | K | V | Λ
p-action ::= H1 | B2 | D1 | Λ
feedback ::= ok | no | U | Λ
set-variable ::= S quantity | L level | P | L1 quantity | A1 quantity
quantity ::= + | −
level ::= const | limit | free
interrupt ::= Y feedback D
work ::= Te area | Te place | Fo area | Fo place | wait num | check | feedback carry | for num works end | Λ
works ::= work works | Λ
area ::= num num | num
num ::= digit num | Ψ
digit ::= 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 0

By applying the syntax to the identified messages and commands we obtain the translation table (Table II).
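Before moving to the translation table, the sketch below shows how a fragment of the BNF above can be checked mechanically: it is a recursive-descent recogniser restricted to the problem branch of α, for a single command. The tokenisation into space-separated symbols is our assumption and is used only for the example.

```python
# Recursive-descent recogniser for a small subset of the CADDIAN productions
# given above (only the "ă feedback p-action problem" branch of alpha).
# Token spellings follow the translation table; the tokenisation into
# space-separated symbols is an assumption made for this sketch.

PROBLEM  = {"E", "C1", "B3", "Pg", "A1", "K", "V"}
P_ACTION = {"H1", "B2", "D1"}
FEEDBACK = {"ok", "no", "U"}

def parse_problem_message(tokens):
    """Accepts single commands of the form:  A  ă  [feedback]  [p-action]  problem  ∀"""
    it = iter(tokens)
    def nxt():
        return next(it, None)
    if nxt() != "A":                 # every command starts with A (S ::= A alpha S)
        return False
    if nxt() != "\u0103":            # the problem marker, written here as 'ă'
        return False
    tok = nxt()
    if tok in FEEDBACK:              # optional feedback (may be empty)
        tok = nxt()
    if tok in P_ACTION:              # optional p-action (may be empty)
        tok = nxt()
    if tok not in PROBLEM:           # mandatory problem symbol
        return False
    return nxt() == "\u2200" and nxt() is None   # closing '∀', nothing after it

if __name__ == "__main__":
    print(parse_problem_message("A ă H1 E ∀".split()))   # True: "I have an ear problem"
    print(parse_problem_message("A ă X9 ∀".split()))     # False: X9 is not a problem symbol
```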

TABLE II. TRANSLATION TABLE

Problems:
  I have an ear problem: A ă H1 E ∀
  I'm out of breath: A ă B2 B3 ∀
  I'm out of air [air almost over]: A ă B2 A1 ∀
  Something is wrong [diver]: A ă H1 Pg ∀
  I depleted air: A ă D1 A1 ∀
  Something is wrong [environment]: A ă Pg ∀
  I'm cold: A ă H1 C1 ∀
  I have a cramp: A ă H1 K ∀
  I have vertigo: A ă H1 V ∀

Movement:
  Take me to the boat: A Y T M B ∀  (or A I F Y A Y C B ∀)
  You lead (I follow you): A I F Y ∀
  Take me to the point of interest: A Y T M P ∀  (or A I F Y A Y C P ∀)
  I lead (you follow me): A Y F M ∀
  Go X Y, X ∈ Direction and Y ∈ N: A Y G forward n ∀ | A Y G back n ∀ | A Y G left n ∀ | A Y G right n ∀
  Return to/come X, X ∈ Places: A Y C P ∀ | A Y C B ∀ | A Y C H ∀

Interrupt:
  Stop [interruption of action]: A Y no D ∀
  Let's go [continue previous action]: A Y ok D ∀  (or A Y D ∀)
  Abort mission: A ∅ ∀
  General evacuation: A ∀

Setting variables:
  Slow down: A S − ∀
  Accelerate: A S + ∀
  Set point of interest: A P ∀
  Level Off: A L limit ∀
  Keep this level: A L const ∀
  Free level: A L free ∀
  Give me air: A A1 + ∀
  No more air: A A1 − ∀
  Give me light: A L1 + ∀
  No more light: A L1 − ∀

Feedback:
  No (answer to repetition of the list of gestures): A no ∀
  Ok (answer to repetition of the list of gestures): A ok ∀
  I don't understand (repeat please): A U ∀

Works:
  Wait n minutes, n ∈ N: A wait n ∀
  Tessellation X * Y area, X, Y ∈ N: A Te n m ∀  (or A Te n ∀ [square])
  Tessellation of point of interest/boat/here: A Te P ∀
  Tell me what you're doing: A check ∀
  Photograph of X * Y area, X, Y ∈ N: A Fo n m ∀  (or A Fo n ∀ [square])
  Photograph of point of interest/boat/here: A Fo P ∀
  Carry a tool for me: A carry ∀
  Stop carrying the tool for me [release]: A no carry ∀
  Do this task or list of tasks n times, n ∈ N: A for n ... end

Direction = {ahead, back, left, right, up, down}
Places = {point of interest, boat, here}

F. Semantics

All commands/messages can be grouped into sets: each set addresses tasks which refer to a common topic. So all commands in the "Problems" section refer to the communication of problems to the robot, while the "Movement" section refers to commands that make the robot move. In the following paragraphs we explain these command sets in more detail.

Problems: all the commands inside the "Problems" set refer to troubles affecting the diver or the environment around the area of action of the mission. All the productions contain the ă symbol, which denotes that there is a problem and that the mission must be aborted: when the classifier identifies this gesture it emits an alert to the surface boat/segment and aborts the mission.

Movement: all the commands inside the "Movement" set make the robot move or tell the robot how to move (this second case refers to the "I follow you"/"You follow me" commands). Two productions ("Take me to the boat" and "Take me to the point of interest") can be replaced by other productions with a similar meaning; however, the replacement extends the length of the mission in terms of signs (for clarification see the second translation relating to these commands in Table II).

Interrupt: all the commands inside the "Interrupt" set make the robot stop doing the current task/mission. The "Stop" and "Let's go" commands are used when the diver wants to stop the robot from doing something and can currently be used only when the robot is following the diver [we assume that the robot stops if the diver approaches within a predefined range of meters]. "Abort mission" cancels the current mission, while "General evacuation" tells the robot to abort the mission and to issue any possible warning signals to the boat and to any divers around (i.e. it switches on flashing lights).

Setting variables: all the commands inside the "Setting variables" section set an internal variable inside the robot. At the moment we have eight internal variables, only seven of which can be set by the diver (the boat position is excluded); a short code sketch of how they could be updated is given after the list.

• Speed: the robot speed has discrete values. With the "+" or "−" signs the diver increases or decreases this variable by one quantum.

• Level:
  ◦ off: the robot cannot fall below this level, no matter what the diver says; the robot interrupts any action if the action would force it to break this rule (very useful in the case of archaeological sites).
  ◦ const: any following command is carried out at this level.
  ◦ free: the "Keep this level" (i.e. "const") command does not apply any more.

• Point of interest: sets the point of interest.

• Light:
  ◦ +: switches on the on-board lights.
  ◦ −: switches off the on-board lights.

• Air:
  ◦ +: switches on the on-board oxygen cylinder.
  ◦ −: switches off the on-board oxygen cylinder.
  We assume that for any need CADDY is equipped with a backup oxygen cylinder.

• Here: sets the point where CADDY stays when the mission is entrusted.

• Boat: this is the boat position; it cannot be set by the diver.

• Point of interest: sets the point where CADDY stays as a point of interest, which can be referred to inside later commands.
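The sketch below illustrates how the variable-setting commands listed above could update an internal state on the robot side; the variable names, the speed quantum and the token spellings of the commands are assumptions made only for this example.

```python
# Sketch of the internal variables that the "Setting variables" commands act on.
# Variable names, the speed quantum and the command spellings are assumptions;
# only the behaviour described in the bullets above is reproduced.

class RobotVariables:
    SPEED_QUANTUM = 1                      # discrete speed steps

    def __init__(self):
        self.speed = 0
        self.level_mode = "free"           # "free" | "const" | "limit" (level off)
        self.level_value = None            # depth attached to const/limit modes
        self.point_of_interest = None
        self.here = None
        self.boat = None                   # set by the surface segment, never by the diver
        self.lights_on = False
        self.air_on = False

    def apply(self, command, current_depth=None, position=None):
        if command == "S +":               # Accelerate
            self.speed += self.SPEED_QUANTUM
        elif command == "S -":             # Slow down
            self.speed -= self.SPEED_QUANTUM
        elif command == "L limit":         # Level Off: never go below this depth
            self.level_mode, self.level_value = "limit", current_depth
        elif command == "L const":         # Keep this level for any following action
            self.level_mode, self.level_value = "const", current_depth
        elif command == "L free":          # "Keep this level" no longer applies
            self.level_mode, self.level_value = "free", None
        elif command == "P":               # Set point of interest
            self.point_of_interest = position
        elif command == "L1 +":            # Give me light
            self.lights_on = True
        elif command == "L1 -":            # No more light
            self.lights_on = False
        elif command == "A1 +":            # Give me air (on-board oxygen cylinder)
            self.air_on = True
        elif command == "A1 -":            # No more air
            self.air_on = False
        else:
            raise ValueError(f"not a set-variable command: {command}")

if __name__ == "__main__":
    v = RobotVariables()
    v.apply("S +")
    v.apply("L const", current_depth=-12.0)
    print(v.speed, v.level_mode, v.level_value)   # 1 const -12.0
```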

Communication feedback: all the commands inside the "Feedback" set refer to the acceptance of the mission. With these commands the diver can accept a command or not (see subsection II-C). The diver can also ask the robot to repeat the command if he did not understand it (through distraction or by accident).

Works: all the commands inside the "Works" set refer to the possible tasks the robot can execute. The "Tell me what you're doing" command (i.e. mission progress) is used when the diver approaches the robot (and the robot consequently stops whatever it is doing). The "Wait X minutes" command tells the robot to float and wait X minutes, then proceed with the next command (useful when there is a sandy bottom). The "Carry a tool for me" command tells the robot to carry equipment upon the diver's request: after the equipment has been placed into the compartment, the robot waits for a physical confirmation (we assume that for this special case there is a button to press to give confirmation). The "Do this task or list of tasks n times" command allows repeating a task or a list of tasks a number of times: it is very useful combined with the "Wait" command to monitor an area or a point of interest. The "Make a photograph" and "Tessellation of an area" commands are overloaded, taking two different sets of parameters: the first is an area (described by length and width, or only length for a square area); the second is an area around a given point such as "point of interest", "here" or "boat" [the area, in this case, is defined a priori].

III. INITIAL LIST OF GESTURES

Gestures were chosen in part from those common to divers and in part from everyday ones: in fact the gestures should be feasible in the underwater environment and should be as intuitive as possible to make the language effective. However, all diving agencies and organizations around the globe teach their own diving hand signals, so that some of them vary from region to region: we chose the most famous and common ones [32], [33], [34], [35], [36]. The lists in Fig. 3 and Fig. 4 contain only static gestures. Dynamic gestures will be introduced later according to classifier performance. In fact the encoding and decoding of the gestures are assigned to the classifier: the cardinality of the alphabet, and therefore of the gesture set, depends on the classification capability. The more dimensions the classifier can discern, the more gestures/symbols of the alphabet we can have: movement, or better dynamics, can be seen as one of these dimensions, as can a more extensive use of two hands instead of only one (see for example the "Get with your buddy" signal in [33]). As can be seen in Fig. 4, some of these gestures are not assigned yet. In most cases the gestures have been chosen so as to match a natural/instinctive meaning. In other cases, the memorization of the meaning of a gesture is made possible through the association with objects pertaining to the action to be performed, as in the "Take a photograph" case: the diver shows three fingers, which can be associated with the tripod used to stabilize and elevate a camera.

Fig. 3. Initial list of gestures: natural numbers

Fig. 4. Initial list of gestures: some of the chosen signs

A. Mapping gestures to syntax, syntax to semantics

As already said, a bijective mapping function translates from the domain of signs to our alphabet and vice versa (Fig. 2). Accordingly, a gesture or a sequence of gestures and the corresponding characters or sequences of characters are also mapped by a semantic function that translates them into commands/messages. An example of the translation into CADDIAN of the message "I have an ear problem" can be seen in Fig. 5.

Fig. 5. Example of command expressed in CADDIAN
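The same example can be followed in code: a hypothetical classifier output (gesture labels chosen only for illustration) is first mapped to symbols of Σ and then, through one entry of the translation table, to the corresponding message, following the chain of Fig. 2.

```python
# End-to-end sketch of the mapping chain of Fig. 2 for the message of Fig. 5,
# "I have an ear problem" (CADDIAN: A ă H1 E ∀).  The gesture labels on the
# left are hypothetical classifier outputs; the symbol sequence and the final
# message come from the translation table.

GESTURE_TO_SYMBOL = {            # gesture domain -> alphabet Sigma (placeholder labels)
    "g_start": "A",
    "g_problem": "\u0103",       # 'ă', the problem / abort-mission marker
    "g_pain": "H1",
    "g_ear": "E",
    "g_end": "\u2200",           # '∀', end of message
}

SYMBOLS_TO_MESSAGE = {           # syntax -> semantics (one entry of Table II)
    ("A", "\u0103", "H1", "E", "\u2200"): "I have an ear problem",
}

def translate(gestures):
    symbols = tuple(GESTURE_TO_SYMBOL[g] for g in gestures)
    return symbols, SYMBOLS_TO_MESSAGE.get(symbols, "unknown message")

if __name__ == "__main__":
    syms, msg = translate(["g_start", "g_problem", "g_pain", "g_ear", "g_end"])
    print(" ".join(syms))   # A ă H1 E ∀
    print(msg)              # I have an ear problem
```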

IV. CONCLUSIONS

A gesture-based human-robot interaction language, called CADDIAN, has been described: CADDIAN is made up of a syntax, a semantics and an initial list of gestures. A written transcription of the gestures is also given: for better readability, signs have been mapped to easily writable symbols. A communication protocol has also been defined. There is still much work to do, however: the next steps are the introduction of dynamic gestures, the completion of the mapping of all written signs to gestures, and the gesture recognition task. Afterwards, the acceptance of the language by divers will be evaluated and, through the study of their feedback, CADDIAN might be changed accordingly.

ACKNOWLEDGMENT

The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement n. 611373.

REFERENCES

[1] J.-H. Cui, J. Kong, M. Gerla, and S. Zhou, "The challenges of building mobile underwater wireless networks for aquatic applications," IEEE Network, vol. 20, no. 3, pp. 12–18, 2006.
[2] D. Kilfoyle and A. Baggeroer, "The state of the art in underwater acoustic telemetry," IEEE Journal of Oceanic Engineering, vol. 25, no. 1, pp. 4–27, 2000.
[3] J. Neasham and O. Hinton, "Underwater acoustic communications - how far have we progressed and what challenges remain?" in Proc. of the 7th European Conference on Underwater Acoustics, Delft, The Netherlands, 2004.
[4] P. Garg, N. Aggarwal, and S. Sofat, "Vision based hand gesture recognition," World Academy of Science, Engineering and Technology, vol. 49, no. 1, pp. 972–977, 2009.
[5] K. Manchanda and B. Bing, "Advanced mouse pointer control using trajectory-based gesture recognition," in Proceedings of IEEE SoutheastCon 2010. IEEE, 2010, pp. 412–415.
[6] A. A. Argyros and M. I. Lourakis, "Vision-based interpretation of hand gestures for remote control of a computer mouse," in Computer Vision in Human-Computer Interaction. Springer, 2006, pp. 40–51.
[7] L. Roverelli and N. Zereik, "Movements classification on the Laban's theory of effort time axis," in Sonic Synesthesia - 19th Colloquium on Music Informatics, Trieste, 2012, pp. 46–50.
[8] C. Manresa, J. Varona, R. Mas, and F. Perales, "Hand tracking and gesture recognition for human-computer interaction," Electronic Letters on Computer Vision and Image Analysis, vol. 5, no. 3, pp. 96–104, 2005.
[9] E. A. Suma, B. Lange, A. Rizzo, D. M. Krum, and M. Bolas, "FAAST: The flexible action and articulated skeleton toolkit," in Virtual Reality Conference (VR), 2011 IEEE. IEEE, 2011, pp. 247–248.
[10] K. Biswas and S. K. Basu, "Gesture recognition using Microsoft Kinect," in Automation, Robotics and Applications (ICARA), 2011 5th International Conference on. IEEE, 2011, pp. 100–103.
[11] Z. Ren, J. Meng, and J. Yuan, "Depth camera based hand gesture recognition and its applications in human-computer-interaction," in Information, Communications and Signal Processing (ICICS), 2011 8th International Conference on. IEEE, 2011, pp. 1–5.
[12] Z. Ren, J. Meng, J. Yuan, and Z. Zhang, "Robust hand gesture recognition with Kinect sensor," in Proceedings of the 19th ACM International Conference on Multimedia. ACM, 2011, pp. 759–760.
[13] M. B. Holte, T. B. Moeslund, and P. Fihl, "View-invariant gesture recognition using 3D optical flow and harmonic motion context," Computer Vision and Image Understanding, vol. 114, no. 12, pp. 1353–1361, 2010.
[14] M. Van den Bergh and L. Van Gool, "Combining RGB and ToF cameras for real-time 3D hand gesture interaction," in Applications of Computer Vision (WACV), 2011 IEEE Workshop on. IEEE, 2011, pp. 66–72.
[15] Z. Ren, J. Yuan, and Z. Zhang, "Robust hand gesture recognition based on finger-earth mover's distance with a commodity depth camera," in Proceedings of the 19th ACM International Conference on Multimedia. ACM, 2011, pp. 1093–1096.
[16] J. Shukla and A. Dwivedi, "A method for hand gesture recognition," in Communication Systems and Network Technologies (CSNT), 2014 Fourth International Conference on. IEEE, 2014, pp. 919–923.
[17] E. Ohn-Bar and M. M. Trivedi, "Hand gesture recognition in real time for automotive interfaces: A multimodal vision-based approach and evaluations," IEEE Transactions on Intelligent Transportation Systems, 2014.
[18] J. Suarez and R. R. Murphy, "Hand gesture recognition with depth images: A review," in RO-MAN, 2012 IEEE. IEEE, 2012, pp. 411–417.
[19] S. U. Lee and I. Cohen, "3D hand reconstruction from a monocular view," in International Conference on Pattern Recognition, vol. 3, 2004, pp. 310–313.
[20] G. Dewaele, F. Devernay, and R. P. Horaud, "Hand motion from 3D point trajectories and a smooth surface model," in 8th European Conference on Computer Vision (ECCV 2004), ser. Lecture Notes in Computer Science, T. Pajdla and J. Matas, Eds., vol. 3021. Prague, Czech Republic: Springer, May 2004, pp. 495–507.
[21] E. Ueda, Y. Matsumoto, M. Imai, and T. Ogasawara, "A hand-pose estimation for vision-based human interfaces," IEEE Transactions on Industrial Electronics, vol. 50, no. 4, pp. 676–684, 2003.
[22] Y. Zhang, J. Zhang, and Y. Luo, "A novel intelligent wheelchair control system based on hand gesture recognition," in Complex Medical Engineering (CME), 2011 IEEE/ICME International Conference on. IEEE, 2011, pp. 334–339.
[23] Q. Chen, N. D. Georganas, and E. M. Petriu, "Real-time vision-based hand gesture recognition using Haar-like features," in Instrumentation and Measurement Technology Conference Proceedings (IMTC 2007). IEEE, 2007, pp. 1–6.
[24] S. B. Wang, A. Quattoni, L. Morency, D. Demirdjian, and T. Darrell, "Hidden conditional random fields for gesture recognition," in Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, vol. 2. IEEE, 2006, pp. 1521–1527.
[25] N. H. Dardas and N. D. Georganas, "Real-time hand gesture detection and recognition using bag-of-features and support vector machine techniques," IEEE Transactions on Instrumentation and Measurement, vol. 60, no. 11, pp. 3592–3607, 2011.
[26] K. Arora, S. Suri, D. Arora, and V. Pandey, "Gesture recognition using artificial neural network," International Journal of Computer Sciences and Engineering, vol. 2, 2014.
[27] S. Mitra and T. Acharya, "Gesture recognition: A survey," IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 37, no. 3, pp. 311–324, 2007.
[28] N. Chomsky, "Three models for the description of language," IRE Transactions on Information Theory, vol. 2, no. 3, pp. 113–124, 1956.
[29] J. W. Backus, "The syntax and semantics of the proposed international algebraic language of the Zurich ACM-GAMM conference," in Proc. of the International Conference on Information Processing. Paris, France: UNESCO, 1959.
[30] J. E. Hopcroft, R. Motwani, and J. D. Ullman, Introduction to Automata Theory, Languages, and Computation, 2nd ed. Addison-Wesley, 2001, ch. Context-Free Grammars and Languages, pp. 169–217.
[31] D. Jurafsky and J. H. Martin, Speech and Language Processing. Prentice Hall, 2014, ch. Context-Free Grammars, pp. 395–435.
[32] Confédération Mondiale des Activités Subaquatiques, "Segni convenzionali CMAS," online PDF, accessed 25 November 2014. [Online]. Available: http://www.cmas.ch/downloads/it-Codici%20di%20comunicazione%20CMAS.pdf
[33] Recreational Scuba Training Council, "Common hand signals for recreational scuba diving," online PDF, accessed 25 November 2014. [Online]. Available: http://www.neadc.org/CommonHandSignalsforScubaDiving.pdf
[34] Scuba diving fan club, "Most common diving signals," HTML page, accessed 25 November 2014. [Online]. Available: http://www.scubadivingfanclub.com/Diving_Signals.html
[35] fordivers.com, "Diving signs you need to know," HTML page, accessed 25 November 2014. [Online]. Available: http://www.fordivers.com/en/blog/2013/09/12/senales-de-buceo-que-tienes-que-conocer/
[36] www.dive-links.com, "Diving hand signals," HTML page, accessed 25 November 2014. [Online]. Available: http://www.dive-links.com/en/uwzeichen.php