Humanoid Robot Control by Body Gestures and Speech
S. Noman Mahmood, Noman Ahmed Siddiqui, M. T. Qadri, S. Muhammad Ali, M. Awais
Electronic Engineering Department, Sir Syed University of Engineering and Technology, Karachi, Pakistan
[email protected],
[email protected],
[email protected],
[email protected],
[email protected]

Abstract— In this paper, an application of humanoid robot control by means of body gestures and speech interaction is proposed. The speech and gesture based interface allows more natural, user-friendly interaction and can broaden the usability of robots. Unlike most existing systems, which use high-end equipment and expensive sensors to recognize body gestures and speech, it uses inexpensive and easily available hardware. The software used for programming is open source and cross-platform. The proposed gesture recognition method uses a Kinect sensor to capture depth image information of the scene. This data is then combined with tracking algorithms to recognize different body gestures. Custom body gestures are defined based on the relative positions of the body joints. To take control of the robot, the person has to stand in a certain pose so that the software can track the user's skeleton. The user can then control the robot simply by giving arm directions or by performing simple body gestures. In addition to gesture recognition, the robot is also able to recognize speech. The speech recognition feature provides easy and efficient natural interaction by allowing the user to control the robot through voice commands. The speech recognition system requires little or no training and makes use of grammar files to match the spoken words. Using a speech synthesizer, the robot is also able to speak and can chat with users.
Keywords: Gesture recognition, gesture control, humanoid robot, speech interaction, Kinect sensor
I. INTRODUCTION

In recent years, gesture recognition has gained considerable interest [1], and recent developments in this field have provided more natural and easier ways of communicating and interacting with machines. Gesture recognition technology has the potential to enhance the way users interact with machines and to provide an easy, fast, efficient and more user-friendly environment. In addition to gesture based man-machine interaction, speech recognition capability allows users to communicate with machines naturally using spoken words. This work allows the user's body to give commands to the system through different body movements and simple gestures such as finger pointing or raising the right arm. The proposed work does not require any special equipment, such as the glove used in [2], or other devices attached to the body to sense the movements. Instead, it is based entirely on image processing techniques. The camera captures full body movements, which are then processed to detect different gestures. This data can then be used to control devices or applications. The work mainly focuses on building a smart humanoid robot capable of interacting with humans in a more natural way and of understanding hand and body gestures. The robot is also able to recognize voice commands and exhibits an effective talking behavior. In simple words, it can see, hear and also talk. The gesture recognition system works in four basic steps. The first step is to detect the human body in image frames. The camera used in the work is a low-cost 3D depth sensor, the Kinect [3]. It uses a structured-light 3D scanning technique to provide depth information for each pixel.
The raw depth image data is processed further to detect the human body in the scene. An open-source framework, OpenNI [4], is used for processing the depth images. The second step is to extract joint vectors from the detected body shapes. The technique requires the person to stand in a certain calibration pose so that the joints of the user's body can be detected. The third step involves defining gestures in software based on the joint position information; custom gestures are defined according to our own needs. The fourth step compares the body movements with the programmed custom-defined gestures. If the body pose matches a defined condition, a command is sent to the microcontroller for controlling the servo motors. A voice recognition system is also introduced, which makes use of grammar files defining the strings that can be recognized during the recognition process. The speech command is converted to text, which is then used to control the robot movement. The talking feature of the robot involves a speech synthesis process: text is converted to voice output, which enhances the robot's usability and results in more efficient communication with users. The next section explains the complete system model of the work.
II. SYSTEM MODEL

The proposed system is divided into four stages. The first stage is the Kinect sensor and microphone interfacing to the BeagleBoard-XM [5]. The second stage is the processing of image sequences inside the BeagleBoard-XM. The third stage is the processing of audio data inside the BeagleBoard-XM. The fourth stage is the AVR microcontroller based control system. The complete block diagram of the proposed model is illustrated in figure 1.
Fig 1. Block diagram of the proposed model: the Kinect sensor and microphone feed the BeagleBoard-XM, which performs the processing of image sequences and audio data and drives the AVR microcontroller based control system.
A. Interfacing Kinect Sensor and Microphone with BeagleBoard-XM

The first stage is the Kinect sensor and microphone interfacing with the BeagleBoard-XM. As the BeagleBoard-XM is an open-source Linux-based board, the required driver installation packages must support Linux. The PrimeSense company [6] has released an open-source framework, OpenNI (Open Natural Interaction), for interfacing depth sensor devices like the Kinect. It allows developers to easily access the Kinect data streams on any platform. The OpenNI package is installed, along with the "SensorKinect" driver [7]. For hand/body tracking capabilities, the NITE (Natural Interface Technology for End-User) [8] package is also required to develop the desired system. For microphone interfacing, the open-source ALSA sound drivers are installed. The next section describes the procedure to process image data inside the BeagleBoard.
B. Processing of Image Sequences inside BeagleBoard-XM

The second stage comprises processing the Kinect data streams inside the BeagleBoard-XM. The whole programming is done in "Processing 1.5.1", available online at [9], which is an open-source, Java-based programming language and IDE. The implementation of the gesture recognition process in Processing 1.5.1 is discussed below.
Fig 2. Gesture recognition process in Processing 1.5.1: the OpenNI framework and NITE middleware are accessed through the SimpleOpenNI library; body shapes are detected in the depth image, the user is skeletonized, the body joint vectors are tracked, the angles between joints are calculated, gestures are defined based on joint positions and angles, and serial data is sent out to the AVR microcontroller.
The first step in the gesture recognition process is to access the Kinect image data streams through the OpenNI framework. The Kinect sensor generates a depth image with a precision of 11 bits, or 2048 levels [10]. The depth image consists of levels of grey: parts of the scene closer to the sensor appear lighter, while parts farther away appear darker. To develop the gesture recognition algorithm more efficiently, we need more information than just a raw depth image. NITE is a high-level middleware which provides algorithms for effective hand and body joint tracking. We therefore used the SimpleOpenNI library [11], an OpenNI and NITE wrapper for Processing 1.5.1. Since OpenNI and NITE are implemented in C++, the SimpleOpenNI library exposes their functions and methods to the Java environment. When the Processing sketch starts running, OpenNI does not track anybody. As a user enters the scene, OpenNI detects the user's presence and then watches for the calibration pose. The person has to stand in this calibration pose to allow the programming framework to track the user's limbs. When the calibration process is successful, the user's body joint data becomes available and the programming window shows tracking symbols on the user's body. The next step is to skeletonize the user. The position of each joint is stored in a vector containing the x, y and z coordinates of the tracked joint. A line is then drawn between two joint vectors to form a limb, and in the same way the whole body skeleton is drawn.
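To make this tracking flow concrete, the following is a minimal Processing sketch of the step described above. It is only a sketch under the assumption of the OpenNI-era SimpleOpenNI API that shipped for Processing 1.5.1 (callback names and the enableUser profile constant differ between SimpleOpenNI releases), and it draws only one limb for brevity.

import SimpleOpenNI.*;

SimpleOpenNI context;

void setup() {
  size(640, 480);
  context = new SimpleOpenNI(this);
  context.enableDepth();                                  // raw 11-bit depth stream
  context.enableUser(SimpleOpenNI.SKEL_PROFILE_ALL);      // user detection and skeleton tracking
}

void draw() {
  context.update();
  image(context.depthImage(), 0, 0);                      // grey-level depth image

  int userId = 1;                                         // first detected user
  if (context.isTrackingSkeleton(userId)) {
    // Joint positions arrive as 3D vectors (x, y, z) in real-world coordinates
    PVector hand = new PVector();
    PVector elbow = new PVector();
    context.getJointPositionSkeleton(userId, SimpleOpenNI.SKEL_RIGHT_HAND, hand);
    context.getJointPositionSkeleton(userId, SimpleOpenNI.SKEL_RIGHT_ELBOW, elbow);

    // Project to 2D screen coordinates and draw the limb
    PVector hand2d = new PVector();
    PVector elbow2d = new PVector();
    context.convertRealWorldToProjective(hand, hand2d);
    context.convertRealWorldToProjective(elbow, elbow2d);
    stroke(255, 0, 0);
    line(elbow2d.x, elbow2d.y, hand2d.x, hand2d.y);
  }
}

// OpenNI callbacks: wait for the calibration ("Psi") pose, then start skeleton tracking
void onNewUser(int userId) {
  context.startPoseDetection("Psi", userId);
}

void onStartPose(String pose, int userId) {
  context.stopPoseDetection(userId);
  context.requestCalibrationSkeleton(userId, true);
}

void onEndCalibration(int userId, boolean successful) {
  if (successful) {
    context.startTrackingSkeleton(userId);
  } else {
    context.startPoseDetection("Psi", userId);
  }
}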
Fig 3. Calibration pose and skeleton tracking

The information about the positions of the joints is constantly stored in vectors. The programming algorithm includes a series of functions that use the skeleton to calculate the angles between the joints.
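As an illustration of this angle calculation, the fragment below computes the right elbow angle from three joint vectors with Processing's built-in PVector.angleBetween(); the joint constants follow the SimpleOpenNI naming assumed in the previous sketch, and rightElbowAngle is a name introduced here only for illustration.

// Returns the right elbow angle in degrees, computed from the shoulder, elbow and hand joints.
// Assumes `context` is the SimpleOpenNI instance from the sketch above.
float rightElbowAngle(int userId) {
  PVector shoulder = new PVector();
  PVector elbow = new PVector();
  PVector hand = new PVector();
  context.getJointPositionSkeleton(userId, SimpleOpenNI.SKEL_RIGHT_SHOULDER, shoulder);
  context.getJointPositionSkeleton(userId, SimpleOpenNI.SKEL_RIGHT_ELBOW, elbow);
  context.getJointPositionSkeleton(userId, SimpleOpenNI.SKEL_RIGHT_HAND, hand);

  // Vectors along the upper arm and forearm, both rooted at the elbow
  PVector upperArm = PVector.sub(shoulder, elbow);
  PVector foreArm  = PVector.sub(hand, elbow);

  // angleBetween() returns radians; convert to degrees for the servo mapping
  return degrees(PVector.angleBetween(upperArm, foreArm));
}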
Fig 4. Angles between joints

The custom body gestures are defined in the program on the basis of the positions and angles of the joints. The joint data is constantly checked and compared to detect a specific gesture. If a body pose matches a defined condition, a serial command is sent to the microcontroller, which triggers the movement of the motors inside the robot. For example, consider two joints, the right hand joint vector and the neck joint vector. If the y coordinate of the right hand joint rises above the y coordinate of the neck joint, the "Right Arm Up" gesture function is triggered and the serial command "RAU" (Right Arm Up) is sent to the AVR. In the same way, many custom gestures, such as right arm out, right arm in, tilt body left, tilt body right and walk forward, are defined in the program and integrated with different tasks. The angles between the joints are also sent to the AVR, which directly controls the angles of the servos of the robot.
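A minimal sketch of this gesture check is given below. The serial port name, baud rate and the getJointY()/checkRightArmUp() helpers are assumptions introduced for illustration; only the "RAU" command string comes from the text.

import processing.serial.*;

Serial avrPort;   // assumed serial link to the AVR, e.g. "/dev/ttyUSB0" at 9600 baud

void setupSerial() {
  avrPort = new Serial(this, "/dev/ttyUSB0", 9600);
}

// Reads the y coordinate of a tracked joint (hypothetical helper).
float getJointY(int userId, int jointId) {
  PVector joint = new PVector();
  context.getJointPositionSkeleton(userId, jointId, joint);
  return joint.y;
}

// Compares the right hand and neck joints and triggers the "Right Arm Up" gesture.
void checkRightArmUp(int userId) {
  float handY = getJointY(userId, SimpleOpenNI.SKEL_RIGHT_HAND);
  float neckY = getJointY(userId, SimpleOpenNI.SKEL_NECK);
  if (handY > neckY) {
    avrPort.write("RAU\n");   // serial command handled by the AVR's switch-case
  }
}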
C. Processing of Audio Data inside BeagleBoard-XM

The third stage deals with the processing of audio data inside the BeagleBoard. This is done with the Voce library, available at [12], an open-source Java-based speech interaction library that provides speech recognition and synthesis methods. In Voce, speech recognition is handled by CMU Sphinx4 [13] and speech synthesis by FreeTTS (Free Text To Speech), available at [14]; the architecture of the Voce library is illustrated in figure 5. The incoming audio data is continuously listened to and processed. When the audio samples of a spoken phrase match a word in the grammar file, the word is placed into a string which is then used to trigger functions. For example, if we say "right arm up", the serial command "RAU" is sent to the AVR, which handles the joint motor movements. In the same way, many voice commands are integrated with different tasks. The talking ability of the robot involves a speech synthesis process, handled by FreeTTS: text is converted into speech output, and a number of voices with different pitches are supported. Using this feature, the robot is able to tell the time, day and weather, tell jokes and chat with users.
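The listening loop can be sketched as follows using the SpeechInterface methods documented for Voce [15]; the library path, grammar directory, grammar name and the polling loop structure are assumptions introduced for illustration.

// Minimal Voce usage: recognize commands from a grammar file and speak a reply.
// The library path ("./lib") and grammar directory/name ("./grammar", "robot") are assumed.
public class SpeechDemo {
  public static void main(String[] args) {
    // init(vocePath, initSynthesis, initRecognition, grammarPath, grammarName)
    voce.SpeechInterface.init("./lib", true, true, "./grammar", "robot");
    voce.SpeechInterface.synthesize("robot ready");

    boolean running = true;
    while (running) {
      // Poll the recognizer queue for completed utterances
      while (voce.SpeechInterface.getRecognizerQueueSize() > 0) {
        String s = voce.SpeechInterface.popRecognizedString();
        System.out.println("heard: " + s);
        if (s.contains("right arm up")) {
          // send the "RAU" command over the serial link to the AVR here
        } else if (s.contains("stop")) {
          running = false;
        }
      }
      try { Thread.sleep(200); } catch (InterruptedException e) { running = false; }
    }
    voce.SpeechInterface.destroy();
  }
}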
Fig 5. Architecture of the Voce library, taken from [15]

The following table shows some of the speech commands mapped to the robot functions.

Table 1. Speech commands mapped to robot functions
Speech command           Robot function
Follow my movements      Initiate body control mode
Stop                     Trigger stop mode
What day is it           Tell the day
What time is it          Tell the current time
How is the weather       Tell the current weather

The next section describes the microcontroller based control system that drives the servo motors.
D. AVR Microcontroller Based Control System

The fourth stage involves controlling the robot joint motors with the AVR microcontroller. As the joint angle and gesture data are available to the AVR, it is straightforward to write switch-case statements based on the serial data acquired from the BeagleBoard. The communication between the BeagleBoard and the microcontroller is achieved using the RS-232 serial communication protocol. The gesture data and angles are translated into servo positions, and pulses are sent to the joint motors of the robot, which results in the movement of the robot limbs. When the user is fully calibrated and tracked, the software looks for the speech command to switch between operating modes. When the user says "follow my movements", a serial command with joint angle information is sent to the microcontroller, which in turn drives the servo motors; the robot then copies all body movements of the user. When the user says "stop", the microcontroller stops the servo motors. When the user performs a specific gesture, the microcontroller turns the servos according to the corresponding condition. The following table lists some of the custom defined gestures with the corresponding robot actions; a sketch of the host-side mode switching follows the table.

Table 2. Custom defined gestures mapped to robot actions
Body gesture                        Robot action
Right arm extended towards right    Turn right 90°
Left arm extended towards left      Turn left 90°
Right arm forward                   Walk forward two steps
Left arm forward                    Walk backward two steps
Right arm up                        High five pose and say Hi!
Left arm up                         Go to listen mode
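A sketch of this mode switching on the Processing side is given below. The "ANG:"/"STP" command strings and the helper names are a hypothetical serial protocol introduced for illustration; only the spoken phrases and the overall behaviour come from the text.

// Host-side mode switching (Processing/Java sketch fragment).
// avrPort, context and rightElbowAngle() are assumed from the earlier sketches;
// the "ANG:" / "STP" strings are an illustrative protocol, not the paper's own format.
boolean bodyControlMode = false;

void handleSpeechCommand(String command) {
  if (command.contains("follow my movements")) {
    bodyControlMode = true;            // start streaming joint angles to the AVR
  } else if (command.contains("stop")) {
    bodyControlMode = false;
    avrPort.write("STP\n");            // AVR stops the servo motors
  }
}

void streamJointAngles(int userId) {
  if (!bodyControlMode || !context.isTrackingSkeleton(userId)) {
    return;
  }
  // One angle shown; the full system streams every tracked joint in the same way.
  int elbow = constrain(round(rightElbowAngle(userId)), 0, 180);
  avrPort.write("ANG:RE:" + elbow + "\n");   // AVR maps the angle to a servo pulse width
}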
The system is developed and implemented successfully with many types of gestures and speech commands. Known issues are poor depth image generation in outdoor sunlight, and the delay before the voice synthesizer produces its first audio output, which can be overcome by synthesizing a silent null string during initialization. The next section concludes the paper with some future enhancements.

III. CONCLUSION

The proposed system has shown the implementation of humanoid robot control through gestures and speech. Many systems in the robotics field use high-end equipment and expensive devices to perform speech and gesture recognition. This project provides a strong basis for effective gesture tracking using a low-cost 3D depth sensor, which has the potential to replace existing high-cost imaging systems. The speech recognition system does not need any DSP kits or advanced audio systems; it uses simple yet effective methods to match the spoken words. All code is written in an open-source, cross-platform programming language, which allows the system to run on any platform without having to worry about the operating system. To deal with the required computational load, the BeagleBoard-XM is used, which provides laptop-like performance. As the technology develops, we will have smarter robots which more closely resemble humans. Autonomous navigation and 3D mapping features of the robot are subjects of future research.

REFERENCES

1. D. Weinland, R. Ronfard, and E. Boyer, "A survey of vision-based methods for action representation, segmentation and recognition", Computer Vision and Image Understanding, vol. 115, no. 2, pp. 224-241, 2011.
2. D. J. Sturman and D. Zeltzer, "A survey of glove-based input", IEEE Computer Graphics and Applications, vol. 14, no. 1, pp. 30-39, 1994.
3. Kinect sensor description. Available at http://www.xbox.com/en-US/kinect
4. OpenNI framework description. Available at http://www.openni.org/
5. Official BeagleBoard website. Available at http://beagleboard.org/Products/BeagleBoard-xM
6. PrimeSense website. Available at http://www.primesense.com/
7. GitHub repository of SensorKinect. Available at https://github.com/avin2/SensorKinect
8. NITE middleware support. Available at http://www.primesense.com/solutions/nite-middleware/
9. Processing 1.5.1, Java-based open-source programming software. Available at http://processing.org/
10. J. L. Raheja, A. Chaudhary, and K. Singal, "Tracking of fingertips and centers of palm using Kinect", Third International Conference on Computational Intelligence, Modelling and Simulation (CIMSiM), vol. 1, pp. 248-252, 2011.
11. SimpleOpenNI, OpenNI and NITE wrapper for Processing 1.5.1. Available at https://code.google.com/p/simple-openni/
12. Voce, open-source Java-based speech interaction library. Available at http://voce.sourceforge.net/
13. CMU Sphinx, open-source speech recognition project. Available at http://cmusphinx.sourceforge.net/
14. FreeTTS, Java-based speech synthesizer. Available at http://freetts.sourceforge.net
15. T. Streeter, "Open Source Speech Interaction with the Voce Library", Virtual Reality Applications Center, Iowa State University, Ames, US. White paper available at http://voce.sourceforge.net/files/VoceWhitePaper.pdf