Affordable Gesture Recognition Based Avatar Control System: A Puppeteering System for the Huggable
By Jun Ki Lee
Thesis Proposal for the degree of Master of Science
at the Massachusetts Institute of Technology
October 26, 2007
Thesis Advisor:
__________________________________________ Dr. Cynthia L. Breazeal LG Group Career Development Professor Associate Professor of Media Arts and Sciences MIT Media Laboratory
Thesis Reader:
__________________________________________ Dr. Rosalind W. Picard Co-Director of Things That Think Professor of Media Arts and Sciences MIT Media Laboratory
Thesis Reader:
__________________________________________ Dr. Joseph Paradiso Sony Corporation Career Development Professor of Media Arts and Sciences Associate Professor of Media Arts and Sciences MIT Media Laboratory
ABSTRACT
We are developing a novel interface for controlling the behavior of physical (e.g., a personal robot) or virtual (e.g., an animated agent in Second Life) avatars. As the morphologies of these avatars become more sophisticated, it becomes more difficult to convey the remote human's communicative intent compellingly and effectively while mitigating cognitive load. Puppeteering devices such as motion-capture systems can control all joints of a robot but are too expensive for personal use; gamepads are affordable but are often unintuitive and difficult to learn and master. We are developing an intuitive and affordable device to control personal robots such as the Huggable, as well as sophisticated avatars in virtual worlds such as Second Life. The new puppeteering device controls an avatar by capturing a human operator's motion directly through IR vision-based tracking as well as wearable sensors such as low-cost, 6-axis inertial measurement units. This multi-modal, real-time data can be used to recognize the intent of the operator's movements and to evoke compelling animations or sound effects.
TABLE OF CONTENTS
ABSTRACT
INTRODUCTION
RELATED WORK
PROPOSED APPROACH
EVALUATION
TIMELINE
DELIVERABLES
RESOURCES REQUIRED
REFERENCES
INTRODUCTION
In the past, robots were found only in factories; now, personal robots are everywhere. WowWee's Robosapien series is one of the most popular Christmas gifts for children. Many people buy robotic vacuum cleaners such as the Roomba to clean their floors. Though Sony has stopped selling the AIBO, it created a sensation when it was first released as a robot pet for the home. Hobby humanoid robots such as the KHR-1 are popular tools for high school and university students who want to study robotics, and contests such as Robo-One are held for them. Personal robots have come closer to our lives than ever before.

However, such robots are controlled through hard-to-learn interfaces, mostly based on buttons. The interface is either a joystick controller or software on a personal computer that displays what the robot sees alongside many buttons that trigger movements. Robots are morphologically complicated, which makes them hard to control through button-based interfaces. The number of animations (continuous sequences of movements) a robot can play is normally larger than the number of buttons on a joystick, so a user sometimes needs to memorize dozens of key combinations to trigger a particular animation.

The same is true for virtual avatars. Virtual environments have long been considered an alternative to physical presence because of their convenience. As communication infrastructure and Internet speeds improve, interest in virtual environments where people control avatars to create their own virtual lives is also growing. Second Life by Linden Lab is gaining popularity and exploring possibilities in both business and education: colleges are holding classes in Second Life, and companies are using virtual environments as test-beds for pilot products. Another direction is using such environments to create a new type of workspace; Sun is developing a virtual workspace called MPK20 as an example of a future workspace and a virtual conferencing environment (Yankelovich, 2007). In these environments, driving or controlling an avatar effectively is also essential. Because such an avatar has nearly the same morphology as a human, it can play a wide variety of motions through the different parts of its body. In the virtual environment, typing a keyword into a chat box triggers body-gesture animations, so it is difficult to evoke natural body movements without remembering all the keywords for a rich set of animations.

In this thesis, I will raise and address the problems of developing an interface that can control both a physical robot avatar and a virtual avatar from a remote place more effectively and intuitively. To achieve this goal, I will try to find the most effective way to capture the user's intention in order to drive an avatar. Later in this proposal, three pilot prototypes are described. The prototypes aim to mimic the user's body and facial gestures; users may wear sensor devices on both hands and the head, and the system may capture the user's gestures through one or more cameras. Moreover, the device must be affordable enough to be usable in people's homes. Therefore, achieving the same expressive power with as few devices and technologies as possible is the core goal.
To test the interface with a physical robot, I will use the Huggable, which is being developed by the Personal Robots Group at the MIT Media Lab (Stiehl, et al., 2006). The Huggable is a robotic teddy bear with 8 degrees of freedom (DOFs), one inertial measurement unit (IMU), and touch, temperature, and proximity sensors all over its skin. The eight DOFs include two for each arm, three for the neck, and one shared by the ears and eyebrows. The Huggable is designed for use in children's hospitals, elder-care facilities, schools, and people's homes, with a specific focus on healthcare, education, and family communication. The new interface will play an essential role in applying the Huggable to these cases.

To test the interface within a virtual environment, I will use Second Life. Most users in Second Life control human-like avatars and communicate by typing in a chat box or via their own voice with the help of a voice-tunneling feature. To control the body and facial gestures of an avatar, a button-based panel interface called the Heads-Up Display (HUD) is often used; users can also type keywords to trigger certain gestures. The current client plays one gesture at a time, where one gesture is a combination of a body gesture, built-in facial animations, and hand-posture animations. As a result, an avatar cannot move its upper torso and its lower body independently. Linden Lab has initiated efforts to improve the client's capacity for animating avatars (Puppeteering, 2007), and these limitations may be lifted once its plan for better puppeteering support is launched. Considering that the avatar's morphology is much more complex than that of the Huggable, the challenge in designing the interface will be how to capture the subtle gestural changes of a user when he or she intends to interact in the virtual environment through the avatar. The table below describes the differences between the Huggable and the Second Life environment and avatar.

Table 1: Comparison between the two environments
The Huggable: first-person point of view (1P POV), in which the user sees through the robot; simplified morphology (8 DOFs); arm gestures and neck gestures (upper torso only); physical interaction, including interaction with physical objects, interaction through touch, and voice tunneling.

A Virtual Avatar in Second Life: third-person point of view (3P POV), in which the user mostly watches the scene from above; human-like morphology (63 DOFs, with separate built-in animations for the eyes, eyebrows, and hands); all gestures, encompassing the full body; virtual interaction, including interaction with virtual objects, enhanced situation awareness (full awareness of the surroundings), and voice tunneling.
From a human to the Huggable: expressive power decreases through the interface. The morphology of the Huggable makes the robot less expressive than a human, and the human operator needs to act like a bear.

From a human to a Second Life avatar: expressive power decreases less than in the case of the Huggable. Second Life avatars are mostly human-like, so users do not need to change their own characteristics to control an avatar (though they can, as in the case of the Huggable).
As the table above shows, the two environments differ in several respects, and these differences will influence design decisions in building a generic puppeteering interface for both systems. Nevertheless, the two share common procedures for handling inputs from the user and outputs to the avatars.
RELATED WORK
Related work can be divided into three subcategories: embodied conversational agents, telexistence and artificial reality, and facial-expression and gesture recognition.

We can see avatars in Second Life and the Huggable as semi-autonomous agents, designed to communicate with humans and to interact like a human by sensing the environment and choosing appropriate actions in an anthropomorphic way. In that sense, an autonomous agent that is anthropomorphic in appearance and has expressive power similar to these avatars is almost identical to a robot except for its lack of physical existence. Such agents are called embodied conversational agents (ECAs) (Cassell, 2000). The work of Cassell and her colleagues includes understanding the expressive power of humans and finding ways to implement human communicative abilities in an agent. Her system displays an anthropomorphic figure on a screen using three-dimensional computer graphics together with various multi-modal interfaces. Her work on embodied conversational agents covers the verbal and nonverbal parts of communication as well as applications that place an agent in pedagogical situations with children.

The most important components of face-to-face human communication are spoken language and nonverbal language. Spoken language can be transmitted via the voice-tunneling feature in both the Huggable and Second Life environments; nonverbal language is more of an issue. Nonverbal language comprises body and facial gestures. Cassell categorized gestures into three categories: emblems, propositional gestures, and spontaneous gestures, and further subcategorized spontaneous gestures into iconic, metaphoric, deictic, and beat gestures (Cassell, 2000). This categorization helps clarify which types of gestures do and do not need to be recognized through an interface. Her work also includes animated pedagogical agents. In developing such an agent, it is necessary to put the agent in a specific situation and to script a sequence of interactions and the agent's corresponding behavior. Tartaro's work on children with autism is a good example of applying a system in a specific environment, and it suggests the possibility of applying the Huggable to pedagogical situations with children with autism (Tartaro & Cassell, 2006).

To recognize gestures by machine, a mathematical model is needed. Since gesture data can be considered a stream of sequential data, hidden Markov models (HMMs) fit this case well (Dietterich, 2002). Bobick and his group did relevant research (Bobick & Ivanov, 1998). However, as the number of input dimensions increases, HMMs may not work well; to address this problem, Yin worked on combining two different methods into one (Yin, Essa, & Rehg, 2004).

Blumberg's synthetic characters are another example related to embodied conversational agents. Blumberg's synthetic characters in '(void*)' were partially controlled by Benbasat's inertial measurement unit (IMU) sensors placed inside two buns, each with a fork fixed on top (Benbasat & Paradiso, 2002). Testers held the forks to move the buns around. The buns with forks implicitly represented the legs of the characters, and gestures expressed through the device were used to make an on-screen avatar mimic the recognized patterns of movement captured through the interface (Blumberg, 1999).
Blumberg and Benbasat's work is analogous to the proposed puppeteering interface in the sense that the interface makes avatars mimic implicit or explicit gestures captured through the device. Compared to Blumberg and Benbasat's system, Wren's Dyna is closer to a direct puppeteering interface (Wren & Pentland, 1999). That system first tracked a person's face and then, using the color information of the detected face, tracked the locations of the hands. It mapped three-dimensional position information retrieved from a set of stereo cameras onto a model of an avatar.

In the field of robotics, researchers have worked for many years to realize telexistence through the medium of robots. Goza's and Sakamoto's work on robotic telexistence are primary examples. NASA's Robonaut used VR helmet displays, Polhemus sensors for tracking body position and motion, and Virtual Technologies, Inc. (VTI) CyberGlove devices for tracking finger motion (Goza, Ambrose, Diftler, & Spain, 2004). Positions and joint angles measured by the Polhemus sensors were mapped directly onto the robot's DOFs. For the fingers, a data glove was used, and to decrease the discrepancy between the data glove and the robot's real hands, an inverse kinematics (IK) technique was adopted. Sakamoto used infrared reflective markers captured by infrared Vicon cameras to locate a human's facial feature points and mapped these points onto the actual robot's face (Sakamoto, Kanda, Ono, Ishiguro, & Hagita, 2007).
PROPOSED APPROACH
Two prototypes of the interface have been proposed and tested so far; the descriptions below summarize these systems. Both were first tested on the Huggable platform.

Gesture Recognition

Figure 1: Gesture Recognition Diagram (Wii Remote controller and Nunchuk → pre-processing of accelerometer outputs → gesture recognition via the forward algorithm on the trained HMMs → triggering of the classified animations)
A puppeteering system using only two Wii remote controllers was proposed first and tested on the Huggable. Since the Huggable itself is designed to be deployed in ordinary people's homes, its puppeteering system also needs to be inexpensive. We first explored the possibility of using just two 3-axis accelerometers to trigger the various animations that the Huggable can play. Wii Remote controllers were intriguing because each contains a 3-axis accelerometer and supports Bluetooth. For the actual experiment, one Wii Remote and one Nunchuk were used; they were connected by a wire and used a single Bluetooth connection to a computer. The device sent a total of six acceleration values: three orthogonal directions for each of the two controllers.

Figure 2: A person using the Wii Remote to control the Huggable

The data stream was recorded and used to train six continuous hidden Markov models (HMMs), one for each gesture we aimed to recognize. The six gestures were arms up, arms forward, bye-bye (one arm up and waving), crying (wiggling both hands in front of the eyes), ear flickering (wiggling both hands above the head), and clapping (with both arms forward, repeatedly moving both hands left and right in opposite directions). Data was gathered from eight MIT undergraduate and graduate students working in the Personal Robots Group at the MIT Media Lab. Each repeated one session of the six gestures seven times. Participants were given a page of instructions with figures showing the sequence of each gesture and also watched a demonstrator perform the gestures. Even though they were told to follow the gestures as taught by both instruction and demonstration, participants did not follow exactly what the demonstrator performed. Each recording was tagged with the gesture type and repetition number, and the participant used a button on the bottom of the Wii controller to tell the recording program when to start and stop recording each performance of a gesture.

The eight data sets, each with seven sessions of six gestures, were used to train six continuous HMMs separately, one per gesture. The Baum-Welch expectation-maximization (EM) algorithm learned the parameters of each HMM from the given data (Rabiner, 1989), using Kevin Murphy's Hidden Markov Model (HMM) Toolbox for MATLAB (Murphy, 2005). To use the learned parameters at runtime, the forward algorithm was re-implemented in C, based on Kevin Murphy's Toolbox (Murphy, 2005), and used for real-time classification. The forward algorithm calculates the likelihood of the given data under each of the six HMMs and chooses the most plausible hypothesis (gesture). The classified result is sent to the Huggable's behavior control system to trigger the corresponding animation.

Although most gestures were classified correctly across different human users, "arms forward" and "arms up" were sometimes confused, and there were too many false positives for "ear flickering". We concluded that continuous HMMs alone do not increase the number of categories that can be classified reliably given only this data: based on accelerometer data alone, it is hard to distinguish where different gestures were performed. To solve this problem, we proposed using a camera to supply location information.
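To make the classification step concrete, the following is a minimal sketch of forward-algorithm scoring across several trained Gaussian-emission HMMs. It is written in Python rather than the MATLAB/C implementation described above, and the parameter names and shapes are illustrative assumptions, not the thesis's actual data structures.

```python
# A minimal sketch (assumed names/shapes, not the thesis's C/MATLAB code) of
# classifying a gesture by scoring the observation sequence under each trained
# Gaussian-emission HMM with the forward algorithm and picking the best model.
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal


def log_forward(obs, log_pi, log_A, means, covs):
    """Log-space forward algorithm: returns log P(obs | model).
    obs: (T, 6) accelerometer frames; log_pi: (K,); log_A: (K, K);
    means: (K, 6); covs: (K, 6, 6)."""
    T, K = obs.shape[0], log_pi.shape[0]
    # Per-state log emission probability of every frame.
    log_b = np.stack([multivariate_normal.logpdf(obs, means[k], covs[k])
                      for k in range(K)], axis=1)            # shape (T, K)
    alpha = log_pi + log_b[0]                                 # initialization
    for t in range(1, T):
        # alpha_t(j) = log sum_i exp(alpha_{t-1}(i) + log A(i, j)) + log b_j(o_t)
        alpha = log_b[t] + logsumexp(alpha[:, None] + log_A, axis=0)
    return logsumexp(alpha)


def classify(obs, models):
    """models: dict mapping gesture name -> (log_pi, log_A, means, covs)."""
    scores = {name: log_forward(obs, *params) for name, params in models.items()}
    return max(scores, key=scores.get)                        # most likely gesture
```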
Direct Puppeteering

Figure 3: Diagram for the Direct Puppeteering System (baseball cap with reflective markers, infrared webcam, and Wii Remote controllers with reflective markers → blob extraction and tracking (OpenCV) → extraction of angles for the neck and shoulder joints)
The idea of using a camera to provide location information grew into the development of a direct puppeteering device for the Huggable. The locations of the hands can be used directly to provide joint angles for the shoulders. One goal of the first proposed system was to offer an intuitive interface in which the robot mirrors its user's gestures; all of the animations resembled gestures that human participants performed. If extracted joint angles can directly drive the shoulder joints, mirrored robot movement through the interface becomes possible. In particular, since the Huggable cannot bend its elbows, it is natural to control only the shoulder joint angles (shoulder rotate, shoulder up/down). For direct puppeteering, it was decided to control both shoulders and the head. To track head movement, a baseball cap with four reflective spherical markers arranged in a tetrahedron was used to extract the three neck joint angles: yaw, roll, and pitch. The Pose from Orthography and Scaling with Iterations (POSIT) algorithm was used to compute the pose from this data (DeMenthon & Davis, 1995). To track hand movements in this second prototype, two Wii Remote controllers with reflective tape on their fronts were used; their accelerometers, however, were not used for direct puppeteering.

Most current marker-free body-tracking systems are not robust to noise and changes in lighting conditions, and people's homes are more vulnerable to such changes than research labs. To avoid this problem, an infrared camera and infrared-reflective markers were chosen instead; images of infrared-reflective markers are nearly noiseless and stable. For the infrared camera, a Logitech QuickCam Pro 5000 was modified to accept infrared light: the lens with its infrared filter was replaced with a 2mm wide-angle lens, and a visible-light-blocking filter made from exposed color film was attached in front of the lens. Instead of high-resolution stereo cameras, a regular webcam with 640x480 resolution was selected; the actual resolution used in the software was 320x240 to minimize processing latency.

Figure 4: A Person Controlling the Huggable via the Direct Puppeteering Interface
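As a concrete illustration of the blob-extraction step, the following is a rough Python/OpenCV sketch rather than the thesis implementation; the threshold and blob-area bounds are placeholder values that would need tuning for the actual camera.

```python
# A minimal sketch, assuming a grayscale frame from the IR-modified webcam:
# find bright reflective-marker blobs by thresholding and take contour centroids.
import cv2


def marker_centroids(frame_gray, thresh=200, min_area=4, max_area=400):
    """Return (x, y) centroids of bright reflective-marker blobs in a grayscale frame."""
    _, binary = cv2.threshold(frame_gray, thresh, 255, cv2.THRESH_BINARY)
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    centroids = []
    for c in contours:
        area = cv2.contourArea(c)
        if min_area <= area <= max_area:          # reject pixel noise and large reflections
            m = cv2.moments(c)
            if m["m00"] > 0:
                centroids.append((m["m10"] / m["m00"], m["m01"] / m["m00"]))
    return centroids
```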
Figure 5: Equipment for the Interface
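The head-pose step can be sketched in the same spirit. The snippet below estimates neck yaw, pitch, and roll from the four cap markers; it substitutes OpenCV's solvePnP for the original POSIT implementation, and the marker geometry and camera intrinsics are made-up placeholder values rather than measurements from the actual cap and camera.

```python
# A rough sketch only: recover neck yaw/pitch/roll from the four reflective
# markers on the cap. The thesis used POSIT (DeMenthon & Davis, 1995); here
# cv2.solvePnP with the P3P solver is used as a stand-in. The marker layout and
# camera intrinsics below are invented placeholders, not measured values.
import cv2
import numpy as np

# Assumed 3-D marker positions (cm) on the cap, in a head-fixed frame,
# laid out roughly as a tetrahedron as described in the text.
MODEL_POINTS = np.array([[0.0, 0.0, 0.0],
                         [6.0, 0.0, 0.0],
                         [3.0, 5.0, 0.0],
                         [3.0, 2.0, 4.0]], dtype=np.float64)

# Placeholder pinhole intrinsics for the 320x240 processing resolution.
CAMERA_MATRIX = np.array([[300.0,   0.0, 160.0],
                          [  0.0, 300.0, 120.0],
                          [  0.0,   0.0,   1.0]])
DIST_COEFFS = np.zeros(5)          # assume distortion is negligible / calibrated out


def head_angles(image_points):
    """image_points: (4, 2) pixel centroids of the tracked marker blobs, in the
    same order as MODEL_POINTS. Returns (yaw, pitch, roll) in degrees, or None."""
    pts = np.asarray(image_points, dtype=np.float64).reshape(-1, 1, 2)
    ok, rvec, _tvec = cv2.solvePnP(MODEL_POINTS, pts, CAMERA_MATRIX, DIST_COEFFS,
                                   flags=cv2.SOLVEPNP_P3P)   # P3P: exactly 4 points
    if not ok:
        return None
    R, _ = cv2.Rodrigues(rvec)                               # rotation vector -> 3x3 matrix
    # Euler angles from R, using the Z-Y-X (yaw-pitch-roll) convention.
    yaw = np.degrees(np.arctan2(R[1, 0], R[0, 0]))
    pitch = np.degrees(np.arctan2(-R[2, 0], np.hypot(R[2, 1], R[2, 2])))
    roll = np.degrees(np.arctan2(R[2, 1], R[2, 2]))
    return yaw, pitch, roll
```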
Joint angles were sent to the behavior control system of the Huggable: the three-dimensional avatar of the robot moved accordingly, and the corresponding motor potentiometer positions were streamed to the motor control software. The motor control software drove the hardware arms of the Huggable V3.0; the left and right shoulders, with 2 DOFs on each side, were built as a prototype for the Huggable V3.0.

The limitation of the direct puppeteering system is that it cannot control all of the robot's degrees of freedom (DOFs); in the system above, only the neck and shoulder joints were controllable. To improve on the second prototype, the gesture recognition system of the first prototype was adopted. Parts that cannot be controlled by direct puppeteering, namely the eyebrows and ears, are controlled by triggering actions (animations), and even sound effects can be triggered. For example, if a user wants to point at something and also wiggle the robot's eyebrows, recognizing the pointing gesture makes it possible to trigger a wiggling animation at the same time; the robot may even play sounds expressing curiosity.
However, HMMs are slow, and there is latency in detecting or recognizing intended gestures. A user will not notice much latency in the case above if he or she holds the arm long enough to wait for the action to be triggered, but there are other situations in which a user wants to initiate actions immediately. In particular, when a user wants to express subtle social gestures, the movements produced by the direct puppeteering interface alone are sometimes not sufficient to convey all the subtleties. Recognizing, or making decisions on, the early signs of a motion can solve this: if we can recognize the lifting of both hands as it begins, we can trigger the both-arms-lifting animation immediately. If the intended gesture turns out to be different, the animation can be blended back into the direct-puppeteering joint angles or into other animations. The current behavior and character animation system uses a linear blending algorithm to blend between animations; however, it does not guarantee a smooth transition from one animation to another, so other algorithms should be considered. Early motion signs can be detected by setting simple rules, for example, treating an upward change of more than 30 pixels in both hand locations as a sudden lifting motion. Combining such rules with the results of the long-term gesture classifier will require adopting a Bayesian inference system for the animation-triggering decision process. The next diagram shows how this Bayesian decision-making system can intervene in the puppeteering process.
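As an illustration of the rule-plus-blending idea, here is a small Python sketch; the 30-pixel threshold comes from the text, while the window size, blend time, and all names are assumptions rather than decisions that have actually been made for the system.

```python
# An illustrative sketch only: an early-motion rule (both hand blobs rising more
# than ~30 pixels over a short window) fires a tentative "arms up" animation, and
# the pose sent to the motors is a linear blend between the direct-puppeteering
# joint angles and the animation pose. WINDOW and BLEND_TIME are placeholders.
from collections import deque

LIFT_THRESHOLD_PX = 30    # vertical displacement that counts as a sudden lift
WINDOW = 5                # number of recent frames to compare against
BLEND_TIME = 0.5          # seconds to fade from direct control into the animation


class EarlyMotionTrigger:
    def __init__(self):
        self.history = deque(maxlen=WINDOW)   # recent (left_y, right_y) blob heights

    def update(self, left_y, right_y):
        """Return True when both hand blobs have risen sharply.
        Image y grows downward, so a rising hand means y decreases."""
        fired = False
        if len(self.history) == self.history.maxlen:
            old_l, old_r = self.history[0]
            fired = (old_l - left_y > LIFT_THRESHOLD_PX and
                     old_r - right_y > LIFT_THRESHOLD_PX)
        self.history.append((left_y, right_y))
        return fired


def blend_pose(direct_angles, animation_angles, t_since_trigger):
    """Linearly blend per-joint angles from direct puppeteering into the
    triggered animation over BLEND_TIME seconds."""
    w = min(t_since_trigger / BLEND_TIME, 1.0)
    return [(1.0 - w) * d + w * a for d, a in zip(direct_angles, animation_angles)]
```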
Overall Structure for the Puppeteering System

Figure 6: Revised Gesture Recognition Setup (baseball cap with reflective markers, infrared webcam, and Wii Remote controllers with reflective markers → pre-processing of accelerometer outputs and blob extraction & tracking (OpenCV) → gesture recognition (HMM), motion detection (rule conditions), and extraction of angles for the neck and shoulder joints → overall decision-making system for driving the actual motor joints)
EVALUATION
The puppeteering interface will be applied to two different environments: the Huggable and Second Life. To evaluate the two implemented systems and compare the two environments, control groups must be set up. Two control conditions can be used: one in which users are not in the mediated setting at all but in an everyday face-to-face conversational setting, and one in which users are in the environment but use existing technologies such as a button-based puppeteering system. In face-to-face conversation, engagement in the interaction will be much richer than in the mediated cases that use the interface, but the comparison will be useful for identifying shortfalls. Comparison with existing interfaces will show whether the new puppeteering system improves the user experience. The design goals for the system and examples of questionnaire items are summarized below.

Design Goals
1. Maximize communicative competence.
2. Increase the audience's degree of comprehension.
3. Enrich physical interaction and engagement in the interaction.

Questions to Ask
1. Does the interface give users a different experience? Compare it with situations where the interface is not given: select a face-to-face conversation group as one control group, and a group using an existing interface, such as a button-based animation-triggering system, as another.
2. To the presenter:
   a. Were you able to successfully convey what you intended through the interface?
   b. If not, which part did you have the most difficulty with: arms, head, face?
   c. If not, which types of gesture were the most difficult to express?
   d. If not, which context was the most difficult to explain?
3. To the audience:
   a. Which part was least readable to you: arms (body gestures), neck (eye gaze), or other parts?
   b. How much of the presenter's intention were you able to understand? (audience's impression)
   c. Questions on the content of the presentation or the story told. (direct questions)
4. To both:
   a. Functionality:
      i. Latency:
         1. Was the latency annoying when operating the interface?
         2. How much latency can you tolerate?
      ii. Evaluation of different types of gestures:
         1. How readable were the different types of gestures?
         2. How easy was it to express the different types of gestures?
   b. Comparisons: let one user experience both systems and fill out questionnaires comparing them.
      1. Differences between physical interaction and virtual interaction.
      2. Differences between the implemented systems themselves. In the case of Second Life, users should lose less expressive power than in the case of the Huggable; is this true, or are the differences insignificant?

Situational Differences in Applying the System
To evaluate the communicative competence of the system, three possible situations can be considered. Since the interface was built to aid everyday conversation in the two environments, it is essential to test the functionality of the system in that same situation. In addition, as possible applications of the interface, both a storyteller's and a presenter's case were considered; in both cases, a storyteller or presenter is given a topic and uses the puppeteering interface to deliver the content.
Figure 7: The Huggable tells the story of a book to a child (left). A person presents at a virtual conference in Second Life (right) (Bretag, 2007).

Table 2: Evaluation setup for the presentation cases (storytelling vs. presentation), organized by content, test platform, target audience, and setup (storyteller/presenter vs. audience).
Storytelling from a big book (the storyteller is an adult): the Huggable, Second Life, and face-to-face conversation (no medium involved; control); target audience of children and adults; one-to-one setup.
Presentation on a general topic: Second Life and face-to-face conversation; target audience of adults and adolescents.
Everyday conversation inside the virtual world: Second Life; adults and adolescents.
Expressive Power and the Different Types of Gestures
Specific design decisions made in developing the interface may influence the expressive power of the system, and that influence may vary among the different types of gestures a user intends to convey. Paul Ekman and Wallace Friesen divided gestures into five categories: emblems, illustrators, affect displays, regulators, and adaptors (Ekman & Friesen, 1969). Emblems are direct replacements for words, and illustrators shape what is being said; affect displays convey emotion; regulators control the flow of conversation; and adaptors relieve self-oriented tension. In the context of presentations, gestures can also be divided into four categories: emphatic, descriptive, locative, and transitional gestures (Wyatt & Bannerman, 2007). Emphatic gestures emphasize a point in the conversation, descriptive gestures describe a specific object, locative gestures locate an object in space, and transitional gestures control the flow of the conversation and may include enumeration gestures.

These criteria for categorizing gestures can be used in the presentation and short-story scenarios. For consistency in evaluation, the presenter or storyteller can be given a script containing specific directions, including the specific timing at which to evoke a certain gesture. In the evaluation, audience members may answer questions about which gestures were used during the presentation, how many times each was used, and how readable the gestures were. Presenters may also answer questionnaire items such as how difficult it was to express different types of gestures using the interface.

Various Types of Games that a User can Play via the Interface
While users can tell a story or present their work in the virtual or physical environment, they can also play games in these environments, maximizing their physical interaction with others in ways that were not possible with existing interfaces or platforms (robots or virtual environments). In the evaluation, it will be important to measure such physical interaction quantitatively; qualitative evaluation can be added as a secondary measure.
1. Peek-a-boo: targeted at younger children or infants.
Figure 8: Peek-a-boo Arm Gesture of the Huggable
2. Aerobics: the audience follows the Huggable's rhythmic movements, which the presenter produces via the interface.
Figure 9: Various arm postures of the Huggable
3. Twenty Questions (guessing game): the presenter is given a word and explains it to the audience through the interface, and the audience has to guess the word.
4. Clapping Game: users form a circle. One user initiates a clap while making eye contact with the user on one side, and the clap goes around the circle.
Figure 10: Clapping Gestures of the Huggable
Table 3: Evaluation setup for game play via the interface.
Test platforms: the Huggable, Second Life, and face-to-face conversation (no medium involved).
Children group: Peek-a-boo, Aerobics, Twenty Questions, and the Clapping Game.
Adults group: Aerobics, Twenty Questions, and the Clapping Game.
Setup: one-to-one and many-to-many.
TIMELINE
Table 4: Timeline for the entire development and evaluation period (November 2007 through May 2008): third prototype and first design decisions, hardware development, data collection, final development, first user study, final user study, and thesis writing.
DELIVERABLES
As a result of this thesis, both the hardware and software of an avatar puppeteering system for physical and virtual avatars will be delivered. The hardware may include a portable web-cam-like camera and a few wearable inertial measurement unit (IMU) sensors for the user's hands and head; the wearables may be equipped with reflective markers. The software will show the avatar's view (either first-person or third-person, depending on the environment) and the avatar's status, and will help the user better understand the given situation and control the avatar effectively.
RESOURCES REQUIRED
The study will require a fully operable 8-DOF Huggable robot with its camera feature enabled; the skin sensors are not required for the evaluation. The first prototype platform of the Huggable with these features will be built through a separate project by the end of December 2007. For the virtual platform, the existing online virtual world, Second Life, will be used; fully implementing the necessary features in Second Life will require collaboration with Linden Lab. The COUHES committee will need to approve the use of human experimental subjects. Participants are needed for acquiring training data and for evaluating the application; twenty subjects are needed for each session, and the number of sessions required for the study will be determined later.
REFERENCES
Benbasat, A. Y., & Paradiso, J. A. (2002). An Inertial Measurement Framework for Gesture Recognition and Applications. In I. Wachsmuth & T. Sowa (Eds.), Gesture and Sign Language in Human-Computer Interaction, International Gesture Workshop, GW 2001 (pp. 9-20). Berlin: Springer-Verlag.
Blumberg, B. (1999). (void*): A Cast of Characters. In Visual Proceedings of SIGGRAPH 1999 (p. 169). Los Angeles: ACM.
Bobick, A., & Ivanov, Y. (1998). Action Recognition Using Probabilistic Parsing. International Conference on Computer Vision and Pattern Recognition. IEEE.
Bretag, R. (2007). K-12 Panel Keynote On Point. Retrieved November 2007, from Second Life Best Practices International Conference 2007: http://slic2007.blogspot.com/2007/05/k-12-panelkeynote-on-point.html
Cassell, J. (2000). Nudge Nudge Wink Wink: Elements of Face-to-Face Conversation for Embodied Conversational Agents. In J. Cassell, J. Sullivan, S. Prevost, & E. Churchill (Eds.), Embodied Conversational Agents (pp. 1-28). Cambridge, MA, USA: MIT Press.
DeMenthon, D. F., & Davis, L. S. (1995). Model-Based Object Pose in 25 Lines of Code. International Journal of Computer Vision, 15, 123-141.
Dietterich, T. (2002). Machine Learning for Sequential Data. In T. Caelli (Ed.), Lecture Notes in Computer Science (LNCS).
Ekman, P., & Friesen, W. V. (1969). The Repertoire of Nonverbal Behavior: Categories, Origins, Usage, and Coding. Semiotica, 1, 49-98.
Goza, S. M., Ambrose, R. O., Diftler, M. A., & Spain, I. M. (2004). Telepresence Control of the NASA/DARPA Robonaut on a Mobility Platform. In CHI '04: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 623-629). New York, NY, USA: ACM.
Matsui, D., Minato, T., MacDorman, K. F., & Ishiguro, H. (2005). Generating Natural Motion in an Android by Mapping Human Motion. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (pp. 1089-1096).
Murphy, K. (2005). Hidden Markov Model (HMM) Toolbox for Matlab. Retrieved 2007, from http://www.cs.ubc.ca/~murphyk/Software/HMM/hmm.html
Puppeteering. (2007). Retrieved 2007, from Second Life Wiki: http://wiki.secondlife.com/wiki/Puppeteering
Rabiner, L. R. (1989). A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, 77(2), 257-286.
Sakamoto, D., Kanda, T., Ono, T., Ishiguro, H., & Hagita, N. (2007). Android as a Telecommunication Medium with a Human-Like Presence. In HRI '07: Proceedings of the ACM/IEEE International Conference on Human-Robot Interaction (pp. 193-200). Arlington, Virginia, USA: ACM.
Stiehl, W. D., Breazeal, C., Han, K., Lieberman, J., Lalla, L., Maymin, A., et al. (2006). The Huggable: A New Type of Therapeutic Robotic Companion. ACM SIGGRAPH 2006 (p. 14). New York, NY: ACM Press.
Tartaro, A., & Cassell, J. (2006). Authorable Virtual Peers for Autism Spectrum Disorders. In Proceedings of the Workshop on Language-Enabled Educational Technology at the 17th European Conference on Artificial Intelligence (ECAI 2006).
Wren, C. R., & Pentland, A. P. (1999). Understanding Purposeful Human Motion. In Proceedings of Modelling People (MPEOPLE) (pp. 19-25). IEEE.
Wyatt, A., & Bannerman, L. (2007). Gestures in the Presentation. Retrieved November 2007, from Pathway to Tomorrow: http://www.longview.k12.wa.us/mmhs/wyatt/pathway/gest.html
Yankelovich, N. (2007). Collaborative Environments. Retrieved November 9, 2007, from Sun Microsystems: http://research.sun.com/projects/dashboard.php?id=85
Yin, P., Essa, I., & Rehg, J. M. (2004). Asymmetrically Boosted HMM for Speech Reading. In International Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 755-761). Los Alamitos, CA, USA: IEEE Computer Society.