A Framework for Non-Manual Gestures in a Synthetic Signing System

R. Elliott, J.R.W. Glauert, and J.R. Kennaway
School of Computing Sciences, University of East Anglia, Norwich, UK
1 Background

1.1 Sign Language

According to the RNID, 1 in 7 people in the UK have severe hearing loss. Among these, a much smaller number, around 1 in 1000, have been deaf from birth or have never understood speech. For these pre-lingually deaf people, their natural language is often sign. Sign language is a complete natural language, used for culturally-rich communication, with many local dialects and idioms. In the UK, British Sign Language (BSL) (BDA, 1991) is officially recognised as a minority language. BSL has its own grammar and phonetic structure, which are independent of English. Sign language users often have a relatively low reading age for written English, having never learned spoken language. BSL is their language of choice for social interaction.

In addition to authentic sign languages, such as BSL, there are also classes of sign language that are closer to spoken forms. In the UK, Sign Supported English (SSE) uses BSL signs in simplified form, presented in English word order. Finger spelling uses BSL signs for the alphabet. SSE is often used in education, but is not regarded as an acceptable alternative to BSL.
1.2 Avatars and Signing

Services are increasingly accessed from electronic sources, whether through broadcast television or over the Internet. If services for deaf people are to be provided in sign via digital media, an obvious option is to provide video. The EU
Framework 5 project WISDOM (Kyle, 2002) has explored the provision of services, based in particular around the use of a mobile terminal based on 3G phone technology. Video is effective where content is fixed, but problems arise when signing sequences need to be edited into new forms and concatenated together. High-quality blending of video is impractical unless sequences are created using the same signers under exactly the same conditions.

An alternative is to use avatars, or virtual humans. A stream of animation parameters is used to control the motion of the virtual human character to provide a 3D model of the manual and non-manual (principally facial) aspects of signing. This approach was followed by the EU project ViSiCAST (Elliott et al., 2000), which explored virtual human signing for applications in broadcast, Internet and face-to-face transactions (Cox et al., 2002; Glauert, 2002). Virtual human signing does not yet have the full quality and realism of video, but it has benefits such as potentially low-bandwidth communication, user choice of avatar appearance, lighting, and position in 3D, and the ability to blend together signing sequences generated from different sources and at different times. It can be applied when content is ephemeral or generated interactively, where video could not be provided. As an example, a daily weather forecast in SLN (Sign Language of the Netherlands) is available online to those with the necessary client software. A current EU project, eSIGN, aims to provide eGovernment content in sign in Germany, Britain, and the Netherlands.

A number of techniques can be used to generate motion data for virtual humans:
• motion capture techniques can be used to record the performance of a human signer;
• editing tools can be used to place the avatar in specific positions to provide key frames, with a full animation produced by interpolation between the key frames;
• signing gestures can be represented using a notation system from which motion data can be synthesised.
The work reported here follows the synthetic animation route. The ViSiCAST project also explored motion capture. The film and entertainment industries generally employ key frame techniques.

In order to control the animation of an avatar, the position and rotation of bones in the virtual skeleton of the avatar are specified for each frame of motion. The skin, hair, and clothing of the avatar are animated by projecting textures onto a fine mesh of polygons that encloses the skeleton. Each mesh point is attached to one or more of the bones so that the mesh is deformed realistically as the avatar skeleton is moved. The H-Anim standard, which includes specifications for skeleton representation, is being developed by the Humanoid Animation Working Group (H-Anim Working Group, 2001). MPEG-4 Body Animation provides streaming standards for body animation (Preda and Prêteux, 2002). On the one hand, our requirements are simpler, since we are not concerned with animation of the lower half of the body; on the other hand they are more complex, as the H-Anim skeleton for the hand does not have quite enough of the flexibility that we require for manual gestures.
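To make the key-frame idea concrete, the following sketch (in Python, with class and function names of our own choosing rather than anything taken from H-Anim or the systems described here) interpolates a single bone's rotation between two key-frame poses using quaternion slerp; a renderer would do this for every bone of the skeleton at every in-between frame and then deform the attached mesh accordingly.

```python
import math
from dataclasses import dataclass
from typing import Dict

@dataclass
class Quaternion:
    """Unit quaternion (w, x, y, z) representing a bone rotation."""
    w: float
    x: float
    y: float
    z: float

def slerp(q0: Quaternion, q1: Quaternion, t: float) -> Quaternion:
    """Spherical linear interpolation between two bone rotations, for 0 <= t <= 1."""
    dot = q0.w * q1.w + q0.x * q1.x + q0.y * q1.y + q0.z * q1.z
    if dot < 0.0:                 # take the shorter arc between the two rotations
        q1 = Quaternion(-q1.w, -q1.x, -q1.y, -q1.z)
        dot = -dot
    if dot > 0.9995:              # nearly identical rotations: plain lerp is fine
        w = q0.w + t * (q1.w - q0.w)
        x = q0.x + t * (q1.x - q0.x)
        y = q0.y + t * (q1.y - q0.y)
        z = q0.z + t * (q1.z - q0.z)
    else:
        theta = math.acos(dot)    # angle between the two rotations
        s0 = math.sin((1.0 - t) * theta) / math.sin(theta)
        s1 = math.sin(t * theta) / math.sin(theta)
        w = s0 * q0.w + s1 * q1.w
        x = s0 * q0.x + s1 * q1.x
        y = s0 * q0.y + s1 * q1.y
        z = s0 * q0.z + s1 * q1.z
    n = math.sqrt(w * w + x * x + y * y + z * z)   # renormalise
    return Quaternion(w / n, x / n, y / n, z / n)

def interpolate_pose(key0: Dict[str, Quaternion],
                     key1: Dict[str, Quaternion],
                     t: float) -> Dict[str, Quaternion]:
    """Blend two key-frame poses (bone name -> rotation) into an in-between frame.

    Assumes both key frames pose the same set of bones."""
    return {bone: slerp(q, key1[bone], t) for bone, q in key0.items()}
```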
The use of bones is effective for the body and arms of an avatar, but facial animation often uses different techniques. It is possible to animate lips and eyebrows using fictitious “bones”, but complex gestures are hard to reproduce in this way. A successful alternative approach is to use morph targets, or morphs for short, each representing the change from the neutral facial mesh needed to produce some facial gesture. By applying and removing the full change in stages, over several frames, the avatar performs the gesture in a natural fashion. An exaggerated gesture can be achieved by applying more than the normal change for the target gesture. MPEG-4 provides a framework for face animation parameters that enables facial animation similar to morphs. A promising alternative being explored at UEA is to take a statistical approach to identifying a comprehensive set of morphs that can be combined to produce the full range of observed facial expressions. This work has focussed on lip movements for speech, providing realistic animation not only of individual phonemes but also realistic co-articulation for complete words and sentences (Theobald et al., 2003).
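A morph target can be pictured as a set of vertex displacements added to the neutral face mesh. The sketch below (illustrative only, with hypothetical morph names, not tied to MPEG-4 or to any particular avatar format) shows how several morphs are blended by a weighted sum, with weights above 1.0 producing the exaggerated gestures mentioned above.

```python
from typing import Dict, List, Tuple

Vertex = Tuple[float, float, float]

def apply_morphs(neutral: List[Vertex],
                 morphs: Dict[str, Dict[int, Vertex]],
                 weights: Dict[str, float]) -> List[Vertex]:
    """Displace a neutral face mesh by a weighted sum of morph targets.

    Each morph stores displacements only for the vertices it affects
    (vertex index -> (dx, dy, dz)); a weight of 1.0 applies the full
    gesture, and a weight above 1.0 exaggerates it."""
    result = [list(v) for v in neutral]
    for name, weight in weights.items():
        for index, (dx, dy, dz) in morphs.get(name, {}).items():
            result[index][0] += weight * dx
            result[index][1] += weight * dy
            result[index][2] += weight * dz
    return [tuple(v) for v in result]

# Example (hypothetical morph names): raise the brows fully, half-close the eyelids.
# frame_mesh = apply_morphs(neutral_mesh, avatar_morphs,
#                           {"brow_raise": 1.0, "lid_close": 0.5})
```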
1.3 SiGML Overview

SiGML (Elliott et al., 2001) – Signing Gesture Markup Language – is an XML application language defined in the ViSiCAST project for the purpose of specifying signing sequences. The major component of SiGML allows sign language gestures to be defined at the phonetic level, based on the model of signing phonetics embodied in HamNoSys (Prillwitz et al., 1989), the long-established notation system for sign language transcription developed by our partners in the ViSiCAST project at the University of Hamburg. This component of SiGML – “gestural SiGML” – is based on HamNoSys version 4 (Hanke et al., 2000), defined in the early stages of ViSiCAST. The definition of manual signing was well-developed in HamNoSys 3, and so in the area of manual signing HamNoSys 4 makes only minor modifications to its predecessor. However, HamNoSys 3 treated non-manual signing only in a very constrained fashion, and the most prominent development in HamNoSys 4 was the introduction of a much more systematic treatment of non-manual signing.

SiGML has been developed from the outset with a view to supporting synthetic animation, and for this reason manual SiGML provides a slightly more regular, mechanistic view of manual sign articulation than does HamNoSys, while supporting the same semantic model. Non-manual HamNoSys, on the other hand, has a much simpler structure than manual HamNoSys, and so at this stage non-manual SiGML is essentially a presentation as XML of non-manual HamNoSys. We describe non-manual SiGML in Section 2.
1.4 SiGMLSigning Software Architecture

Architecturally, SiGMLSigning, the synthetic signing system developed for the eSIGN project, can naturally be viewed as a pipeline of processing stages: each stage receives a stream of signing data from the previous stage, transforms it in some way, and forwards the transformed data stream to the following stage. The pipeline is shown in Figure 1.
Figure 1. Processing Pipeline for Synthetic Signing Animation
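As a toy illustration of this pipelined organisation (our own simplification: the stage boundaries mirror Figure 1, but the function names, data formats, and the one-second-per-sign timing are invented for the example), each stage can be written as a Python generator that consumes the stream produced by the stage before it.

```python
from typing import Iterable, Iterator

# Hypothetical frame record: a time-stamp plus the data needed to pose the avatar.
Frame = dict

def parse_signs(sigml_text: str) -> Iterator[str]:
    """First stage: split the input into individual sign descriptions.

    Here we just split on blank lines; a real front end would parse the SiGML XML."""
    for chunk in sigml_text.split("\n\n"):
        if chunk.strip():
            yield chunk.strip()

def animate(signs: Iterable[str], fps: int = 25) -> Iterator[Frame]:
    """Middle stage (the role Animgen plays): expand each sign into a sequence
    of time-stamped frame definitions."""
    t = 0.0
    for sign in signs:
        for _ in range(fps):                 # pretend every sign lasts one second
            yield {"time": t, "pose": f"pose for {sign!r}"}
            t += 1.0 / fps

def render(frames: Iterable[Frame]) -> None:
    """Final stage: place the avatar in each pose at the specified time."""
    for frame in frames:
        print(f"{frame['time']:6.3f}s  {frame['pose']}")

if __name__ == "__main__":
    render(animate(parse_signs("SIGN_MUG\n\nSIGN_TAKE\n\nSIGN_I")))
```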
The input to the rendering software is an animation stream, that is, a sequence of animation frame definitions. Each frame definition contains data describing a static pose of the avatar, together with a time-stamp specifying when the avatar should be placed in that pose. By placing (and displaying) the avatar in the specified sequence of poses at the specified times, the rendering software produces the required signing animation.

In the present context, the central component in the processing pipeline of Figure 1 is the most important: this is the animation engine, called Animgen, whose function is to transform a sequence of SiGML signs into the corresponding sequence of animation frame definitions. As explained above, each frame definition describes a static pose of the signing avatar. The definition of this pose resolves into two components:

• the configuration of the bones in the avatar’s virtual skeleton; since the avatar’s surface mesh is notionally bound to this virtual skeleton, this in turn determines the configuration of that mesh;
• the configuration of the avatar’s face: this can be specified as a weighted sum of facial morph targets, as described later in Section 3.
The first of these is a large array of numerical data defining the position and orientation of each of the avatar’s bones. As explained in Section 3, the second component is represented simply as an array of morph weights. In fact, these two components are not mutually exclusive: as mentioned above, some avatar definitions endow the avatar with (anatomically inauthentic) “facial bones” which are capable of moving parts of the avatar’s face, such as the eyebrows.
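The shape of an animation frame can thus be summarised by a structure along the following lines (a sketch under our own naming assumptions; it is not the actual data format exchanged between Animgen and the renderer):

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

@dataclass
class BoneState:
    """Position and orientation of one bone in the avatar's virtual skeleton."""
    position: Tuple[float, float, float]
    rotation: Tuple[float, float, float, float]   # unit quaternion (w, x, y, z)

@dataclass
class FrameDefinition:
    """One entry in the animation stream: a static pose plus when to adopt it."""
    timestamp: float                                               # seconds from the start
    bones: Dict[str, BoneState] = field(default_factory=dict)      # skeletal component
    morph_weights: Dict[str, float] = field(default_factory=dict)  # facial component
```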
2 Non-Manuals in SiGML

As indicated in the previous section, the definition of non-manual signing features in SiGML is based on the corresponding definitions for HamNoSys 4. Thus non-manual features are assigned to a hierarchy of independent tiers, corresponding to distinct articulators, according to the following structure:

• shoulder movements;
• body movements;
• head movements;
• eye gaze;
• facial expression: Eye-Brows, Eye-Lids, and Nose;
• mouthing: Mouth Gestures and Mouth Pictures.
It should be understood that here “facial expression” covers only those expressive uses of the face which form part of the linguistic performance, that is, those which play a phonetic rôle in the given context; this excludes those facial expressions whose rôle is to convey the signer’s attitude or emotions towards what is being articulated linguistically.

The distinction between mouth gestures and mouth pictures in the taxonomy above also deserves some explanation. A mouth picture is a motion of the mouth which is assumed to be derived from spoken language, in which the lips and tongue feature as visible articulators. A mouth gesture, on the other hand, is not related to spoken language; it involves use of the cheeks and teeth as visible articulators, and may possibly involve movement of the jaw from side to side. Mouth gestures may have quite complicated internal structure, which HamNoSys 4 and SiGML make no attempt to describe: each mouth gesture is in effect an unanalysed whole, the permissible repertoire of such gestures being determined by the individual sign languages under consideration. Mouth pictures and mouth gestures are also distinguished in their timing characteristics: a mouth picture has its own inherent duration, whereas the duration of a mouth gesture may be stretched to match that of the sign of which it is a constituent.

In defining individual non-manual movements a distinction is made between static movements and dynamic movements. In the former, it is the static pose resulting from the motion, rather than any transition between poses, which conveys meaning. Typically, such a pose is held for longer than a static pose within the manual component of a sign. In a dynamic movement, on the other hand, it is the transition between poses (which typically is quite rapid), or a sequence of such transitions, which conveys the meaning. A facial expression is usually represented statically, or as a sequence of states. For other non-manual tiers both kinds of non-manual feature occur.

Synchronization of activity on the different tiers is specified at the level of the sign: in SiGML, as in HamNoSys 4, a sign may include activities on each of the non-manual tiers identified above, in addition to activity on the manual tier. Normally the non-manual activity thus specified is co-temporal with the manual activity of the sign, but it is possible to attach hysteresis markers to activity on any non-manual tier, indicating that the start (or end) of that activity should slightly precede
(or follow) that of the activity defined on the manual tier. These hysteresis markers are not currently supported in the synthetic animation system described. The notation allows a sequence of non-manual actions to be specified on a given tier within a single sign, and where it makes physiological sense, an individual action in this sequence may consist of two actions in parallel; in practice, however, these sequential and parallel forms occur only very infrequently.

We summarise below the elementary movements permitted on each non-manual tier (a sketch of how this tier structure might be represented in code follows the list):

Shoulders: For either or both shoulders: shrugging, and placing in a raised or hunched-forward position.

Body: Tilting, rotating, abnormal straightening or bending of the back, heaving of the chest, sighing.

Head: Turning, tilting, nodding, shaking, movements aligned to eye gaze.

Eye Gaze: Rolling; directing up, down, left, right, or distantly; directing at the signer’s own hand(s); directing at the addressee.

Face/Brows: Raising either or both, furrowing both.

Face/Eyelids: Making either or both eyes wide open, narrowed, closed, or shut tight; also a dynamic blink at the end of a sign.

Face/Nose: Putting the nose in a wrinkled posture or one with the nostrils widened; also a dynamic twitch.

Mouth Pictures: A mouth picture may consist of the visemes corresponding to an arbitrary phonetic (IPA) string. For convenience, this viseme string is expressed using the SAMPA (Wells, 2003) conventions for transcription of the IPA.

Mouth Gestures: As explained above, the repertoire here is a potentially open-ended list of specialized and quite complex actions. For the currently defined repertoire, a set of video clips is available for reference. A correspondingly open-ended labelling scheme is used, in which each action is identified by a tag consisting of a single letter indicating the most prominent articulator (cheek, jaw, lips, teeth, tongue) followed by an identifying numeral.
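As promised above, the following sketch shows one possible in-memory representation of a sign's non-manual tiers. The field names and the way actions are encoded as strings are our own illustration, not SiGML syntax; the example values in the comments are drawn from the repertoire just listed.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class NonManualTiers:
    """Non-manual activity of one sign: an optional action (or sequence) per tier."""
    shoulders: List[str] = field(default_factory=list)  # e.g. ["shrug_both"]
    body: List[str] = field(default_factory=list)        # e.g. ["tilt_forward"]
    head: List[str] = field(default_factory=list)        # e.g. ["nod"]
    eye_gaze: Optional[str] = None                        # e.g. "at_addressee"
    brows: Optional[str] = None                           # e.g. "raise_both"
    eyelids: Optional[str] = None                         # e.g. "narrowed"
    nose: Optional[str] = None                            # e.g. "wrinkled"
    mouth_picture: Optional[str] = None                   # SAMPA string, e.g. "bem"
    mouth_gesture: Optional[str] = None                   # articulator tag, e.g. "L09"

# A sign combining a raised-brow expression with a lip-articulated mouth gesture:
# non_manuals = NonManualTiers(brows="raise_both", mouth_gesture="L09")
```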
3 Implementation of SiGML Non-Manuals

An overview of the way in which the animation engine, Animgen, generates animations for manual SiGML can be found in (Kennaway, 2001). We describe here how non-manual SiGML is implemented.

Animgen assumes that each avatar comes with a set of facial deformations, known as morphs, which can be applied, singly or in combination, to varying degrees, in each frame of the animation. What a morph actually is, in concrete terms, is of no concern to Animgen, although typically it is a set of displacements of a subset of the vertices of the avatar’s polygon mesh. In each frame of the animation data that Animgen generates, it specifies how much of each morph is to be applied, and the avatar renderer computes and displays the overall effect.
To calculate which morphs are to be applied, each facial non-manual of SiGML is encoded as a combination of morph trajectories. A trajectory consists of a morph, the maximum amount of that morph which is to be applied, a time during which the amount of that morph is increased from zero to that maximum, a time for which it is held at that level, and a time during which it is decreased back to zero. Morph trajectories can be combined in series and in parallel to build up an arbitrarily complex definition of a facial non-manual. The mapping of each facial non-manual to a combination of morph trajectories is provided by a configuration file specific to each avatar, and read by Animgen. The creator of the avatar provides both the avatar’s morph set and the mapping of SiGML facial elements onto it. The following is an example of such a mapping definition:
This is a specification for the L09 mouth gesture, which is defined as a (rather tense and tight-lipped) mouthing of “bem”. It is mapped to a sequence of three morphs corresponding to the three phonemes, of which the first and third have been given identical visual representations. The names of the morphs for this avatar are numerical indexes, but in principle the names are arbitrary.

Timing is specified by a sequence of five values, representing respectively the attack time, attack manner, sustain time, release time, and release manner. The times can be specified numerically in seconds, or the most common values (slow, medium, fast, zero, or sustain to end of sign) can be given symbolically (s, f, m, –, or e). The symbolic tokens are mapped to times in seconds by another of Animgen’s configuration files. The “manner” components determine how the morph approaches its full value during the attack, and how it tails off during the release. The possible values for this are “t” (tense) and “l” (lax), which determine how sharp the accelerations and decelerations are. They are mapped by another configuration file to sets of parameters for a general model of such accelerations and decelerations.

The other non-manual gestural components – trunk (body), shoulder, and head movements – are animated by manipulating the bones of the avatar’s skeleton. Some of these are static postures (such as UL: left shoulder raised), while others are dynamic (such as SL: left shoulder shrugging up and down). The static postures are defined as certain amounts of rotation of the relevant bones, with the transitions into and out of the posture generated automatically. The dynamic ones are mapped to combinations of static postures and timing information, in a way similar to the mapping of the facial gestures to trajectories of avatar morphs.
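Returning to the facial trajectories: the function below is a minimal sketch of how one trajectory's contribution could be evaluated at a given time. It is our own simplification, not Animgen's actual model; in particular, the two easing curves standing in for the "tense" and "lax" manners are invented, whereas in the real system these acceleration profiles come from a configuration file.

```python
def ease(u: float, manner: str) -> float:
    """Transition shape on 0..1; "t" (tense) rises more sharply mid-transition than "l" (lax)."""
    u = max(0.0, min(1.0, u))
    if manner == "t":
        return u * u * u * (u * (6.0 * u - 15.0) + 10.0)  # steeper curve for a tense manner
    return u * u * (3.0 - 2.0 * u)                        # gentler curve for a lax manner

def trajectory_weight(t: float, peak: float,
                      attack: float, attack_manner: str,
                      sustain: float,
                      release: float, release_manner: str) -> float:
    """Amount of the morph to apply at time t (seconds from the start of the trajectory)."""
    if t < 0.0:
        return 0.0
    if t < attack:                              # ramp up towards the peak amount
        return peak * ease(t / attack, attack_manner)
    if t < attack + sustain:                    # hold at the peak
        return peak
    if t < attack + sustain + release:          # ramp back down to zero
        u = (t - attack - sustain) / release
        return peak * (1.0 - ease(u, release_manner))
    return 0.0

# e.g. a morph ramped up over 0.1s (tense), held for 0.3s, released over 0.2s (lax),
# with the times assumed to have already been resolved from symbolic tokens to seconds:
# weight = trajectory_weight(0.25, peak=1.0,
#                            attack=0.1, attack_manner="t",
#                            sustain=0.3,
#                            release=0.2, release_manner="l")
```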
4 Examples

We show an example of the framework described above in use, via the following sequence of figures. Figure 2 shows the SiGML for the BSL sequence “MUG-TAKE-I”, with SAMPA encodings for accompanying mouth pictures embedded in the appropriate tier of each sign definition, in accordance with the scheme described in Section 2 above.

Figure 2. SiGML for Sequence MUG-TAKE-I with Mouth Pictures
Figure 3 shows a sample of three frames from the animation (61 frames at 25fps) generated from the SiGML sequence shown in Figure 2.
Figure 3. VGuido: Frames 12, 42 and 52 for MUG-TAKE-I with Mouth Pictures
Figure 4 shows the same set of frames as in Figure 3, but from a viewpoint giving a closer view of the mouth.
Figure 4. VGuido: Frames 12, 42 and 52, Close-Up on Mouth
Figure 5 shows a sequence of consecutive frames: viewing a sequence like this using the animation player’s “single-step” mode allows the developer to study the generated mouthing in detail. All these figures are scaled down significantly from their normal viewing size.
Figure 5. VGuido: Consecutive Frame Sequence (Frames 50-52)
5 Conclusions

A great advantage of the framework for non-manual signing described here is that it establishes a clean separation between the “mechanism” and the “policy” for the implementation of each non-manual SiGML primitive. The interface between the two aspects is the avatar-specific configuration file, in which each entry specifies everything the animation engine needs to know in order to animate a given non-manual primitive using the morph set available for that avatar. The avatar developer is thus free to change the implementation of a non-manual primitive simply by editing its entry in the configuration file, as described in the previous section. Moreover, since morphs are known to the animation engine only by name, the range of feasible changes includes the exploitation of morphs newly introduced into the set available for the given avatar.

In practice, this is important for the development of good-quality implementations of SiGML’s non-manual primitives. This is because many of these primitives – especially the mouth gestures, as explained in Section 2 – consist of highly dynamic and elaborate behaviours, albeit mostly of short duration. For these dynamic behaviours, it is by no means easy for an avatar supplier to determine in advance what set of static poses (i.e. morphs) will be sufficient to support a realistic implementation. For efficiency reasons it is desirable for this set to be kept as small as possible. The scheme described here
allows the avatar developer to adjust the morph set and the associated mapping definitions without intervention from the implementor of the animation engine, thereby significantly facilitating the development of an implementation of the non-manual primitives that is both effective and efficient.
6 Acknowledgements

We acknowledge with thanks financial support from the European Union, and assistance from our partners in the eSIGN project, in particular Televirtual Ltd., providers of the Virtual Guido avatar.
7 References

British Deaf Association (1991) Dictionary of British Sign Language. Faber and Faber.

Cox SJ, Lincoln M, Tryggvason J, Nakisa M, Wells M, Tutt M, Abbott S (2002) TESSA, a system to aid communication with deaf people. ASSETS 2002, Fifth International ACM SIGCAPH Conference on Assistive Technologies, pp 205-212.

Elliott R, Glauert JRW, Kennaway JR, Marshall I (2000) Development of language processing support for the ViSiCAST project. ASSETS 2000, Fourth International ACM SIGCAPH Conference on Assistive Technologies.

Elliott R, Glauert JRW, Kennaway JR, Parsons KJ (2001) D5-2: SiGML Definition. ViSiCAST Project working document.

Glauert JRW (2002) ViSiCAST: Sign Language using Virtual Humans. ICAT 2002, International Conference on Assistive Technology, pp 21-22.

H-Anim Working Group (2001) H-Anim 2001 specification. http://www.h-anim.org/

Hanke T, Langer G, Metzger C, Schmaling C (2000) D5-1: Interface Definitions. ViSiCAST Project working document.

Kennaway JR (2001) Synthetic animation of deaf signing gestures. 4th International Workshop on Gesture and Sign Language in Human-Computer Interaction. Springer-Verlag, LNAI vol 2298, pp 146-157.

Kyle J (2002) WISDOM: Wireless Information Services for Deaf People on the Move. ICAT 2002, International Conference on Assistive Technology.

Prillwitz S, Leven R, Zienert H, Hanke T, Henning J, et al. (1989) Hamburg notation system for sign languages: An introductory guide. International Studies on Sign Language and the Communication of the Deaf, vol 5. Institute of German Sign Language and Communication of the Deaf, University of Hamburg.

Preda M, Prêteux F (2002) Critic Review on MPEG-4 Face and Body Animation. 2002 IEEE International Conference on Image Processing. IEEE.

Theobald B-J, Bangham A, Matthews I, Glauert JRW, Cawley CC (2003) 2.5D Visual speech synthesis using appearance models. British Machine Vision Conference. British Machine Vision Association.

Wells J (2003) SAMPA computer readable alphabet. http://www.phon.ucl.ac.uk/home/sampa/home.htm