Linköping Studies in Science and Technology Thesis No. 747

Extraction and Coding of Face Model Parameters Jörgen Ahlberg


Department of Electrical Engineering Linköpings universitet, SE-581 83 Linköping, Sweden

Linköping 1999

“Ready comprehension is often a knee-jerk response and the most dangerous form of understanding. It blinks an opaque screen over your ability to learn. Be warned. Understand nothing. All comprehension is temporary.” [Her85]

Linköping Studies in Science and Technology Thesis No. 747

Extraction and Coding of Face Model Parameters Jörgen Ahlberg

Licentiate thesis, Report No. LiU-TEK-Lic-1999:05 Image Coding Group Department of Electrical Engineering Linköpings universitet, SE-581 83 Linköping, Sweden

Linköping 1999

Extraction and Coding of Face Model Parameters

© 1999 Jörgen Ahlberg ISBN 91-7219-425-1 ISSN 0280-7971 LiU-TEK-Lic-1999:05

Printed in Sweden by UniTryck, Linköping, 1999.

ABSTRACT This thesis treats methods for model-based coding of image sequences containing human faces, typically videophone sequences. In a model-based coding scheme, a face model is adapted to the contents of the image sequence, and the parameters needed to adapt it are transmitted over the channel. During the sequence, animation parameters telling how to change the model are transmitted, instead of new images. Such parameters can be transmitted at very low bit-rates, thus enabling video telephony over narrow channels, like GSM or a public switched telephone network.

Several problems need to be solved before model-based video communication becomes practical. This thesis treats some of them: how to detect and locate a face in an image, how to extract parameters from an image to adapt a face model, and how to compress the animation parameters.

Many methods for face detection can be found in the literature. Some of them have been reimplemented and evaluated in this work, and a combined scheme using colour discriminants, statistical template matching and a directed, stochastic search heuristic is proposed. Methods for extracting facial features have been studied, and a new framework, combining the strengths of statistical template matching and deformable models, has been implemented. The algorithm works satisfactorily, using only a few seconds on a desktop computer to robustly estimate the locations of the facial features. Finally, a new compression scheme for facial animation parameters is described, showing that such parameters can be transmitted with reasonable quality at bit-rates lower than one kbit/s.


ACKNOWLEDGEMENTS First of all, my thanks go to my supervisors, Dr. Robert Forchheimer, for the inspiration and support, and Dr. Haibo Li, for all ideas and suggestions. I also thank my colleague, Jacob Ström, for fruitful discussions, as well as the other members of the Image Coding Group for friendship and support in various ways. I would also like to thank Dr. Franck Davoine at Université de Technologie de Compiègne, for introducing me to the VIDAS project, and later for inviting me to Compiègne. And naturally my wife Hanna... The work described in this thesis has been funded by the national Swedish SSF project VISIT as well as by the European ACTS project VIDAS. The thesis was finished and published during the author’s stay as a visiting student at Laboratoire Heudiasyc, Université de Technologie de Compiègne, France, made possible by grants from the Swedish Institute and the Telephone Company L M Ericsson’s Foundation for Electrotechnical Research, both thankfully acknowledged by the author.


PUBLISHED MATERIAL Most of the material in this thesis is also included in Deliverable 3.5, Facial Feature Tracking Reference Model, of the European project VIDAS (ACTS contract AC057). The material in Chapter 3, “Facial Feature Extraction”, has, to some extent, been presented at the Swedish Symposium on Automated Image Analysis (SSAB’99), March 1999. The material in Chapter 4, “Face Model Parameter Coding”, is to appear as J. Ahlberg and H. Li: “Representing and Compressing MPEG-4 Facial Animation Parameters using Facial Action Basis Functions” in IEEE Trans. on Circuits and Systems for Video Technology in the first quarter of 1999. The material, in a somewhat earlier form, is also available in a report from Linköpings universitet (Report No. LITH-ISY-R-2010, February 1998). The same material (during development as a Core Experiment for MPEG-4) has also been presented at the MPEG meetings in Fribourg (October 1997), San José (January 1998), and Tokyo (March 1998).


TABLE OF CONTENTS

ABSTRACT
ACKNOWLEDGEMENTS
PUBLISHED MATERIAL
TABLE OF CONTENTS

Chapter 1. INTRODUCTION
  1.1 A Visual Communication System
  1.2 Overview of the Thesis
  1.3 Model-Based Coding
  1.4 Terminology for Face Image Processing
  1.5 MPEG-4 Face Animation
  1.6 Contributions of this Thesis

Chapter 2. FACE DETECTION
  2.1 Introduction
  2.2 Fast Algorithms for Face Detection
  2.3 Deterministic Template Matching
  2.4 Heuristic Detection of Transformed Objects
  2.5 Statistical Pattern Matching
  2.6 Conclusion

Chapter 3. FACIAL FEATURE EXTRACTION
  3.1 Introduction
  3.2 Iris Detection
  3.3 Deformable Line Templates and Dynamic Programming
  3.4 Combining Deformable Templates and Statistical Pattern Matching
  3.5 Deformable Graphs
  3.6 Adapting a Wireframe Model to Extracted Facial Features
  3.7 Conclusion

Chapter 4. FACE MODEL PARAMETER CODING
  4.1 Face Model Parameters
  4.2 MPEG-4 Coding of Local Motion Parameters
  4.3 Facial Mimic, Action Units and Basis Functions
  4.4 The Encoding – Decoding Process
  4.5 Experimental Results
  4.6 Future Improvements
  4.7 Conclusion

Chapter 5. CONCLUSIONS
  5.1 Conclusions Regarding Investigated Tools
  5.2 The Construction of a Model-Based Coding System
  5.3 Future Work

REFERENCES


Chapter 1. INTRODUCTION

This chapter gives an overview of the thesis and describes the concepts of visual communication systems and model-based coding. A terminology for different kinds of facial image processing is then defined. A brief description of MPEG-4 Face Animation, which standardizes a bitstream format for model-based coding, will be given, since it is highly related to this thesis.

1.1 A VISUAL COMMUNICATION SYSTEM A general communication system can be modelled as in Figure 1.1, where a sender sends a message to a receiver via a channel. The channel can, for example, be an ordinary telephone line, a radio channel, or a storage device like a magnetic disc or a piece of paper. If the input and output of the communication system are images, we call it a visual communication system. Today, the most widely used visual communication systems are the analog TV-broadcasting system and the printing of images on paper.

Figure 1.1: A communication system (a sender sends a message to a receiver via a channel).

If the capacity of the channel is limited, the need to compress the messages may arise. Typically, we want to store a lot of images on our hard disks, or transmit them via a network. According to Shannon’s source coding theorem [Sha48], it is possible to represent (code) a message with R bits and recover it without distortion if and only if the entropy H of the message is smaller than or equal to R. If we code the message with fewer than H bits, we will always get some distortion. However, Shannon’s theorem tells us nothing about how this should be done, only that there exists an absolute limit on the amount of compression possible without introducing distortion. Image coding algorithms can thus be divided into two categories: lossy and lossless. The lossless algorithms, e.g., GIF and TIFF (see the text box below), allow perfect reconstruction of the images, while the lossy ones, e.g., TIFF v.6, JPEG and MPEG, achieve higher compression at the cost of some distortion. The lossy algorithms concentrate on high subjective quality, i.e., the distortion should be as hard as possible for the human visual system to discern.

Image coding has been an area of research with a lot of activity for about three decades. Recently it has entered the lives of an ever increasing number of people through the spread of the Internet and the World Wide Web. Probably, anyone who has been surfing the Web has also wished for faster transmission of images.

Here, the focus is on methods for model-based image coding. Model-based coding is usually aimed at lossy coding for very low capacity (bit-rate) channels, like GSM, PSTN or ISDN (see below). The target is to transmit moving facial images, i.e., image sequences containing human faces, at very low bit-rates in, for instance, mobile videophone applications. To achieve this, a computer model of the human face is constructed and adapted to the current image sequence. Parameters describing how to transform

Some Low Bit-rate Communication Channels
• GSM is the European standard for mobile communications developed by Groupe Spéciale Mobile. It currently allows bit-rates of 8 kbit/s.
• PSTN means Public Switched Telephone Network, i.e., the usual (analog) fixed telephone lines. Depending on the modem used and the quality of the line, bit-rates up to almost 60 kbit/s are allowed.
• ISDN, Integrated Services Digital Network, denotes the digital telephone lines, allowing bit-rates of 64 or 128 kbit/s.

Some Image Compression Formats
• GIF, Graphics Interchange Format, is a creation of Compuserve [Com89], supporting lossless compression of images and text.
• TIFF, Tagged Image File Format, is defined by the Aldus Corporation [Ald92]. TIFF version 6 also incorporates JPEG compression.
• JPEG is an international standard for still image compression (lossy and lossless) developed by the Joint Photographic Experts Group, a working group of ISO, the International Organisation for Standardization [Pen93].
• MPEG, the Moving Picture Experts Group, is a working group of ISO in charge of international standards for compression, decompression, processing, and coded representation of moving pictures, audio and their combination. Three standards have so far been defined: MPEG-1 for low-resolution (320 × 240) coding of video at bit-rates of 1 - 1.5 Mbit/s (typically for storing video on CDs); MPEG-2 for higher resolution video for digital TV (4 - 5 Mbit/s) and HDTV (18 Mbit/s); and MPEG-4 for interactive multimedia representation and coding at all bit-rates; see also Section 1.5, “MPEG-4 Face Animation”.


(adapt) the face model to look and move like the face in the image sequence are extracted from the images, compressed, and transmitted through the channel. At the receiver side the face model is transformed according to the decoded parameters, resulting in an image (reasonably) similar to the original.

The Scope of this Thesis This thesis treats how to extract parameters to adapt a face model to an image. Typically, the model is a wireframe model with a texture mapped onto it. The extraction includes the steps of detecting the face in the image, extracting the position and shape of important facial features (eyes, mouth, ...) and adapting the model to the extracted information. Using the extracted parameters, the model should be able to reconstruct the face image. The face model parameters should then be coded to admit efficient transmission in terms of bandwidth and reconstructability at the receiving side of the system. This kind of system is generally referred to as a model-based coding system, and will be explained in more detail in Section 1.3 below.

Limitations The application in mind has been a real-time videophone system running on standard hardware. Standard hardware is thought of as today’s PCs or tomorrow’s cellular phones and palmtops. This implies that algorithms that need minutes or hours on expensive computer systems have been disregarded. There are several issues in the creation of a model-based communication system which are not treated at all in this thesis. The most obvious is the tracking of faces and facial features through an image sequence; only processing of single images is treated here. Also, the synthesis of face images on the decoder side is omitted. Other things, more naturally falling outside the scope of this thesis, are error resilience and network aspects. For the face detection and feature extraction, camera technology and calibration are also of importance, but are omitted here.

1.2 OVERVIEW OF THE THESIS This chapter will introduce the main and related topics of the thesis. The following three chapters will each treat one part of a model-based coding system; namely face detection, extraction of facial features and compression of face model parameters. • Chapter 2, “Face Detection” discusses different methods found in the literature (or developed within the VIDAS* project) for detecting and locating a face in an image. Some methods have been reimplemented and evaluated, to give visual examples of their results.

*) In the VIDAS project [Vid] (within which the work described in this thesis has been performed) most components of a model-based coding system have been studied. Methods for face detection, facial feature extraction, tracking and animation of faces, as well as speech articulation and synchronization have been developed. Also, the VIDAS project has made significant contributions to MPEG-4 in the fields of face animation and mesh compression.


• Chapter 3, “Facial Feature Extraction” describes how to view the facial feature extraction problem as an optimization problem in a multidimensional space, and addresses methods for solving the problem with reasonable complexity. Also, some tools for specific subtasks, like detection of irises, as well as a unified scheme for feature extraction are presented. Finally, a method for adapting a face model to extracted feature data is treated. • Chapter 4, “Face Model Parameter Coding” treats how to compress the extracted face model parameters for efficient transmission in a model-based coding system. • Chapter 5, “Conclusions” discusses the techniques described in the other chapters, what conclusions can be made from each of them, and how they can be combined into a simple model-based coding system. Some directions of future work are also given.

1.3 MODEL-BASED CODING Since the major application of the techniques described in this document is model-based coding, an introduction to that topic will follow here. We briefly describe the principle of model-based coding, its history and terminology. For more details, see [Aiz93, Li93, LiLF94, Pea95].

Basic Principle The basic idea of model-based coding of video sequences is as follows: At the encoding side of a visual communication system, the image from the camera is analysed, using computer vision techniques, and the relevant object(s), for example a human face, is identified. A general or specific model is then adapted to the object; usually the model is a wireframe describing the 3-D shape of the object. Instead of transmitting the full image pixel-by-pixel, or by coefficients describing the waveform of the image, the image is handled as a 2-D projection of 3-D objects in a scene. To achieve this, parameters describing the object(s) are extracted, coded and transmitted. Typical parameters are size, position and shape. To achieve acceptable visual similarity to the original image, the texture of the object is also transmitted. The texture can be compressed by some traditional image coding technique, but specialized techniques lowering the bit-rate considerably in certain applications have recently been published [Str97, Str98]. At the receiver side of the system, the parameters are decoded and the decoder’s model is modified accordingly. The model is then synthesized as a visual object using computer graphics techniques, e.g., a wireframe is shaped according to the shape and size parameters and the texture is mapped onto its surfaces. For the following images, parameters describing the change of the model are transmitted. Typically, those parameters tell how to rotate and translate the model, and, in the case of a non-rigid object like a human face, parameters describing the motion of individual vertices of the wireframe are transmitted. This constitutes the largest gain of model-based coding; as will be shown in Chapter 4, the motion parameters can be transmitted at very low bit-rates.


Usually, but not necessarily, a model-based coder is implemented as an analysis-by-synthesis coder, i.e., the encoder contains a feedback loop, thus holding a copy of the decoder, enabling it to use the synthesized image and compare it to the original image. By minimizing the residual image, the encoder can optimize the transmitted parameters. This also permits the encoder to transmit a residual image, to improve image quality. The principle of an analysis-by-synthesis coder is illustrated in Figure 1.2.

Figure 1.2: The principle of an analysis-by-synthesis coder. The encoder analyses the image sequence and transmits model parameters (and possibly a residual image) over the channel; since the encoder contains its own synthesis stage and model, it can compare the synthesized image to the original, while the decoder synthesizes the output image from its model and the received parameters.
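To make the feedback loop concrete, the sketch below outlines one encoder iteration in Python. It is only an illustration of the structure in Figure 1.2 under assumed interfaces: the analyse, synthesise and perturb callables are hypothetical placeholders supplied by the caller, not functions defined in this thesis or in any standard.

```python
# Minimal sketch of the encoder side of an analysis-by-synthesis coder.
# `analyse`, `synthesise` and `perturb` are hypothetical callables supplied by
# the caller: image analysis, model rendering and parameter refinement.

import numpy as np

def encode_frame(frame, model, analyse, synthesise, perturb, n_iter=10):
    """Estimate model parameters for one frame by minimizing the residual image."""
    params = analyse(frame, model)                   # initial estimate from image analysis
    residual = frame - synthesise(model, params)     # encoder holds a copy of the decoder
    best_cost = np.sum(residual ** 2)
    for _ in range(n_iter):                          # refine against the synthesized image
        candidate = perturb(params)
        cand_residual = frame - synthesise(model, candidate)
        cost = np.sum(cand_residual ** 2)
        if cost < best_cost:
            params, residual, best_cost = candidate, cand_residual, cost
    # `params` are coded and transmitted; `residual` may optionally be sent as well.
    return params, residual
```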

Terminology Depending on the situation, the model-based coding system can either be equipped with a general model, e.g., a 3-D wireframe being able to adapt to any object, or a more specific model, like a model of the human face, able to utilize the knowledge of this specific kind of object. Using the terminology from [LiLF94], the first case is called object-oriented coding, while the latter is called semantic coding, see Figure 1.3. There can also be reasons to differ between two kinds of semantic coding. One is when we know in advance which specific model to use. The other is when each object, during the image analysis, is recognized as a member of a class, and the specific model corresponding to that class is selected automatically. The first case is sometimes called knowledge-based coding, thus reserving the term semantic coding for the highest level of intelligence in the coder [Har89], see Figure 1.3. In other papers, the term semantic coding indicates that the transmitted parameters are in some way semantically meaningful, like telling a face model to perform a specific facial action (e.g., wrinkle the nose) in contrast to moving vertices only. In this thesis, only the case where it is known in advance that the relevant object is a human face will be treated, and the transmitted parameters are to some degree semantically meaningful.

Figure 1.3: Terminology for model-based coding (model-based coding comprises object-oriented coding and semantic coding; the latter is further divided into knowledge-based coding and semantic coding).

History of Model-Based Coding Before the introduction of model-based coding as we know it today, models for computerized facial animation were created by F. Parke [Par82] for computer graphics purposes. Parke suggested that his facial models might be used for data compression, but he did not pursue this himself. Another important precursor is the Facial Action Coding System (FACS) developed by P. Ekman and W. V. Friesen [Ekm77] in the seventies (probably inspired by the somewhat earlier work done by C. H. Hjortsjö [Hjo69]). The goal of P. Ekman and W. V. Friesen was to investigate the relation between emotion and facial expression, and as a part thereof, FACS was developed as a way to quantitatively measure the intensity of facial actions. Each minimal facial action, for example a lip stretch or a nose wrinkle, was called an Action Unit, and by combining the about 50 different Action Units, every facial expression should be possible to analyze. The Action Units have later on become popular as parameters for animation of human faces, and are probably still the most widely used system. The modern concept of model-based coding of human faces, or “coding through animation”, was introduced by R. Forchheimer and O. Fahlander [For83]. This was soon followed up by a system estimating Action Units from a video sequence [For84], using a simple wireframe face model suitable for real-time animation known as CANDIDE [Ryd87], see Figure 1.4. CANDIDE is still a popular model, but nowadays facial animators also use more advanced models.

Figure 1.4: The face model CANDIDE.

Model-based coding made a big leap in 1987, when W. J. Welsh [Wel87] and K. Aizawa [Aiz87] introduced texture-mapping into the scheme. Until then, only artificial textures, if any, had been used. The texture-mapping technique opened the door to photorealistic model-based coding. The work of W. J. Welsh and K. Aizawa also made the concept of model-based coding more popular as a research topic, convincing people that good-looking image communication could be possible this way. The next major step in the history of model-based coding was the introduction of analysis-by-synthesis coding, as described above. During the nineties, the problem of image analysis for model-based coding has been addressed, and, as mentioned earlier, the face models have been developed further.


1.4 TERMINOLOGY FOR FACE IMAGE PROCESSING Several kinds of image processing are needed to build a model-based coding system, and there are several kinds of analysis, synthesis and processing of images containing human faces. We shall here try to classify them and relate them to the main topic of this thesis: visual communication by model-based coding. The six classes of face image processing mentioned here are Face Detection, Facial Feature Extraction, Face/Facial Feature Tracking, Face Recognition, Face Parameter Coding (including Facial Texture Compression), and Face Synthesis. It should be noted that most of these classes of face image processing are closely connected, and in several cases the same techniques can be used in more than one class. For example, statistical pattern matching can be used for face detection as well as for facial feature extraction and tracking. Also, the purpose of one class may be to benefit the execution of another; for example, the purpose of facial feature tracking might be to allow good face parameter coding, and a face parameter coding technique is never good if it does not allow high quality face synthesis. Each class of face image processing is described below.

• Face Detection: Given an image, face detection algorithms try to answer two questions: a) Is there a face in the image? and b) If there is a face in the image, where is it located? The second question can also include the issues of size and orientation of the face. Naturally, a face detection algorithm is vital for model-based coding, and in fact for most face image processing applications. Face detection has been subject to research in many places of the world. Different approaches will be described in Chapter 2.

• Facial Feature Extraction: A facial feature is an attribute in some way describing the face. In this context we will limit ourselves to two kinds of facial features: specific points and contours. A facial feature point is a specified point in the face, such as the left corner of the mouth. The position and motion of the feature points can well describe the shape, global motion and expression of the face. For example, the motion of the corner points of the mouth describes facial expressions, coordinates of feature points on the forehead can describe the head shape, and movement of the inner eye corners can describe global motion. On the other hand, a contour feature is the borderline between facial areas, for example the borderline between lips and skin, the edges of the eyelids and the border of the eyebrows. Contour features can often be interpolated from a set of feature points. Given an image and the location of the face, facial feature extraction is consequently the process of extracting coordinates of feature points or parameters describing contours. From the extracted facial features, face model parameters can be calculated for transmission over the channel. Facial feature extraction and model adaptation are described in Chapter 3.


• Face Tracking or Facial Feature Tracking: Tracking of a face or facial features is the process of following a detected face or extracted facial features through an image sequence. The algorithm used in this process can be identical to face detection or facial feature extraction (by applying those techniques on each image in the sequence). However, it is usually possible to perform this task in a much less complex way, using the knowledge from the earlier image(s) in the sequence. Lowering the complexity in this way is often necessary to make the system run in real-time. The tracking problem is not addressed in this thesis; see instead [Li93].

• Face Synthesis: Face synthesis is the task of rendering photorealistic images of faces from a set of parameters, and should (for instance) be performed at the receiving end of a model-based communication system. Different face models as well as modelling techniques may be used, influenced by and influencing the choice of face coding algorithms.

Using algorithms for face detection, feature extraction, tracking and synthesis, a model-based coder can be constructed as shown in Figure 1.5.

Figure 1.5: A model-based (analysis-by-synthesis) coder. The image sequence passes through face detection, facial feature extraction and face model adaptation, producing face model parameters; facial feature tracking and face synthesis provide feedback.

• Face Model Parameter Coding: With face model parameter coding we mean the task of representing a set of face model parameters in an efficient way for transmission over a channel of limited capacity. Typical face model parameters are the location of the face (given by the face detection algorithm), parameters describing facial shape and motion (given by the facial feature extraction/tracking algorithms) and facial texture (extracted from the image using facial feature information), see Figure 1.6. The purpose, and the requirement, is that the parameters can be decoded so that the face can be reconstructed with reasonable image quality. Face model parameter coding will be discussed in Chapter 4.


Figure 1.6: Face model parameter coding. The face model parameters are divided into shape parameter coding (FDP points, vertex coordinates, model deformation parameters, ...), animation parameter coding (FAPs, Action Units, ...) and texture coding (FDP texture, eigenface coefficients, ...).

• Face Recognition: Face recognition includes the tasks of, given an image of a face, identifying a person or verifying the person’s identity. Face recognition will not be treated in this thesis; however, many of the techniques described here can be used in face recognition systems. For a good collection of face recognition methods, see [Wec98].

1.5 MPEG-4 FACE ANIMATION The newly set standard MPEG-4, which was made an international standard in early 1999 (MPEG-4 version 1), includes definitions of how to decode parameters for facial animation. The MPEG-4 Face Animation standard is not intended for model-based coding only, but also has other purposes. Another application is the animation of synthetic faces (where the parameters are not extracted from a “natural” image sequence), such as virtual humans (avatars) and cartoons. However, the MPEG-4 standard is able to support model-based coding, if only in a limited sense. MPEG-4 Face Animation has been developed within SNHC, the MPEG Working Group for Synthetic/Natural Hybrid Coding, and is included in the Final Drafts of International Standard on Systems [MPS98] and Visual [MPV98a]. A very brief description of MPEG-4 Face Animation and its terminology will follow here.

Face Models In the context of MPEG-4, a face model is usually a wireframe model, even if this is not explicitly specified. Each decoder compliant to MPEG-4 Face Animation is required to include a face model. This proprietary model can be of arbitrary complexity. The encoder is thus unaware of the nature of the model used by the decoder, which prevents predictability and thus transmission of relevant residual information (as in Figure 1.2). The solution to the predictability problem is to download a face model, which is done through a Face Animation Table (FAT). A FAT defines a number of vertices, how they are connected, and how they should be animated in response to the animation parameters (see below).


Feature Points and Static Face Parameters MPEG-4 defines 84 facial feature points, which are regarded as crucial for controlling the shape and motion of a human face. The feature points are illustrated in Figure 1.7, and a textual definition can be found in [MPV98a].

Figure 1.7: The facial feature points as defined by MPEG-4. Picture (from [MPV98a]) by Roberto Pockaj, Univ. of Genoa.

The static parameters describing a face are shape and texture. These are represented by Facial Definition Parameters (FDPs). The shape of the face, i.e., the shape of the face in neutral position (eyes open, mouth closed, etc.), is defined by the 3-D coordinates of the feature points – so called FDP points. When receiving FDP points, the decoder is required to reshape its face model accordingly, so that all the feature points positioned according to the FDP points are touched by one of the surfaces of the model. Note that the exact behaviour of the face model is unknown at the encoder side, as long as a FAT has not been received. In a point-to-point communication scenario, it is of course possible that the encoder knows which face model the decoder uses.

The next step is to define the texture of the face. This is done by transmitting a set of textures (FDP texture), each describing the texture of the facial skin, the tongue, the teeth and the eyes respectively. In an FDP texture, texture coordinates, i.e., the positions of all feature points within the textures, are specified. These coordinates define how the texture should be mapped onto the model, and are necessary for the decoder to use the texture. Note that it is not possible to update the texture (for example by residual information), except by replacing it with a new texture.

Animation The most important function of MPEG-4 Face Animation is how to make the face move. This is carried out through the Facial Animation Parameters (FAPs). There are 68 parameters in the FAP set, out of which 66 are low-level FAPs, each denoting a movement of one facial feature point in one out of three dimensions. For example, FAP #12 (raise_l_cornerlip) denotes the vertical displacement of the left inner lip corner. The two remaining FAPs, FAPs #1 and #2, are high-level FAPs, describing visemes and facial expressions. A viseme is the visual correspondence to a phoneme, describing the shape of the mouth when pronouncing a specific phoneme. There is no one-to-one mapping between phonemes and visemes, though. For example, the phonemes [p], [b] and [m] all correspond to the same viseme (called “pbm”). An MPEG-4 expression consists of a selected type and an intensity level, e.g., “surprise 44” or “sadness 18”. Two expressions and two visemes can be dealt with simultaneously. Exactly how the decoder deals with the high-level FAPs depends on the specific face model, and can consequently be completely unknown to the encoder.
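As a toy illustration of how low-level FAPs drive a face model, the sketch below displaces named feature points by FAP values scaled by a face-dependent unit. The table entry, the unit value, the sign convention and the point names are invented for this example; they are not the normative MPEG-4 definitions.

```python
# Toy illustration of low-level FAP decoding. Each low-level FAP moves one
# facial feature point along one axis; here the displacement is the FAP value
# scaled by a face-dependent unit. The table below is a made-up fragment,
# not the normative MPEG-4 FAP definition.

FAP_TABLE = {
    12: ("left_inner_lipcorner", "y", +1),   # raise_l_cornerlip (axis/sign assumed)
}

def apply_faps(neutral_points, fap_values, unit):
    """Return feature-point coordinates displaced according to low-level FAPs."""
    axis = {"x": 0, "y": 1, "z": 2}
    points = {name: list(p) for name, p in neutral_points.items()}
    for fap_id, value in fap_values.items():
        name, ax, sign = FAP_TABLE[fap_id]
        points[name][axis[ax]] += sign * value * unit
    return points

# Example: raise the left inner lip corner by 30 units.
neutral = {"left_inner_lipcorner": (0.0, -30.0, 10.0)}
print(apply_faps(neutral, {12: 30.0}, unit=0.2))
```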

Face Animation Objects Not every MPEG-4-compliant decoder has to be able to handle all the parameters described above. If someone wishes to build a very simple terminal, for example a cellular phone using MPEG-4 Face Animation, it would add significant and unwanted complexity to allow face models of high complexity to be downloaded and used. To allow inexpensive, MPEG-4-compliant terminals, two different object types are introduced:
• A decoder compliant to the Simple Face Object is required to decode FAPs only. The main target for this object type is broadcasting applications, and thus downloading capabilities (FAT, FDP) are not required.
• The Main Face Object includes all techniques mentioned above. That is, the decoder is required to reshape the decoder’s face model according to received FDP points, and to map FDP texture on the model. Downloading of face models by FAT is also required. Naturally, animation according to the FAPs is required as well. This object type is not included in MPEG-4 version 1, but is scheduled for version 2 (becoming an international standard in the beginning of 2000).


Note that decoders compliant to the simple object type are perfectly allowed to decode and use data required by the main object type only.

MPEG-4 and Face Image Processing

The MPEG-4 parameter set applies to all of the face image processing techniques above, with the exception of face recognition. An MPEG-4 face object is represented as a video object of a certain location and extension, given by a face detection algorithm. FDPs are then given by facial feature extraction, and the FAPs by facial (feature) tracking. Regarding face coding, compression of FAPs and FDP texture is defined in MPEG-4. Face synthesis is, naturally, the decoding and rendering of the MPEG-4 face object. The coded face parameters in MPEG-4 are thus the FDP texture, compressed using the scaleable wavelet still image coder included in MPEG-4 [MPV98a], and the FAPs, compressed using predictive/arithmetic coding.

1.6 CONTRIBUTIONS OF THIS THESIS The main contributions of this thesis are contained in Chapters 3 and 4. In Chapter 3 a new approach to facial feature extraction is presented, combining statistical template matching with a combinatorial approach (deformable template matching). The method has been implemented, with promising results. To improve performance, a more general structure, a deformable graph, has been implemented as well. In Chapter 4 a new method for compression of facial animation parameters is presented. The method is shown to be superior to earlier methods in terms of PSNR vs. bit-rate.


Chapter 2. FACE DETECTION

In this chapter different methods for detecting and locating a face in an image are discussed. First, fast and quite simple methods using edge and colour information will be described. Then, deterministic and statistical template matching is treated. To reduce the complexity, a directed stochastic search method – simulated annealing – is described. Some of these methods are combined into the face detection scheme which is described last.

2.1 INTRODUCTION As stated in Chapter 1, face detection algorithms aim at, given an image, answering two questions: a) Is there a face in the image? and b) If there is a face in the image, where is it located? In some applications, it is assumed that there is a face in the image, and the face detection algorithm will then only locate the face. The second question can also treat the issues of size and orientation of the face. Naturally, a face detection algorithm is vital for model-based coding, and in fact for most face image processing applications. Although both these problems are usually simple tasks for the human visual system, they have proven to be difficult tasks for machine vision. Considering the face as an image pattern, it is challenging to detect, since faces can be very different, but still have the same basic structure. Also, due to facial expressions, lighting and pose, one and the same face can have quite different appearances at different times. Many different methods for face detection have been developed. The most reliable, so far, have been connectionist methods (based on neural networks) [Row96, Vie97]. However, these systems are very computationally demanding, and are therefore not very interesting in applications where real-time or near real-time performance is required. Such methods will therefore not be treated here; however, a brief comparison is made at the end of the chapter. Instead, our focus will be on simple algorithms, with real-time capability, and statistical methods that run in near real-time on a desktop computer of today.


Attribute-Based vs. Template-Based Methods There are two basic classes of face detection methods: attribute-based (or feature-based) and template-based methods. Using an attribute-based method, specific attributes like elliptic shape, pixels of a certain colour, or facial features are searched for in the image. Such methods are usually quite fast, and some attribute-based methods will be discussed in the next section. Also, the methods used for facial feature extraction discussed in Chapter 3 could be regarded as feature-based face detection methods – by extracting facial features from an image, we are likely to detect the face. Using a template-based method, we give an example (or a set of examples), and try to find objects similar to the example(s). Template-based methods tend to be more robust but also more computationally demanding than attribute-based methods. The basic principle is that given an example as an image template T, we search for the part of the image I that best matches T. If there are geometrical distortions, like varying pose or facial expressions, we need to do some geometrical normalization. If we want to detect a class of objects, e.g., faces belonging to different persons, we give a set of examples and extend the template to a subspace. Methods for this are discussed in Sections 2.3 - 2.5.

2.2 FAST ALGORITHMS FOR FACE DETECTION In some situations, for instance in image analysis for video telephony, it is assumed that a face within a certain size range is present in the image. A fast way to detect the position of a face is to look for an approximately elliptical object. This approach has been used within the European project SCALAR [Sca]. In this project, the purpose is to find the face area and define it as a region of interest for high quality coding in an H.263+ hybrid coding scheme. The process used in SCALAR is performed in three steps, as illustrated in Figure 2.1. First, the image is filtered with edge detecting filters (Sobel filters in four directions). Second, the filtered image is thresholded, with a threshold such that 80% of the image pixels are set to zero. Third, an ellipse that fits the detected edges is computed. There are also face detection methods using different kinds of Hough transforms to detect faces, described in, e.g., [Mai97].

Figure 2.1: Face detection by edge detection and ellipse fitting (edge filter → threshold → ellipse fit).
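A rough sketch of the three steps is given below. The Sobel kernels and the moment-based ellipse fit are one plausible realization of the steps described above, chosen for this illustration; the filters and fitting method actually used in SCALAR may differ.

```python
# Rough sketch of the three-step detector: Sobel edge filtering, thresholding so
# that 80% of the pixels are set to zero, and an ellipse fit. The moment-based
# fit is one possible way to realize the "ellipse fit" step.

import numpy as np
from scipy.signal import convolve2d

def detect_face_ellipse(gray):
    """gray: 2-D float array. Returns (centre, semi-axes, orientation angle)."""
    # Step 1: edge filtering with Sobel-type kernels in four directions.
    sx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], float)
    d1 = np.array([[0, 1, 2], [-1, 0, 1], [-2, -1, 0]], float)
    kernels = [sx, sx.T, d1, d1[::-1]]
    edge = np.max([np.abs(convolve2d(gray, k, mode="same")) for k in kernels], axis=0)

    # Step 2: threshold so that 80% of the image pixels are set to zero.
    edge_map = edge > np.percentile(edge, 80)

    # Step 3: fit an ellipse to the remaining edge pixels via image moments.
    ys, xs = np.nonzero(edge_map)
    centre = (xs.mean(), ys.mean())
    eigvals, eigvecs = np.linalg.eigh(np.cov(np.stack([xs, ys])))
    semi_axes = 2.0 * np.sqrt(eigvals)            # rough semi-axes, up to a scale factor
    angle = np.arctan2(eigvecs[1, -1], eigvecs[0, -1])
    return centre, semi_axes, angle
```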

The above methods aim to detect and locate the face in the image by searching for shape and facial features. A drawback with this approach is the rather large risk of failure in case of a rotated or partly occluded face. In contrast, colour cues are invariant to rotation and robust to (partial) occlusion.


It is not easy to use the colour information though, mainly because different people have different facial colour, depending on race, make-up, and individual variations. One solution, adopted by [LiF93], is that each person should use his/her own colour information, i.e., the colour information of the specific person should be known in advance. Also, the classification must be calibrated for each camera used. Specifically, different gamma corrections must be compensated for. Another approach is presented in [Sob96], in which the shape as well as the colour information is used. First, skin-like regions are located by performing colour segmentation. Then, regions of nearly elliptical shape are searched for. Other combinations, like colour and motion, have also been used, see, e.g., [Sun97]. A more complex method, segmenting the image into small areas, inserting them in a tree structure and searching the tree for the most face-like merging, is described in [Vil98]. Two methods of colour classification are described below.

Colour Classification using Colour Histograms In [LiF93] colour is classified using a colour histogram. A typical face colour histogram is created, and then the histogram of each image area to be classified is projected on this histogram. The resulting coefficient measures the similarity of the area’s colour distribution to a typical face colour distribution. To create the typical face colour histogram f, a set of training pixels is coarsely quantized, and a colour histogram is calculated. The quantization is used to decrease the number of bins in the histogram – otherwise, using 24-bit graphics, the histogram would contain 2^24 bins, which would require an inconveniently large amount of storage space. Typically, the colour space is quantized to three bits per component, which leaves us with a histogram of 2^9 = 512 bins. This quantization also makes the method more robust to small colour changes.

When locating a face in an image I, a confidence image C is computed using the histogram backprojection algorithm [Swa91]. The three-step algorithm is as follows. First, from I, a histogram h is created in the same way as f was. Also, h should be normalized so that the bins of f and h sum to the same number. Second, for each bin h_k in h, a penalty value p_k is created as

    p_k = \begin{cases} \min(f_k / h_k,\, 1) & \text{if } h_k > 0 \\ 1 & \text{if } h_k = 0 \text{ and } f_k > 0 \\ 0 & \text{if } h_k = 0 \text{ and } f_k = 0 \end{cases} \;=\; \min\!\left(\frac{f_k}{\max[h_k, \varepsilon]},\, 1\right)        [Eq. 2.1]

where ε is a small value preventing division by zero. Third, for each pixel in the input image, the corresponding penalty value p_k is found, i.e.,

    C(x, y) = p_{\mathrm{bin}(I(x, y))}        [Eq. 2.2]

where bin(I(x, y)) returns the index of the histogram bin which accounts for the pixel value I(x, y). The output image C now contains, for each pixel, a confidence value for that pixel belonging to a face.
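The backprojection step is easily expressed in code. The sketch below follows Eqs. 2.1 - 2.2 directly, quantizing to three bits per component (512 bins); it assumes 8-bit RGB images stored as numpy arrays, which is an assumption of this illustration rather than anything prescribed in [LiF93].

```python
# Sketch of histogram backprojection (Eqs. 2.1-2.2) with 3 bits per colour
# component, i.e. 512 histogram bins. Images are assumed to be 8-bit RGB
# numpy arrays of shape (H, W, 3); training pixels have shape (N, 3).

import numpy as np

def colour_bins(rgb):
    """Quantize 24-bit colours to 3 bits per component -> bin index in [0, 511]."""
    q = rgb.astype(np.uint16) >> 5                       # keep the 3 most significant bits
    return (q[..., 0] << 6) | (q[..., 1] << 3) | q[..., 2]

def face_colour_histogram(face_pixels):
    """The typical face colour histogram f, built from known face pixels."""
    return np.bincount(colour_bins(face_pixels), minlength=512).astype(float)

def backproject(image, f, eps=1e-6):
    """Confidence image C: penalty value of each pixel's colour bin."""
    bins = colour_bins(image)
    h = np.bincount(bins.ravel(), minlength=512).astype(float)
    h *= f.sum() / max(h.sum(), eps)                     # make f and h sum to the same number
    p = np.minimum(f / np.maximum(h, eps), 1.0)          # Eq. 2.1
    return p[bins]                                       # Eq. 2.2
```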

Colour Classification using Linear Discriminants It has been observed that the luminance is not very useful for classifying colour as face/non-face, since it is highly dependent on lighting. Thus the classification can be made in the 2-D [Cr, Cb] space (the chrominance plane in the [L, Cr, Cb] space). By collecting statistics of face and non-face pixels, the area in the chrominance plane corresponding to face pixels can be characterized, e.g., using linear discriminants. To illustrate, two images (shown in Figure 2.2) were selected as training images. Some (almost 200) pixels in each face have been marked as face pixels, and these pixels have been plotted in the chrominance plane, see Figure 2.3. Four linear discriminants have then been placed around the clusters. The linear discriminants are

    138 < Cr < 178, \qquad 200 < Cb + 0.6 \cdot Cr < 215        [Eq. 2.3]

These linear discriminants were then used to classify pixels in eight other images. The result after median-filtering (with filter size 7 × 7 ) can be seen in Figure 2.4. As can be seen, the method works fairly well, but is hardly stable enough for a final detection or segmentation. However, the algorithm is very fast and can be used as a first step, finding areas in the image that possibly contain a face. In that case, the discriminants should be very generous, making the risk of missing a face very small. In conclusion, this is a very useful method, and we will adopt it as a pre-processing step. An example of using a statistical method for colour classification is shown later in this chapter.
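As an illustration of how little computation the discriminants require, the sketch below applies Eq. 2.3 to 8-bit Cr and Cb planes (assumed to be available as numpy arrays) and cleans the resulting mask with a 7 × 7 majority filter standing in for the median filtering mentioned above.

```python
# Sketch of the linear-discriminant skin classifier of Eq. 2.3.
# cr and cb are 2-D numpy arrays holding the 8-bit chrominance planes.

import numpy as np

def skin_mask(cr, cb):
    """Binary mask of pixels whose chrominance falls inside the four discriminants."""
    s = cb + 0.6 * cr
    return (cr > 138) & (cr < 178) & (s > 200) & (s < 215)

def majority_filter_7x7(mask):
    """7x7 majority vote, a simple stand-in for median filtering a binary mask."""
    padded = np.pad(mask.astype(int), 3)
    h, w = mask.shape
    acc = np.zeros((h, w), int)
    for dy in range(7):
        for dx in range(7):
            acc += padded[dy:dy + h, dx:dx + w]
    return acc > 24          # more than half of the 49 neighbours are skin pixels
```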

2.3 DETERMINISTIC TEMPLATE MATCHING The most obvious way to look for a face (or any other object) in an image is by template matching, where one uses a template, i.e., an image containing the face only, and matches it to each part of the image. This is called deterministic template matching, and is very efficient when the exact appearance of the object is known. Generally, however, the object may be transformed, for example by rotation, or the exact appearance of the object even without transformations may be unknown. In the following sections normalized template matching will be described and extended to handle transformed objects. Finally, methods for statistical template matching will be treated.

Figure 2.2: The two training images, “Haibo” and “Hanna”.

Figure 2.3: Colour distribution in the [Cr, Cb] plane and four linear discriminants (one scatter plot per training image, with Cb and Cr on the axes).

Figure 2.4: Images with the pixels classified as non-face pixels set to zero (black).

Normalized Template Matching The standard detection paradigm in image processing is that of normalized template matching (matched filtering), where a template T(x, y), 0 ≤ x < x_T, 0 ≤ y < y_T, is correlated with a part of the image I(x, y). The normalized correlation coefficient ρ(x_0, y_0) will be used as a measure of how well the template fits the image at the position (x_0, y_0) according to

    \rho(x_0, y_0) = \frac{\sum_{x,y} T(x, y)\, I_T^{x_0 y_0}(x, y) - \mu_T \mu_I}{\sigma_T \sigma_I}        [Eq. 2.4]

where

    I_T^{x_0 y_0}(x, y) = I(x + x_0, y + y_0), \quad 0 \le x < x_T, \; 0 \le y < y_T        [Eq. 2.5]

and μ_T, μ_I, σ_T and σ_I are the means and variances of T and I_T respectively. The summation should go over all non-zero pixels in T. ρ is a measure of how well the image part I_T can be approximated by the template T, and in the case of a perfect match, ρ = 1, there exist a and b so that aT + b = I_T. The object is found by looking for a large value of ρ.

This approach is optimal for detecting a known (deterministic) object in an image disturbed by additive white Gaussian noise, but in face detection applications there are two major drawbacks: First, if the object has been transformed in some way, for example scaled in size, rotated or reshaped, the matched filter may not find the object. In that case, the same transformation first has to be applied to the matched filter. If the parameters of the transformation, e.g., the size of the object, are unknown, several versions of the template have to be created and matched to the image. In this way, the computational complexity increases heavily. Second, if the exact (un-transformed) appearance of the object is unknown, the matched filter cannot be constructed. This is the case when looking for an instance of a class of objects where only some general properties of the class are known. The trivial solution of using a template for each possible member of the class is certainly too computationally complex to be considered in any practical application. In conclusion, other methods must be used. Heuristics for detection of transformed objects and statistical pattern matching for detecting instances of classes of objects will be discussed below.
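The exhaustive search that the following sections try to avoid can be written down in a few lines. The sketch below evaluates the correlation coefficient at every position, using the equivalent zero-mean, unit-variance formulation of Eq. 2.4; it is an illustration only and makes no attempt at efficiency.

```python
# Sketch of exhaustive normalized template matching. The coefficient is
# computed in the equivalent zero-mean, unit-variance form of Eq. 2.4.

import numpy as np

def normalized_correlation(image, template):
    """Correlation coefficient for every template-sized window of the image."""
    ih, iw = image.shape
    th, tw = template.shape
    t = (template - template.mean()) / (template.std() + 1e-12)
    rho = np.full((ih - th + 1, iw - tw + 1), -1.0)
    for y0 in range(rho.shape[0]):
        for x0 in range(rho.shape[1]):
            window = image[y0:y0 + th, x0:x0 + tw]
            w = (window - window.mean()) / (window.std() + 1e-12)
            rho[y0, x0] = np.mean(t * w)      # 1 means a perfect match aT + b = I_T
    return rho

# The best match is at the largest coefficient:
# y, x = np.unravel_index(np.argmax(rho), rho.shape)
```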

2.4 HEURISTIC DETECTION OF TRANSFORMED OBJECTS As mentioned above, the computational complexity of template matching with a variable template is very high. To reduce the complexity a search algorithm is needed, for example gradient descent. The problem with gradient descent is that it is a greedy algorithm which very likely will get stuck at a local optimum. Another way is to use an undirected stochastic search heuristic, for example by randomly testing template positions and then


pick the best of the tested ones. For this to be likely to find even a local optimum, a large portion of the possible positions needs to be tested. The compromise is a directed stochastic search heuristic, such as simulated annealing, which is described below. Also, when trying to match deformed templates to the image, the probable ways of deformation need to be known. In the case of a rigid object, we only need to deform it by scaling and rotation, which leaves us with quite few degrees of freedom. In the case of non-rigid objects, or object classes where the elements have different shapes (like human faces), principal deformations calculated from a database of object shapes could be used.

Simulated Annealing In physics, annealing refers to the process of slowly cooling molecules until they form a perfect crystal. The analogous search algorithm is iterative, and in each iteration a set (or several sets) of parameters is tested. If the new set of parameters is better, according to some evaluation function, than the set in the last iteration, it is accepted. However, there is a small probability that the new parameter set is accepted even if it is not better than the old one. This probability depends on how much better the old parameters are, and also on the simulated temperature. In this way, the algorithm is able to jump out of a local minimum. The temperature decreases as the process goes on, thus decreasing the probability of accepting worse parameters. In each iteration, a new set of parameters is created by randomly perturbing the old set according to some distribution that depends on the temperature. The process is iterated a pre-specified number of times, or until convergence has been reached. In a step-by-step manner, the algorithm can be described as follows:

1. Start with a parameter space Ω and a parameter evaluation function f : Ω → R. The goal is to find the parameter set ω ∈ Ω that maximizes f(ω).

2. Define a stochastic parameter perturbation function G : Ω × R → Ω, a convergence criterion, and an annealing schedule. An annealing schedule is a sequence of temperatures {T_i} and a number j telling the time (number of tries) to spend at each temperature.

3. Pick an ω_i ∈ Ω as ω_i = G(ω_{i−1}, T_i) and let a_i = f(ω_i) (i is the iteration count).

4. If a_i > a_{i−1}, then accept ω_i and go to step 8.

5. If exp((a_i − a_{i−1}) ⁄ T_i) > ξ_i, where ξ_i is a random variable drawn from a uniform distribution ranging from zero to one, then accept ω_i and go to step 8.

6. Repeat steps 3 - 5 j times.

7. If ω_i was never accepted, then let ω_i = ω_{i−1}.

8. Let i = i + 1 and repeat steps 3 - 7 until convergence.

19


Example of Simulated Annealing A simulated annealing search will here be exemplified by a template matching process. The purpose is to find the coordinates in the image where the template matches the image best. In this example the position of the template is the only parameter; natural additional parameters are scale and rotation. The initial assumptions are as follows: • The parameter space is the ( x, y ) pixel coordinates of the possible locations of the template.

0.9

0.8

0.7

0.6 Correlation

• The parameter evaluation function is the normalized correlation of the template and the image at position ω i = ( x i, y i ) according to Eq. 2.4.

Correlation, max 1.00, final 1.00 1

0.5

0.4

0.3

• As convergence criterion, a maximum number of 500 iterations is defined. Also, if the correlation equals one, the algorithm is halted.

0.2

0.1

0

0

20

40

60

80

100 Iterations

120

140

160

180

200

Figure 2.5: The correlation coefficient • The stochastic perturbation function is accepted in each iteration. ω i = ( x i, y i ) = ( x i – 1, y i – 1 ) + ( X i, Y i ) , where X i and Y i are independent Gaussian variables with zero mean and variance c ⋅ T i . The constant c is, in our example, set to ten times the image width and height respectively. Since a zero perturbation is meaningless, it is not allowed.

• The annealing schedule is defined by T 0 = 5 , T i = T 0 ⁄ i and j = 20 . A 20 × 10 pixel template (the right eye) has been cut out from the image “Haibo”, and has then been matched to the same image. As shown in Figure 2.5, the correlation coefficient is unstable in the beginning of the process, but stabilizes as the temperature is decreasing and the correlation is increasing. After 184 iterations the algorithm halted, since the correlation equalled one. By then, 6% of the image had been searched. The positions that were checked and accepted are shown in Figure 2.7. Figure 2.6: The correlation coefficient If the same thing is done, but with white, accepted in each iteration (noisy image). Gaussian noise added to the image, the convergence is, of course, slower. Approximately 400 iterations were needed before the location was stable, as is shown in Figure 2.6. By then the algorithm had searched 18% (see Figure 2.8) of the image. Assuming that the correlation is the only time-consuming part of the algorithm, this means that the search is 0.9


Figure 2.7: The positions in the image that were checked by the simulated annealing algorithm (left), and the positions that were accepted in (at least) one of the 184 iterations (right).

Figure 2.8: The positions in the image that were checked by the simulated annealing algorithm (left), and the positions that were accepted in (at least) one of the 500 iterations (right).


The large gain of using a heuristic search method such as simulated annealing is achieved when the parameter space has several dimensions. The simulated annealing scheme can also be combined with other methods. For example, M. Betke and N. C. Makris [Bet94] report computing times of 15 seconds using simulated annealing and gradient descent for finding an object in a noisy image, as compared to 10 hours using exhaustive search. In their example, simulated annealing is used for the position only, and gradient descent for the other parameters (rotation and scale). A fast face detection algorithm using simulated annealing will be discussed later in this chapter.

2.5 STATISTICAL PATTERN MATCHING In this section, we describe how to search for an object belonging to a specific class of objects, and also how to combine that with the heuristic search for deformed objects as described above. Only very basic theory is treated here; for a more thorough study of stochastic signal detection, see [Hel95]. For further treatments on eigenspaces for faces and non-faces, see [Cos96, Mog97, Str98, Sun94, Tur91].

Basic Signal Detection Theory

Assume we have a scalar-valued observed quantity. On the basis of this observation, a decision between two hypotheses, H0 and H1, shall be made. For example, hypothesis H0 could be "no face is present in the image" and hypothesis H1 would then be "a face is present in the image". Since noise and other unknown factors influence the observation, it is regarded as a random variable X. X can then be characterized by its probability density function under H0 and H1 respectively, denoted p_k(x), k = 0, 1. A threshold value x0 is then defined, so that hypothesis H0 is accepted if X < x0, and H1 is accepted if X > x0. The probability Q10 of choosing hypothesis H1 when H0 is actually true ("false detect") is then

Q10 = ∫_{x0}^{∞} p0(x) dx ,    [Eq. 2.6]

and the probability Q01 of choosing hypothesis H0 when H1 is true ("miss") is

Q01 = ∫_{−∞}^{x0} p1(x) dx .    [Eq. 2.7]

The value x0 to be used depends on how much these two different mistakes will cost. Assuming two values C10 and C01 to define the cost of the respective mistakes, the risk associated with hypothesis H0 can be defined as C10·Q10 (and analogously for H1). The average risk C(x0) then depends on the risks and on the a priori probabilities of H0 and H1 (ξ and 1 − ξ respectively). C(x0) is calculated as

C(x0) = ξ·C10·Q10 + (1 − ξ)·C01·Q01
      = ξ·C10 ∫_{x0}^{∞} p0(x) dx + (1 − ξ)·C01 ∫_{−∞}^{x0} p1(x) dx .    [Eq. 2.8]

Naturally, we want to choose the x0 that minimizes the risk. To find this level, Eq. 2.8 is differentiated with respect to x0 and set to zero. The result is

p1(x0) / p0(x0) = ξ·C10 / ((1 − ξ)·C01) ,    [Eq. 2.9]

from which x0 can be calculated when assuming specific probability densities p_k(x). The ratio

Λ(x) = p1(x) / p0(x)    [Eq. 2.10]

is called the likelihood ratio, and Λ0 = Λ(x0) is called the decision level. The decision regions R0 and R1 consist of the points where Λ(x) < Λ0 and Λ(x) > Λ0 respectively. This strategy is known as the Bayes solution, and the minimum value of C(x0) is called the Bayes risk. All of the above can of course be generalized to multivariate distributions; the only changes needed concern the dimensionality of x and p_k(x), and consequently the integrals in Eq. 2.6 - Eq. 2.8.
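As a concrete illustration, the short Python sketch below computes the decision level, the threshold and the resulting error probabilities for two assumed one-dimensional Gaussian densities; the particular means, variances, costs and prior used here are arbitrary example values, not values from the detector described later.

```python
import math

def gauss_cdf(x, mu, sigma):
    """Cumulative distribution function of N(mu, sigma^2)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

# Assumed example: p0 ~ N(0, 1) under H0, p1 ~ N(2, 1) under H1,
# equal costs and a prior of 0.5 for H0.
mu0, mu1, sigma = 0.0, 2.0, 1.0
xi, C10, C01 = 0.5, 1.0, 1.0

Lambda0 = (xi * C10) / ((1.0 - xi) * C01)          # decision level, Eq. 2.9
# For equal-variance Gaussians the likelihood ratio is monotone in x, so the
# threshold x0 solving p1(x0)/p0(x0) = Lambda0 has a closed form:
x0 = (mu0 + mu1) / 2.0 + sigma ** 2 * math.log(Lambda0) / (mu1 - mu0)

Q10 = 1.0 - gauss_cdf(x0, mu0, sigma)              # false detect, Eq. 2.6
Q01 = gauss_cdf(x0, mu1, sigma)                    # miss, Eq. 2.7
risk = xi * C10 * Q10 + (1.0 - xi) * C01 * Q01     # average risk, Eq. 2.8
print(f"x0 = {x0:.3f}, Q10 = {Q10:.3f}, Q01 = {Q01:.3f}, Bayes risk = {risk:.3f}")
```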

Estimating Face and Non-Face Distributions

We now return to the problem of detecting faces. In the following approach, it is assumed that the majority of the faces can be modelled by vectors in an M-dimensional subspace of the N-dimensional image space;* this subspace is called an eigenspace for faces or the face space. The face images are modelled as a stochastic process X, which outputs independent, identically distributed face image vectors. To estimate the face space, we start by collecting a training set of K ≥ M images of faces, normalized in terms of scale and position, and perform a principal component analysis (PCA) [Jol86] on this set. That is, the covariance matrix

C_X = E[(X − X̄)(X − X̄)^T] ,    [Eq. 2.11]

where X̄ is the mean face, is estimated as

Ĉ_X = 1/(K − 1) · Σ_{i=1}^{K} (x_i − x̂)(x_i − x̂)^T .    [Eq. 2.12]

The mean face x̂ is estimated as

x̂ = 1/K · Σ_{i=1}^{K} x_i .    [Eq. 2.13]

An eigenvalue decomposition

Ĉ_X = Φ_Y^T Ĉ_Y Φ_Y    [Eq. 2.14]

is then performed, giving the face space. If the faces in the training set actually do span M dimensions only, Ĉ_Y will have M non-zero rows and columns. Even if this is not the case, we create the truncated versions C_M and Φ_M of sizes M × M and N × M respectively. The rows of Φ_M are the M first principal components of the training set. These can be used as basis vectors for the principal M-dimensional subspace, which is used as an approximation of the face space. Orthogonal to this space is the (N − M)-dimensional complementary subspace. Projecting a face image x onto the face space, we get a new vector y_M according to

y_M = Φ_M (x − x̂) .    [Eq. 2.15]

*) An image of size n × m can be viewed as an n·m dimensional vector by lexicographic reordering of the pixels.

This is similar to performing a KL-transform on x; see [Loe55]. In the following, it is assumed that the probability distribution of the face process X, p1(x), can be modelled as a multivariate Gaussian distribution, with variances along the principal components according to the eigenvalues of C_X, i.e.,

p1(x) = exp(−½ (x − x̄)^T C_X^{−1} (x − x̄)) / ((2π)^N |C_X|)^{1/2}
      = [ exp(−½ y_M^T C_M^{−1} y_M) / ((2π)^M |C_M|)^{1/2} ] · [ exp(−ε²(x) / (2ρ)) / (2πρ)^{(N−M)/2} ] .    [Eq. 2.16]

Here, ε²(x) is the residual reconstruction error, defined as

ε²(x) = ||x − x̄||² − ||y_M||² ,    [Eq. 2.17]

and ρ is the average of the N − M last eigenvalues (which will have to be estimated). In the same way, a K-dimensional principal subspace for non-faces can be built, estimating Φ*_K, C*_K and ρ*, and defining z_K = Φ*_K (x − x̄). The non-face space should be built of images as similar to face images as possible, without being faces. Inserting Eq. 2.16 into Eq. 2.10, the likelihood ratio Λ(x) can be calculated. This also takes us to the concept of the Mahalanobis distance.
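A minimal numpy sketch of the estimation steps above is given below; it assumes a training set already reordered into column vectors, and the variable and function names are illustrative rather than taken from any particular implementation.

```python
import numpy as np

def estimate_face_space(training, M):
    """Estimate mean face, principal subspace and residual variance.

    training -- N x K matrix, one (lexicographically reordered) face image per column
    M        -- dimensionality of the principal subspace (assumed M < N)
    Returns (x_hat, Phi_M, lam_M, rho): mean face, M x N basis (rows are
    principal components), the M largest eigenvalues, and the average of
    the remaining eigenvalues (Eq. 2.12 - 2.14 and the estimate of rho).
    """
    N, K = training.shape
    x_hat = training.mean(axis=1, keepdims=True)                  # Eq. 2.13
    C_hat = (training - x_hat) @ (training - x_hat).T / (K - 1)   # Eq. 2.12
    lam, vecs = np.linalg.eigh(C_hat)                             # Eq. 2.14
    order = np.argsort(lam)[::-1]                                 # largest eigenvalues first
    lam, vecs = lam[order], vecs[:, order]
    Phi_M = vecs[:, :M].T                                         # rows = principal components
    rho = float(lam[M:].mean())
    return x_hat, Phi_M, lam[:M], rho

def face_log_likelihood(x, x_hat, Phi_M, lam_M, rho):
    """Log of p1(x) factored as in Eq. 2.16, up to constant terms."""
    y_M = Phi_M @ (x - x_hat)                                     # Eq. 2.15
    eps2 = float(np.sum((x - x_hat) ** 2) - np.sum(y_M ** 2))     # Eq. 2.17
    return -0.5 * float(np.sum(y_M.ravel() ** 2 / lam_M)) - eps2 / (2 * rho)
```

In practice, forming the full N × N covariance matrix is prohibitive for whole images; one would typically compute the same principal components from the K × K Gram matrix of the training set instead. The sketch keeps the direct form for clarity.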


Mahalanobis Distance

The Mahalanobis distance from a test vector x to a random vector prototype Ω is defined as

d_M(x, Ω) = (x − ω)^T C_Ω^{−1} (x − ω) ,    [Eq. 2.18]

where ω is the mean and C_Ω the covariance matrix of the random vector Ω's distribution. This means that if Ω is modelled with the best-fitting Gaussian density, then d_M(x_1, Ω) equals d_M(x_2, Ω) if and only if p_Ω(x_1) equals p_Ω(x_2). Alternatively, the normalized Mahalanobis distance, defined as

d_M(x, Ω) = −ln p_Ω(x) = ½ ( (x − ω)^T C_Ω^{−1} (x − ω) + N ln 2π + ln |C_Ω| ) ,    [Eq. 2.19]

can be used. Note that the expectation of the normalized Mahalanobis distance equals the entropy of the process. The relation to image compression is exploited in [Str98]. Naturally, this can be used for face images, to measure the Mahalanobis distance from a test image to the face space and the non-face space. In fact, it is easy to see that maximizing the likelihood ratio is the same as maximizing the Mahalanobis distance to the non-face space while minimizing the Mahalanobis distance to the face space, i.e.,

arg max Λ(x) = arg max p1(x) / p0(x)
             = arg min ( d_M(x, X) − d_M(x, Z) )
             = arg min [ ( y_M^T C_M^{−1} y_M + ε²(x)/ρ ) − ( z_K^T C*_K^{−1} z_K + ε*²(x)/ρ* ) ] .    [Eq. 2.20]

This has proven to be a good distance measure for face detection, as will be discussed next.
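As a sketch of how Eq. 2.20 can be evaluated in practice, the function below computes the combined face/non-face measure for a candidate image patch, given face and non-face subspaces estimated as in the previous subsection (mean, principal components, per-component variances and residual variance ρ); the function and parameter names are illustrative assumptions.

```python
import numpy as np

def face_score(x, face, nonface):
    """Negated Eq. 2.20: larger value means more face-like.

    x       -- candidate patch as an N x 1 column vector
    face    -- tuple (x_hat, Phi, lam, rho) for the face space
    nonface -- tuple (x_hat, Phi, lam, rho) for the non-face space
    """
    def normalized_distance(x, space):
        x_hat, Phi, lam, rho = space
        y = Phi @ (x - x_hat)                              # projection onto the subspace
        eps2 = np.sum((x - x_hat) ** 2) - np.sum(y ** 2)   # residual, Eq. 2.17
        return float(np.sum(y.ravel() ** 2 / lam) + eps2 / rho)

    # Minimizing d(face) - d(non-face) is the same as maximizing its negation.
    return normalized_distance(x, nonface) - normalized_distance(x, face)
```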

Application to Face Detection

The statistical pattern matching described above will now be combined with the stochastic search heuristic from Section 2.4. Assume a face space, a non-face space and the corresponding Φ_M, C_M, Φ*_K and C*_K. Assume also an image, from which a vector I_T(θ) is extracted. The parameter vector θ tells which part of the image to extract, and how to scale, rotate and translate it, before reshaping it to a vector. Using simulated annealing, the optimal θ is searched for, i.e., the input to the simulated annealing algorithm is the parameter space Θ of all possible θ, and the evaluation function is

f(θ) = Λ(I_T(θ)) .    [Eq. 2.21]


This method has been used within the VIDAS project, where experiments with different kinds of transformations (similarity, affine and perspective) of rectangles have been performed, yielding robust results in approximately 0.2 seconds on a Sun SPARCstation 20; see Figure 2.9 [Ker97].

Figure 2.9: Detected faces using statistical template matching and simulated annealing. Images courtesy of the VIDAS project.

During this project, an experimental version has been developed as well, using only in-plane rotation, widening/thinning, scaling and translation as transformations. This software is not as robust, and is more easily disturbed by complex backgrounds. Compared to the connectionist / neural network method described in [Vie97], the methods described here are inferior in performance and superior in computational complexity. [Vie97] reports a computation time of 20 seconds on a 133 MHz PC to locate the face area, and could thus be an interesting competitor. Setting the "miss" rate to 2%, the "false detect" rate is lower than 1%, which is impressive keeping in mind that highly rotated faces (faces in profile) are also located. The output provided does not seem to be very precise, though, giving only the approximate location of the face.

Application to Colour Classification

Naturally, this statistical method can be used on colour data as well. Using the data sets shown in Figure 2.3 as training set, the covariance matrix of skin colour can be estimated, and for each pixel in a test image, the Mahalanobis distance can be calculated. This has been done for eight images (the same images as shown in Figure 2.4), and the results are shown in Figure 2.10. From the Mahalanobis distance, p_Ω(x) has been calculated and normalized so that p_Ω(ω) is shown as white, i.e., each pixel is computed as

p(x) = exp( −(x − ω)^T C_Ω^{−1} (x − ω) / 2 ) ,    [Eq. 2.22]

where x is the 2-D vector containing the Cr and Cb values of the current pixel. The results do not look very robust; the pictures are taken with different cameras with different sensitivities to colour. To use this method, calibration of the statistics is needed for each camera.
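A small sketch of this per-pixel computation is given below, assuming the skin-colour mean and covariance have already been estimated from chrominance training data; the names are illustrative only.

```python
import numpy as np

def skin_confidence(cr, cb, mean, cov):
    """Per-pixel confidence map according to Eq. 2.22.

    cr, cb -- 2-D arrays with the Cr and Cb chrominance of each pixel
    mean   -- length-2 vector, estimated mean skin colour (Cr, Cb)
    cov    -- 2 x 2 estimated skin-colour covariance matrix
    Returns an array with values in (0, 1]; 1 at the mean colour.
    """
    d = np.stack([cr - mean[0], cb - mean[1]], axis=-1)   # per-pixel (x - omega)
    cov_inv = np.linalg.inv(cov)
    # Quadratic form (x - omega)^T C^{-1} (x - omega) for every pixel at once.
    maha = np.einsum('...i,ij,...j->...', d, cov_inv, d)
    return np.exp(-0.5 * maha)
```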


2.6 CONCLUSION

Algorithms based on colour discrimination locate probable regions containing human skin very quickly. However, the simple methods described here are quite easily fooled, and not robust enough for practical applications. Using generous discriminants, though, a large part of the image can be excluded from further search. Methods based on transformed face spaces are robust and precise, but computationally quite heavy. A directed stochastic heuristic, like simulated annealing, can speed up the search considerably. Such a method should be combined with a colour discriminator, so that parts of the image clearly not containing a face need not be searched. Connectionist methods are impressive in terms of performance, but are so far too computationally expensive.

Figure 2.10: Confidence values for face colour.


Chapter 3. FACIAL FEATURE EXTRACTION

In this chapter, methods for extracting facial features are discussed. The approach of regarding facial feature extraction as an optimization problem is discussed, as well as methods for solving this problem with reasonable complexity. A scheme combining deformable graphs, dynamic programming and statistical pattern matching is described, as well as a supporting technique for iris detection. Experimental results are presented, and finally, a way to adapt a wireframe model to the extracted feature data is presented.

3.1 INTRODUCTION

Facial feature extraction is defined as the process of locating specific points or contours in a given facial image (like mouth corners) or of aligning a face model to the image (i.e., extracting the face model parameters). In both cases, a set of two- or three-dimensional point coordinates is usually the output from the facial feature extraction. Sometimes, a set of deformations to a face model is the output, but since it is possible to express those as a set of point coordinates, we lose no generality by this definition. When discussing facial feature extraction, it is assumed that a face is present in the image, and that its approximate location is known, i.e., that the extraction is preceded by a face detection process. There is no clear definition of what constitutes a correct output. A widely used quality measure in facial feature extraction is the average error in relation to a manual feature extraction process. Naturally, this is a quite impractical quality measure to compute, since it involves manual work. Another measure to use is the distance from the extracted face to a face space. The concept of a face space is also used for face detection (see Section 2.5) and facial texture coding [Str98]. Usually the set of all face images is modelled as an M-dimensional Gaussian distribution, but as pointed out by J. Ström et al. [Str97] this is misleading since the face set is obviously non-convex. However, there is a non-linear operation called geometrical normalization that transforms the face set into a convex set. Geometrical normalization is the process of aligning a wireframe face model to an image, mapping the texture onto the model and reshaping the model into a standard shape, e.g., the average human face


shape. The resulting texture is called a normalized facial texture [Str97] or a shape-free face image [Cra87]. The geometrical normalization is dependent on the face model used and the standard shape, but also on the quality of the alignment of the model to the image. Given the model and the standard shape, the optimal alignment of the model could be defined as the one resulting in the normalized facial texture with the minimum distance to the (normalized) face space. Since a face model typically consists of a quite large number (K) of vertices, finding the optimal deformation of the model is a search in a 2K-dimensional space (since each vertex has 2 coordinates in the image). This search can be done with, for instance, a stochastic search method like simulated annealing (see Section 2.4, [Bet94, Ker97]) or, as suggested by Ström [Str98], gradient descent refining an initial manual alignment to an optimum.

Formulation of the feature extraction problem

Based on the above, the feature extraction problem can be formulated as a model adaptation problem as follows:

• Given a set of parameters ω ∈ Ω, assume a model M that produces an image I(M, ω).

• Given the model M and an image I, the optimal parameter set ω* is the one that best describes the image I, i.e.,

ω* = arg min_ω δ(I, I(M, ω)) ,    [Eq. 3.1]

where δ(·) is a discrepancy measure.

In our context, the model M is a wireframe model, and the parameter set ω consists of shape parameters of the wireframe model (i.e., vertex coordinates) and a texture to be mapped onto the wireframe (i.e., pixel values or eigenface coefficients). By defining the model such that it only produces images within the face space, ω* is the optimum described above. The discrepancy δ(·) is usually expressed as the squared Euclidean distance.

Reducing the Computational Complexity

The drawback with a global deformation (i.e., a full search in the 2K-dimensional space) is the high computational complexity, due to a) the high dimensionality of the search space and b) the computational complexity of projecting a remapped face image onto the face space. There are some different ways to reduce the complexity, as described below:

• Reduce the search space. The assumption that a wireframe model with K vertices produces a 2K-dimensional search space is true only if all possible deformations of the model are allowed. This is quite unnecessary, since the human head does not deform in all possible ways. Replacing the vertex position parameters with a limited set of deformations, the search space is reduced considerably.

• Use a heuristic search method. Instead of searching all of the space, a directed and/or stochastic heuristic search method could be used to find ω*. Even if a heuristic method does not guarantee that the global optimum is found, a local optimum delivered in time is usually preferable. There are a number of possible heuristics, like simplex, gradient descent, simulated annealing, genetic algorithms and combinations thereof; see [Tha96] for a tutorial.

• Reduce the calculation time of δ(). In the examples above, the normalized texture is projected onto a set of eigenfaces to extract eigenface coefficients (texture parameters). This computation is quite heavy, but may be approximated by projecting parts of the image only. For example, the skin on the cheeks does not contribute much compared to the eye and mouth regions [Str98]. Also, the exact placing of feature points/model vertices is more important (to the visual quality of the decoded image) in eye/mouth regions than in cheek regions. It is therefore reasonable to assume that some parts of the face can be omitted in the calculation. If projection onto an eigenspace is used, the dimensionality of the space may also be changed to improve the complexity or performance of the algorithm.

The implementation described in [Ker97] attacks the first two problems; the search space is reduced by allowing only a small set of eigendeformations (principal deformations), and simulated annealing is used to search the space. The Active Appearance Models described by T. F. Cootes et al. [Coo98] use principal appearance variations, combining shape and texture, and, moreover, train the system to utilize the relationship between the residual image and the parameter error. Later in this chapter, a scheme using deformable templates to reduce the search space and dynamic programming to efficiently search the allowed space will be described. The calculation time of δ() is reduced by omitting all but some important parts of the face area. First, a method for detecting particular facial features, i.e., iris position and size, will be described. Then, dynamic programming and deformable templates are explained, as well as how to combine them with statistical pattern matching. Finally, a scheme for model adaptation to extracted feature data is discussed.

3.2 IRIS DETECTION

This section shows how to perform iris detection under two assumptions about the irises' appearance. The method, first described by J. Daugman in [Dau93], is mostly suitable when the approximate locations of the eyes are already known. The assumptions about the irises are that a) an iris is approximately circular, and b) an iris is dark against a bright background (the white part of the eye). The algorithm below determines in a fast and accurate way how well a part of the image fulfils those assumptions.

Algorithm Description

Only three parameters are needed to describe an iris: the centre coordinates and the radius, (x0, y0, r). We define the function

f_Θ(x0, y0, r) = ∫_{θ∈Θ} I(x0 + r cos θ, y0 + r sin θ) dθ ,    [Eq. 3.2]

θ∈Θ

Extraction and Coding of Face Model Parameters

31

Chapter

3.

FACIAL FEATURE EXTRACTION

where I ( x, y ) is a graylevel image and, for example, Θ = ]0, 2π]. For a given ( x 0, y 0 ) , the most likely radius of an iris centred around ( x 0, y 0 ) , will then be the value for which ∂ f(x , y , r) ∂r 0 0

[Eq. 3.3]

is large. Consequently, (x0, y0, r) are chosen according to

(x0, y0, r) = arg max_{x0, y0, r} iris(x0, y0, r) = arg max_{x0, y0, r} ( G(r) * ∂/∂r f(x0, y0, r) ) ,    [Eq. 3.4]

where G(r) is a Gaussian convolution kernel. To achieve a fast implementation, the order of differentiation and convolution is switched, and the differentiated Gaussian is approximated as

∂/∂r G(r) ≈ G_∆r(n) = (1/∆r) G(n∆r) − (1/∆r) G((n−1)∆r) .    [Eq. 3.5]

The integral is approximated with a sum, f ( x 0, y 0, r ) ≈ f ∆θ ( x 0, y 0, r ) =

∑ I ( x0 + r cos ( m∆θ ), y 0 + r sin ( m∆θ ) )

,

[Eq. 3.6]

m

as is the convolution, iris(x 0, y 0, r) ≈ ∑ G ∆r ( n – k )f ∆θ ( x 0, y 0 , k∆r ) .

[Eq. 3.7]

k

The function iris(x0, y0, r) can be used as a confidence measure telling the likelihood of (x0, y0, r) being the appropriate position and radius of an iris. To improve the performance and decrease the computational complexity, the circular summation can be limited, i.e., the interval Θ in Eq. 3.2 can be made smaller. Typically, the integration is performed over the left and right 90° sectors, but not over the top and bottom sectors (see Figure 3.1). This is due to the fact that the iris contour is often occluded by eyelids at the top and bottom.

Figure 3.1: The sectors over which to integrate.
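A small Python sketch of this confidence measure is given below; it follows the discretization of Eq. 3.5 - 3.7 in spirit, but uses nearest-neighbour sampling instead of bilinear interpolation, and the sector choice, step sizes and helper names are illustrative assumptions rather than the exact values used here.

```python
import numpy as np

def circular_sum(img, x0, y0, r, dtheta=np.pi / 32):
    """f(x0, y0, r): sum of gray levels along the left and right 90-degree
    sectors of a circle (the top and bottom sectors are skipped, since the
    iris contour is often occluded by the eyelids there)."""
    thetas = np.concatenate([np.arange(-np.pi / 4, np.pi / 4, dtheta),
                             np.arange(3 * np.pi / 4, 5 * np.pi / 4, dtheta)])
    xs = np.clip(np.round(x0 + r * np.cos(thetas)).astype(int), 0, img.shape[1] - 1)
    ys = np.clip(np.round(y0 + r * np.sin(thetas)).astype(int), 0, img.shape[0] - 1)
    return float(img[ys, xs].sum())

def iris_confidence(img, x0, y0, radii, sigma=1.0):
    """Confidence value for each candidate radius at (x0, y0): the radial
    derivative of the circular sum, smoothed with a Gaussian over r."""
    sums = np.array([circular_sum(img, x0, y0, r) for r in radii])
    deriv = np.diff(sums)                      # finite-difference radial derivative
    n = np.arange(-3, 4)
    g = np.exp(-n ** 2 / (2 * sigma ** 2))
    g /= g.sum()
    return np.convolve(deriv, g, mode='same')  # large values: dark disc on bright surround
```

Combined with a coarse-to-fine search over (x0, y0), this gives the confidence surface that is maximized in Eq. 3.4.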

Implementation and Experimental Results

The computational complexity of the iris function seems high at first glance, but it actually requires only a few hundred multiplications per pixel (depending on the parameters ∆r and ∆θ), using bilinear interpolation of I(x, y) to calculate f_∆θ(x0, y0, r) according to Eq. 3.6. This is of the same magnitude as a matched filtering with an iris-sized filter kernel. Combining this technique with a multiresolution search and gradient ascent or simulated annealing, the iris search is fast. The implementation reported in [Dau93], using a RISC processor and images of size 640 × 480, was able to locate the irises in approximately a quarter of a second.

32

Linköping Studies in Science and Technology. Thesis No. 747

3.2

IRIS DETECTION

Also, the results are very precise, delivering coordinates and radii like the ones illustrated in Figure 3.2. However, if no constraints are given to the algorithm, it can of course find objects in the image which better fulfils the criteria than the irises do. An example of such a failure is shown in Figure 3.4, where the woman in the picture wears ear-rings with objects that fulfil the criteria very well. Obviously, the algorithm should be constrained, so that the search is performed in the face area only. Typical constraints are also iris size and relative position in the face (approximately equal distance from the vertical mid-line, and approximately equal vertical positions).

Conclusion The algorithm is very accurate, but can be fooled by synthetic objects (usually outside the face). If given constraints on the area to search, the algorithm is robust as well as fast, and also gives a very exact position and width of the irises.

Figure 3.2: Located irises, with extracted centre coordinates and radii, marked with white circles.

Figure 3.4: Erroneous iris detection.

Extraction and Coding of Face Model Parameters

33

Chapter

3.

FACIAL FEATURE EXTRACTION

3.3 DEFORMABLE LINE TEMPLATES AND DYNAMIC PROGRAMMING In this section, dynamic programming will be used to find the optimal deformation of a line template to make it fit a line-drawing or the edges of an image. First, the concept of a deformable line template will be explained, followed by a description of how dynamic programming can be used to deform the template to fit an image.

Deformable Templates

A deformable line template consists of a set of segments. The segments are ordered and connected to each other. A small displacement is allowed in each connection, thus allowing the entire template to deform; see Figure 3.5. The segments each consist of a set of elements, and are not themselves allowed to deform; see Figure 3.6. We make the following definitions:

• A deformable line template of length L is a tuple T = 〈S, ∆〉 consisting of an ordered set of segments S = { s_1, s_2, ..., s_L } and an ordered set of deformation limits ∆ = { (ξ_2, υ_2), ..., (ξ_L, υ_L) }.

• A segment s_l is an ordered set of elements, s_l = { e_l1, e_l2, ..., e_lKl }, where K_l is the length (number of elements) of the l:th segment.

• An element corresponds to a position in a local coordinate system such that e_l1 = (0, 0) for all l and e_lk = (x_lk, y_lk), where x_lk and y_lk tell the (discrete) distance from e_l1 (see Figure 3.6).

Segment 0

Segment 1

Segment 2

Figure 3.5: Left: A deformable template consisting of three segments each consisting of four elements. Middle and right: The same template deformed in two different ways. To the right, the segments are slightly overlapping.

y

Element 4 = (2,3) Element 3 = (2,2) Element 2 = (1,1)

x

Element 1 = (0,0)

Figure 3.6: A segment consisting of four elements.

34

Linköping Studies in Science and Technology. Thesis No. 747

3.3

DEFORMABLE LINE TEMPLATES AND DYNAMIC PROGRAMMING

The goal is to deform the template so that it matches a binary image, meaning that each segment should be placed so that its elements cover as many pixels set to one as possible. However, we should limit the set of possible deformations to a set of allowed deformations, that is, there Figure 3.7: A deformation typically not allowed. should be a limited gap between each segment, as well as a limited overlap. For example, the deformations in Figure 3.5 are typical allowed deformations, while the deformation in Figure 3.7 might not be allowed. These concepts are defined as follows: • A deformation of a template is an ordered set of segment positions, i.e., D = { ( x 1, y 1 ), …, ( x L, y L ) } .

[Eq. 3.8]

• An allowed deformation of a template is a deformation such that   x l – x l – 1 – x ( l – 1 )Kl – 1 ≤ ξ l   y l – y l – 1 – y ( l – 1 )Kl – 1 ≤ υ l 

[Eq. 3.9]

holds for all l ∈ { 2, …, L } . Also, for all l ∈ { 2, …, L } and k ∈ { 1, …, K l } , ( x l + x lk, y l + y lk ) should be within the range defined by the image size. • A segment match ml is the number of set pixels that the l:th segment cover, i.e., Kl

∑ I ( x l + x lk, yl + ylk ) ,

m l(T, D, I) =

[Eq. 3.10]

k=1

where I ( x, y ) is a (binary) image. The goal can now be formulated to find the allowed deformation that maximizes the sum of the segment matches, i.e., L

D∗ = arg max ∑ m l(T, D, I) D

[Eq. 3.11]

l=1

If the image I contains P pixels, there are almost P possible positions for the first segment. Each of these positions should be combined with ( 2ξ 2 + 1 ) ( 2υ2 + 1 ) positions of the second segment. This means that when combining the first two segments, P ( 2ξ 2 + 1 ) ( 2υ 2 + 1 ) combinations has to be examined, each of which gives rise to ( 2ξ 3 + 1 ) ( 2υ 3 + 1 ) combinations with the third segment, and so on. In conclusion, the number of combinations to examine is L

l

C = P ∑ ∏ ( 2ξ j + 1 ) ( 2υj + 1 )

[Eq. 3.12]

l = 2j = 2

Extraction and Coding of Face Model Parameters

35

Chapter

3.

FACIAL FEATURE EXTRACTION

For the template do be able to deform at all segments, it is needed that ( 2ξ j + 1 ) ( 2υj + 1 ) is at least 2, and a lower bound for Eq. 3.12 is L

C ≥ P∑ 2 ≥ P ⋅ 2 , l

L

[Eq. 3.13]

l=2

i.e., the number of combinations to be examined increases exponentially with the number of the segments in the template. This is of course not an attractive property, but fortunately, it is quite easy to get around. One way to do this is by dynamic programming, as will be discussed next.

Dynamic Programming

Dynamic programming is a solution to the problem of finding the shortest path through a weighted graph. In the telecommunications community, dynamic programming is better known as the Viterbi algorithm, introduced by Viterbi [Vit67] and later shown to be equivalent to dynamic programming by Omura [Omu69]. The algorithm is briefly described here; for more thorough discussions, see [Bel62, Lin83]. Use of dynamic programming for deformable line template matching is also discussed in [Gei95, Noh96, Son99]. Consider an ordered set of sites T = { s_1, s_2, ..., s_L }, each consisting of a number of nodes s_l = { n_l1, n_l2, ..., n_lN }. Each node is connected to nodes in the previous and the following site, and each connection has a cost. The cost of the connection from n_(l-1)j to n_lk is denoted c_ljk. The structure is usually illustrated as a trellis where each site corresponds to a column; see Figure 3.8. The problem is to find the cheapest path through the trellis from the first to the last site. Dijkstra's principle (or Bellman's) then says that if the cheapest path from a start node (n_1j) to an end node (n_Lk) passes through a node n, then the two parts, from n_1j to n and from n to n_Lk, are also optimal. This implies that if we want to calculate the cheapest path from site 1 to site l, we can start by calculating the cheapest paths from site 1 to the nodes in site l - 1. Call the resulting costs χ_(l-1)j, i.e., the costs of the cheapest paths from site 1 to node n_(l-1)j. The cheapest path from site 1 to a node n_lk will then pass through the node n_(l-1)j that minimizes χ_(l-1)j + c_ljk. Doing this recursively through the trellis, the cheapest path from site 1 to site L can be calculated by examining Q·L paths, where Q is the number of paths between each site.

s5

n11

n51

n15

n55

Figure 3.8: A trellis with five sites and five nodes in each site. Note that not all possible connections are illustrated.

36

Linköping Studies in Science and Technology. Thesis No. 747

3.3

DEFORMABLE LINE TEMPLATES AND DYNAMIC PROGRAMMING

To solve the deformable template problem, a site s_l is identified with a template segment s_l, and a node n_lk with a position of the l:th segment, (x_l, y_l). Consequently, there will be P nodes in each site (P being the number of pixels in the image, i.e., the number of possible segment positions). The cost c_ljk is set to the match given by the l:th segment at the position given by node n_lk. Since the goal is to find the deformation that maximizes the total match, the most expensive path is searched for (which is an equivalent problem). The number of combinations (paths) to examine will then be

C = Q · L < max_j (2ξ_j + 1)(2υ_j + 1) · P · L ,

[Eq. 3.14]

i.e., it grows linearly with L. This is quite an improvement compared to Eq. 3.13.
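The sketch below illustrates this Viterbi-style recursion for a one-dimensional chain of segments; it assumes a precomputed table match[l][p] holding the segment match of Eq. 3.10 for segment l at candidate position index p, and a precomputed list of allowed predecessor positions, so the data layout is an illustrative assumption rather than the actual implementation.

```python
def best_deformation(match, allowed):
    """Find the deformation maximizing the total match (Eq. 3.11).

    match   -- match[l][p]: match value of segment l placed at position index p
    allowed -- allowed[l][p]: non-empty list of position indices of segment l-1
               that are allowed (by Eq. 3.9) when segment l is at position p
    Returns (total, positions): best total match and one position per segment.
    """
    L = len(match)
    cost = [list(match[0])]            # cost[l][p]: best total match of segments 0..l
    back = [[None] * len(match[0])]
    for l in range(1, L):
        cost_l, back_l = [], []
        for p, m in enumerate(match[l]):
            # Bellman's principle: extend the best allowed partial path.
            prev = max(allowed[l][p], key=lambda q: cost[l - 1][q])
            cost_l.append(cost[l - 1][prev] + m)
            back_l.append(prev)
        cost.append(cost_l)
        back.append(back_l)
    # Backtrack from the best final position.
    best_p = max(range(len(cost[-1])), key=lambda q: cost[-1][q])
    total = cost[-1][best_p]
    positions = [best_p]
    for l in range(L - 1, 0, -1):
        best_p = back[l][best_p]
        positions.append(best_p)
    return total, list(reversed(positions))
```

Each segment is visited once and each of its positions examines only the allowed predecessor window, which is exactly the linear growth in L stated above.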

Experimental Results

Photographs of the people at the Image Coding Group at Linköpings universitet have been digitized and filtered for edge detection. A deformable line template with the approximate shape of a head with hair (Figure 3.9) has been created. The allowed deformation is set to one and two pixels in the x- and y-directions respectively. However, two of the 23 segments (in Figure 3.9, those segments are gray) are somewhat special in two senses: first, they are not matched to the image (i.e., m_l, see Eq. 3.10, is always set to zero for those two segments), and, second, the allowed deformation above them is extra large (5 and 10 pixels in the x- and y-directions respectively). As can be seen in Figure 3.11, the template behaves fairly well. Typical errors are shown in the bottom row, where the edge between skin and clothing suits the template better than the chin, and (in one case) where the template fits to the hairline instead of the headline. This type of error is probably not possible to avoid without using additional techniques to, for instance, locate the eyes and/or the mouth. Using this technique to find the shape of the head is of course only one of many possible examples. Using differently shaped line templates, a variety of features can be localized, for example eyes/mouth or glasses, by using templates as in Figure 3.10.

Figure 3.9: A head template.

Figure 3.10: Using deformable line templates to locate different

features like eyebrows, eyes and mouth (left) or glasses (right).

Extraction and Coding of Face Model Parameters

37

Chapter

3.

FACIAL FEATURE EXTRACTION

Template

Figure 3.11: The head template and its deformation to fit

images. Typical failures are shown in the bottom row.

The problem is that the method is very dependent on the input data; it is not more robust than the edge detection. It is, for instance, usual to get results as in Figure 3.12, where the eye and eyebrow segments are well positioned, but the mouth segments is attracted to the nose (The template used is the same as in Figure 3.10, left). This error is because the template tries to match to any kind of edges without discriminating between “mouthedges” and “nose-edges”. Also, edges outside the face may Figure 3.12: Erroneous attract the template as well as edges within the face, which can facial feature location. be a problem when the background is more complex than in the images shown here. In the next section, a way to increase the robustness will be discussed.

Conclusion Deformable line templates is a useful tool for feature extraction. Combined with dynamic programming it is also of reasonably low computational complexity. The major drawback is that the method is heavily dependent on the pre-processing of the image, such as the edge detection and the location of the search area in which to extract the features (i.e., the location of the area containing the face).

38

Linköping Studies in Science and Technology. Thesis No. 747

3.4

COMBINING DEFORMABLE TEMPLATES AND STATISTICAL PATTERN MATCHING

3.4 COMBINING DEFORMABLE TEMPLATES AND STATISTICAL PATTERN MATCHING As mentioned, the method using deformable line templates described in Section 3.3 is very sensitive to the quality of the edge detection. A solution to this is to let each segment attract to different features in the face. For instance, the four segments covering the left eye in Figure 3.10 could be replaced by one segment of length one, where the only element contributes to the segment match according to some criterion specific to the left eye. To exemplify, this criterion could be a matched filtering using an eye template. Results of deforming a four-segment template where each segment’s starting point is attracted to a matched filter response is shown in Figure 3.13. However, matched filtering will not always suit our needs, since the features searched for are variable (see Section 2.3 - 2.5). Therefore, some different matching and distance measures will now be discussed. Then, some other useful attractors will be discussed, and finally all is put together to a unified scheme for facial feature extraction. The only modification needed to the deformable line template scheme described in Section 3.3, is an alternative segment match (Eq. 3.10). It is redefined as ml(T, D, I, Φ) = m T ( I T ( x l, y l ), Φ ) ,

[Eq. 3.15]

where I T ( x, y ) is the image in the neighbourhood of ( x, y ) and Φ is an image template. The function m T is some kind of matching/distance measure, as is discussed next.

Left eye Right eye

Mouth

Chin

Image templates

Deformable template

Result

Figure 3.13: Using image templates as attractors for starting points of the segments in a deformable template.

Matching and Distance Measures for Image Templates Using a deformable template, the question is, what kind of attractors should be used? In the example above, matched filtering (or deterministic template matching, see Section 2.3) is used, where the filter templates equal the averages of the corresponding parts of face images in a data base. Some different matching and distance measures that can be used are:

Extraction and Coding of Face Model Parameters

39

Chapter

3.

FACIAL FEATURE EXTRACTION

• Deterministic template matching (matched filtering). • Normalized deterministic template matching. • Distance from feature space. • Distance in feature space. • Mahalanobis distance. For all those measures, an image template to be matched to a part of the image is assumed. The image part is expressed as an N-dimensional column vector x. • Deterministic template matching: The simplest thing to do is deterministic template matching, i.e., m D(x, t) = x T t ,

[Eq. 3.16]

where t is a template vector. This is the fastest way to do the match, and it requires only N multiplications. • Normalized deterministic template matching: Usually, it is preferred to remove the DC-component of the test vector x as well as the template vector t. Also, scaling the vectors to norm one is preferable; both these actions decreases the dependency of luminance conditions. The result is the normalized deterministic template matching, i.e., ( x – x )T ( t – t ) T m D(x, t ) = m D(x, t) = x t = ---------------- -------------- . x–x t–t

[Eq. 3.17]

mD will be in the range [– 1,1] , where one indicates a perfect match (This is equivalent to Eq. 2.4).* This measure is more secure than the match above ( m D ), but it is not as fast; 2N multiplications and one square root computation has to be made (calculating mD2 instead eliminates the need of the square root). • Distance from feature space: Typically, the templates used in deterministic template matching are the average of a set of training vectors. For example, the mouth template in Figure 3.13 is the average mouth from one hundred face images. However, the appearance of features to detect is usually quite varying, making a statistical measure more useful than a deterministic one. Thus, the fact that the variance is concentrated to a low-dimensional subspace of the N-dimensional vector space is used. A simple way is to collect a training set of template vectors and perform a principal component analysis (PCA, see Section 2.5) and estimate the M dimensional feature space, i.e., the space spanned by the M first principal components N

1 T *) x denotes the DC level of x as a vector, i.e., x = ---- [ 1, … , 1 ] N

∑ xi i=1

40

Linköping Studies in Science and Technology. Thesis No. 747

3.4

COMBINING DEFORMABLE TEMPLATES AND STATISTICAL PATTERN MATCHING

ϕ 1, …, ϕ M . Then, the distance from feature space (DFFS) can be calculated as the reconstruction error when trying to express x as a linear combination of the principal components. This is expressed as d FFS(x, Φ) = x – ΦΦ x T

2

= x

2

– Φ x T

2

= x ( I – ΦΦ )x , T

T

[Eq. 3.18]

where Φ = [ ϕ 1, …, ϕM ] and ϕ i = 1 . An alternative measure is how large part of the signal energy that is within the feature space as a feature space match: m FS(x, Φ) = Φ x T

2

= x ΦΦ x = 1 – dFFS(x, Φ) . T

T

[Eq. 3.19]

Calculating m FS or d FFS requires slightly more than N ⋅ M multiplications*. • Distance in feature space: If M is large, almost all (or all, if M is large enough, i.e., M = N ) vectors will be reconstructible by ϕ 1, …, ϕM , i.e., d FFS will always be small and m FS close to one. This will be the case if the training set is large, and the training vectors are noisy or the feature space has high dimensionality. The sequence of ϕ ’s can then either be truncated, thus regarding the feature space as low-dimensional, or the distance in feature space (DIFS) can be measured. The test vector x is then projected on each of the ϕ ’s, and the results are divided with the variance of the training set in the direction of the corresponding ϕ , i.e., d IFS(x, Φ) =

ϕi x T –1 T - = x ΦΣ M Φ x , ∑i = 1 -------2 σi M

T

[Eq. 3.20]

where Σ is a M × M diagonal matrix with the variances in the respective directions. The computational complexity is proportional to N 2 . • Mahalanobis distance: If M = N , i.e., the feature space is of the same dimensionality as the vector x itself, then the DIFS in Eq. 3.20 equals the Mahalanobis distance, defined as, d M(x, Φ) =

N

ϕ x T

i ∑i = 1 -------σ i2

= x ΦΣ Φ x = x C x T

–1

T

T

–1

[Eq. 3.21]

where C is the covariance matrix of the training set (which is supposed to have zero mean). If M < N the Mahalanobis distance can be estimated as the sum of the DIFS and the normalized DFFS, d FFS -, d FFS = --------ρ

[Eq. 3.22]

T

*) If the matrix ΦΦ is sparse, the complexity can be reduced.

Extraction and Coding of Face Model Parameters

41

Chapter

3.

FACIAL FEATURE EXTRACTION

i.e., the DFFS divided by the average variance outside the feature space. The latter sounds like a contradiction, since the variance of the feature vectors outside feature space should be zero. However, if the feature space is supposed to be Ndimensional, but only the principal M dimensions can be estimated (for instance, when the training set spans only M dimensions), the DFFS will have to be used for estimating the Mahalanobis distance in the last N – M dimensions. The average of the variances outside the feature space can then only be guessed (within some limits). The expression for the estimated Mahalanobis distance is then d M(x, Φ) = d IFS(x, Φ) + d FFS(x, Φ)

[Eq. 3.23]

T ϕ i x dFFS(x, Φ) -------∑i = 1 σ i2- + -----------------------ρ M

=

1 T T T –1 T = x ΦΣ M Φ x + --- x ( I – ΦΦ )x ρ 1 T T –1 T = x  ΦΣ M Φ + --- ( I – ΦΦ ) x ρ which can be computed in O ( N 2 ) time (the matrix within parenthesis can be computed in advance). It is still assumed that the average feature vector (and consequently the training set) has zero mean. If this is not the case, the average should be subtracted before calculating the Mahalanobis distance, as well as from the training set.

Choosing Attractor

How does one choose which distance/matching measures to use as attractors for the deformable template? As discussed earlier (Section 3.1), a suitable quality measure of a facial feature extraction is how well the remapped facial image fits into a face space. This is however not fully clear; does it mean that the Mahalanobis distance or the DFFS for the full face should be minimized? Depending on what definition we choose, we should make the choice of attractor accordingly. However, there are other criteria applying to the attractors as well. First, they should be of low computational complexity. Second, they should be robust. A very robust attractor, always guiding the template to almost the optimal deformation, is preferable to a very accurate attractor that fails now and then. In conclusion, experiments have to be made to find out which is the best attractor to use in the specific application. If non-normalized deterministic template matching is the only one that can be computed fast enough, then there is, of course, no choice. The following observations should also be noted:

• The Mahalanobis distance is optimal if the distribution of the face images is Gaussian.

42

Linköping Studies in Science and Technology. Thesis No. 747

3.4

COMBINING DEFORMABLE TEMPLATES AND STATISTICAL PATTERN MATCHING

• For non-Gaussian and/or non-convex processes, DFFS probably gives the best result.

Other Relevant Attractors We are not limited to the distance / matching measures described above. A powerful attractor to be used is the iris-detecting function described in Section 3.2. This is easily combined with the matching measures above, for instance by adding the iris confidence value to the match in a specific position, i.e., m ( x, y ) = α ⋅ m T(I T ( x, y ), Φ) + β ⋅ max ( iris ( x, y, r ) ) , r

[Eq. 3.24]

where mT can be replaced by any of the matching measures described above, I ( x, y ) is the image in the neighbourhood of ( x, y ) (reshaped to an N-dimensional vector), and iris ( x, y, r ) is given by Eq. 3.7. The constants α and β are used for weighting the attractors. To reduce the computing time, the iris-function should be called only when a reasonably high confidence of the present image neighbourhood being an eye is given by the matching (alternatively, the template matching is called only when the iris-function gives a reasonably high value). Another, very simple, attractor is the luminance, or, rather, darkness, of the image. Nostrils, irises, mouth corners and eyebrows are supposedly darker than their surroundings, and points in the deformable template corresponding to those features can be more securely attracted by removing the luminance value from the match, i.e., m ( x, y ) = α ⋅ m T(I T ( x, y ), Φ) + β ⋅ max ( iris ( x, y, r ) ) – γ ⋅ I ( x, y ) r

[Eq. 3.25]
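A sketch of such a combined attractor is given below; it assumes that the feature space match of Eq. 3.19, the iris confidence from Section 3.2 and the pixel luminance are available, and the weights and names are illustrative assumptions.

```python
import numpy as np

def combined_attractor(patch, Phi, iris_conf, luminance,
                       alpha=1.0, beta=1.0, gamma=0.5):
    """Attractor value m(x, y) in the spirit of Eq. 3.25.

    patch     -- image neighbourhood of (x, y), reshaped to an N-vector
    Phi       -- M x N matrix of principal components for the feature
    iris_conf -- best iris confidence at (x, y) over the candidate radii
    luminance -- gray level I(x, y); dark features are rewarded
    """
    v = patch / (np.linalg.norm(patch) + 1e-12)   # normalize the test vector
    m_fs = float(np.sum((Phi @ v) ** 2))          # feature space match, Eq. 3.19
    return alpha * m_fs + beta * iris_conf - gamma * luminance
```

As noted above, the iris term would in practice only be evaluated where the template match already gives a reasonably high confidence, to keep the computing time down.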

Putting Things Together We can now choose from the matching measures above, as well as image gradients (edges) and darkness, to use as attractors. A complex deformable template is illustrated in Figure 3.15. Line segments are attracted to edges along the chin and eyebrows, and template matching is used to attract chin, mouth corner points, nostrils and eyes. By first letting a face detection algorithm (using statistical template matching and simulated annealing, see Sections 2.4 and 2.5) locate the face, and then letting the dynamic programming algorithm find the optimum deformation of the deformable template, the face and its important features are located. It should be stressed that even though the deformable template does not have a fixed size, it can still not deform to any size. If the face is more than, say, 20%, larger or smaller than the original size of the template, it is not likely that it will deform correctly. Another problem is to handle rotation; the current formulation of a deformable template does not allow large rotations, and neither will the matching measures. To solve these problems, a face detection algorithm must first be applied to the image. The face detection algorithm should find the face area, and remap it into an image of suitable size where the face is upright. This can be done by the VIDAS face detection algorithm

Extraction and Coding of Face Model Parameters

43

Chapter

3.

FACIAL FEATURE EXTRACTION

described in Chapter 2; see Figure 2.9, from which the marked areas are remapped to rectangles of a size suitable for the deformable template.

Experimental Results

Experiments have been performed on image sequences from the VIDAS project, standard test image sequences (Miss America, Claire) and images from the face database at Linköpings universitet. The deformable template used follows the description in Figure 3.15, with the addition that nostrils, mouth corners and eyebrows are also attracted to low luminance. The matching measure used is the feature space match, m_FS(x, Φ), according to Eq. 3.19. The dimensionality of the feature space is set to a number between one and five, depending on the feature and the speed requirements. The dependence on the feature is based on experimental results; while a five-dimensional feature space for the eyes results in better performance than a one- or two-dimensional feature space, the corresponding improvement is insignificant for the nostrils. Probably those dependencies are correlated to the rate at which the variances along the principal components decrease, but this has not yet been investigated. The face detection algorithm used is a simplified version of the VIDAS face detection scheme described in Chapter 2. Deformations other than scaling, widening / thinning of the face and rotation around the z-axis (in steps of five degrees) are not checked. The extracted face area is remapped into a rectangle of size 108 × 154 pixels, on which the deformable template is applied. The overall results are promising. The algorithm manages to output results such as the one shown in Figure 3.16. The accuracy of nostril, eye centre, and mouth corner locations is good, while the eyebrow detection is of lower quality. The accuracy of the chin contour varies along the chin; to the left and right, where the edges are usually strong, the accuracy is good. In the middle, though, there are often edges in the face above the actual contour. Consequently, the chin contour is sometimes placed too high, typically as in Figure 3.16.

Problems

There are a few problems associated with this method, as will be accounted for here:

• The deformable template is not rotation invariant, and can therefore deliver erroneous results in the case of a tilted face. The solution adopted here is a face detection algorithm that also detects the tilt angle of the face and remaps the face area into a new image, where the face is upright (as in Figure 3.16). This solution makes the method very sensitive to errors in the face detection step; Figure 3.14 illustrates the case when the face is upright but the face detection algorithm has, erroneously, reported a tilted face.

Figure 3.14: Erroneous feature extraction due to erroneous face tilt angle detection.

44

Linköping Studies in Science and Technology. Thesis No. 747

3.4

COMBINING DEFORMABLE TEMPLATES AND STATISTICAL PATTERN MATCHING

h h

g

g e e

f

d

f

d

c b

c b

a

a

Segment endpoint

Filters for template matching (average vector or first principal component)

Segment attracted to edges Segment without attractor x

Segment where the starting point uses template x as attractor Figure 3.15: A deformable template using combined attractors for attracting

line segments to edges and points to specific locations using template matching.

Figure 3.16: Detected face and extracted facial features based on a deformable template

which uses edges, template matchings and an iris-detecting function as attractors.

Extraction and Coding of Face Model Parameters

45

Chapter

3.

FACIAL FEATURE EXTRACTION

• The inherent one-dimensionality of the deformable template does not suit the human face very well. Following an edge, like the chin border, works fine, but problems can occur when, for example, the face area is tall and narrow. Stretching the template vertically might force the right eye to be positioned higher than the left (and analogously for the mouth corners and nostrils)! The solution is another type of template structure, which will be discussed in the next section.

• As long as the template structure is cycle-free, the Viterbi algorithm guarantees that the provided solution is optimal. However, if cycles are introduced, as in the eye/eyebrow part of the template in Figure 3.15, this is not the case. Actually, there are also hidden cycles in the structure, e.g., connecting the chin eigenmatching site with the chin borderline. There are iterative, heuristic solutions in the literature, used for decoding of error-correcting codes. Since such solutions increase the computing time, we will instead try to eliminate the need for cycles using the structure described in the next section.

3.5 DEFORMABLE GRAPHS

Inspired by [Wib96], a new type of deformable template, here called a deformable graph, has been implemented. The deformable graph does not have the intrinsic one-dimensional structure of the deformable template. It is, though, equally hard to deform optimally if cycles are introduced. A simple graph with six sites is shown in Figure 3.17. The graph could also be illustrated as an extended trellis, see Figure 3.18; obviously this way of illustration is quite cumbersome, and will not be used further on. Although the correspondence between the trellis and the deformable graph is in principle the same as earlier, the definitions can be simplified as follows:

• A graph is a tuple G = 〈S, A〉 that consists of a set of sites S = { s_1, s_2, ..., s_n } and a set of directed arcs A = { a_1, a_2, ..., a_m }.

Figure 3.17: A simple graph.

• A site is a tuple si = 〈N, m〉, where N is an (2-D) array of nodes, and m is an attractor or matching function m = m ( x, y ) according to Eq. 3.24. • An arc is a tuple ei = 〈 s j, s k, x, y, ξ, υ 〉, where sj and sk are the two sites connected by the arc, x and y are the default distances between sj and sk in the image, and ξ and υ are the maximum deviations from these distances. The arcs are directed, i.e., by swapping sj and sk a different arc is created. sj is said to be the parent of sk, and sk to be a child of sj.

Figure 3.18: The graph illustrated

• A node ni = 〈x, y, v〉 corresponds to a position in as an extended trellis. Each site is the associated image and has a value v given by here illustrated as a column of nodes; in our application, it the site’s attractor at that position. should be a 2-D array.

46

Linköping Studies in Science and Technology. Thesis No. 747

3.5

DEFORMABLE GRAPHS

• An evaluated graph is a graph where the attractors have been computed and the nodes’ values set accordingly. In Figure 3.19, site s1 is a child of s2. This relationship indicate the order of the dynamic programming; a site cannot be processed until all its children are, and s4, the top site, must consequently be processed last. A site may have only one parent but several children. If the values given by the attractors are given in terms of likelihood ratios, the deformable graph is a special case of a Bayesian network [Jen96].

s5

s1

s2

s4

s3

s6

Figure 3.19: The graph

with directed arcs.

Computation

Suitably, the processing order of the sites is calculated first. In Figure 3.19, the order of computation could, e.g., be 1, 3, 5, 6, 2, 4 or 3, 1, 2, 6, 5, 4. Then, limits for the position of each site can be computed. For example, if the arc from site s2 to site s1 indicates an x-distance of 15 pixels, with a maximum deviation of 5 pixels, then the position of s2 is limited to x-values greater than or equal to 10. By first inferring the limits from each site to its parent site (in computation order), and then (in reverse computation order) from each site to its children, the correct limits are calculated. When the limits are known, nodes outside the limits can be discarded, and the graph evaluated. The dynamic programming algorithm is then applied in the order of computation, as follows:

• For each node n_i in the current site s_j: for each child s_k, find the node n_l, within the area defined by the arc connecting s_j and s_k, with the maximum value. Add this value to the value of n_i, and remember n_l as a child node of n_i.

When all sites are processed, the deformation of the graph can be read as follows:

• Find the node with maximum value in the top site. Output its coordinates and mark its child nodes.

• For each site, in reverse order of computation, find the marked node, output its coordinates, and mark its child nodes, if any.
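A condensed sketch of this bottom-up pass for a tree-structured (cycle-free) graph is given below; node values are assumed to be precomputed by the attractors, and the data structures and names are illustrative assumptions rather than the actual implementation.

```python
def deform_graph(sites, children, values, window):
    """Dynamic programming over a tree-shaped deformable graph.

    sites    -- site indices in order of computation (children before parents)
    children -- children[s]: list of child sites of s
    values   -- values[s][p]: attractor value of site s at position index p
    window   -- window[s][c][p]: non-empty list of allowed position indices
                of child c when its parent s is at position index p
    Returns best[s][p], the accumulated value of the subtree rooted at s with
    s at p, and pick[s][c][p], the chosen child position for backtracking.
    """
    best, pick = {}, {}
    for s in sites:                               # children are processed before s
        best[s] = list(values[s])
        pick[s] = {c: [None] * len(values[s]) for c in children[s]}
        for p in range(len(values[s])):
            for c in children[s]:
                q = max(window[s][c][p], key=lambda i: best[c][i])
                best[s][p] += best[c][q]
                pick[s][c][p] = q
    return best, pick
```

The deformation is then read out from the top site downwards, exactly as in the two read-out steps listed above.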

A Deformable Graph for Facial Feature Extraction

The graph illustrated in Figure 3.20 has been designed for facial feature extraction. The sites attracted to edges have a matching criterion quite similar to a segment of the deformable template in Section 3.3. However, both horizontal and vertical edge detections are performed, and weighted together at each site according to the direction of the edge. The sites with statistical pattern matching attractors are identical to the sites in the deformable template used earlier (see Figure 3.15). There are also a few sites without attractors, used for keeping the other sites together and the mid-face line straight. The limits for the position of each site are calculated as described above. The results are illustrated in Figure 3.21 (with a detected and remapped face in the background).

Extraction and Coding of Face Model Parameters

47

Chapter

3.

FACIAL FEATURE EXTRACTION

Site with edge attractor Site with statistical template attractor Site without attractor

The graph suitably deformed to a face image.

Figure 3.20: A deformable graph for facial feature extraction.

Figure 3.21: Left: The limits for the mid-face sites. Middle: The limits for the remaining sites

with statistical pattern matching attractors. Right: The limits for the sites attracted to edges.

Experimental Results

As test data, some of the publicly available images from the M2VTS database [M2V] as well as the classic image sequences "Claire" and "Miss America" have been used. All images are of high quality, and the graph deforms well, as shown in Figure 3.22. Tests have also been performed on (subsampled) images from the face database at Linköpings universitet. The images are of lesser quality, as are the results, shown in Figure 3.23. All six images in the bottom row have critical errors. The first three, images a - c, are due to an error in the graph; the graph has been programmed not to allow wider chins. This error is easy to correct, but was unfortunately discovered too late. The two images d and e are worse; the cause is so far unknown. Image f represents an important problem yet to be solved: how to handle partial occlusions of the face (in this case by a moustache). In all images, the face has first been detected by the experimental face detection algorithm described in Section 2.5. The graph deformation algorithm runs in a few seconds on a desktop computer; the computing times are reported in Table 1. Most of the time, approximately 80 %, is used by the statistical pattern matching, and the remainder by the dynamic programming.


Table 1: Computing times for graph deformation.

Computer     CPU, CPU / memory freq        Time
Sun Ultra    UltraSPARC, 168 / 84 MHz      1.5 s
Sun SS 20    SuperSPARC, 75 / 50 MHz       3 s
Sun 4        SuperSPARC, 50 / 40 MHz       4 s

Figure 3.22: Results of feature extraction using a deformable graph on images from the M2VTS database (the top four) and the standard test sequences “Claire” and “Miss America”.


Figure 3.23: Results of feature extraction using a deformable graph on images from the Linköping University face database. Images with critical errors have been placed in the bottom row; in images a - c the cause is a too limited deformability of the graph, in images d and e the cause is unknown, and in image f the large moustache causes trouble. The images are slightly cropped.


3.6 ADAPTING A WIREFRAME MODEL TO EXTRACTED FACIAL FEATURES

To be able to perform model-based coding, a face model – a wireframe – is usually adapted to the image. This section discusses why this is done, describes a wireframe model to be used, and, finally, treats a method for adapting the wireframe to extracted facial features as described in the earlier sections.

So, why adapt a model? There are several reasons:

• A face model will be used in the decoder. By using the same model in the encoder, the decoded image is predictable and model-based coding (including transmission of a residual image) as defined in Chapter 1 is made possible.

• Even though it is possible to transmit the coordinates of facial features only, and thus animate a face at the receiving end, there is still reason to fit a model at the transmitting side. This is because there might be errors in the extracted feature data; for example, a detected nose might move slightly, even though it does not move in the image. If a face model is fitted to the extracted feature data, the deformability of the model may be limited, so that, for example, the head shape is constant through the sequence. Also, the deformability can be limited so that highly unlikely feature motion data will be ignored.

• The texture of the face should be transmitted as well. By adapting the model and remapping the image into its surfaces, an unambiguous way of representing the texture is achieved.

Many different face models exist, varying significantly in complexity. One of the earliest models is the CANDIDE model, developed at Linköping University in the eighties [Ryd87]. CANDIDE is a relatively simple model, see Figure 3.24, consisting of approximately 100 triangles, and even though more complex models are preferred by face animators and computer graphics people, CANDIDE will suit our needs here. CANDIDE is usually animated by Action Units [Hjo69, Ekm77], where each Action Unit describes a deformation that a human face is able to perform. In CANDIDE, Action Units are implemented as a set of vertex displacements in a local coordinate system. A simple example is Action Unit 1, the “Inner brow raiser”, as shown in Table 2. Globally, the model is transformed by translation, scaling and rotation in three dimensions.

Figure 3.24: CANDIDE.

Table 2: AU1 Inner brow raiser.

Vertex    17     50     15     48
X        -10     10    -10     10
Y         30     30
Z         -4     -4
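As a rough illustration of how Action Units of this kind could be applied to a wireframe, the sketch below adds activation-weighted vertex displacements to a copy of the model. The array layout, the function name and the example displacement values are assumptions made for illustration, not the actual CANDIDE data or implementation.

import numpy as np

# Sketch: apply Action Units, stored as sparse vertex displacements, to a
# wireframe model. Layout, names and values are illustrative assumptions.

def apply_action_units(vertices, action_units, activations):
    """vertices: (V, 3) array of model coordinates.
    action_units: list of dicts {vertex_index: (dx, dy, dz)}.
    activations: one activation level per Action Unit (0 = neutral, 1 = full)."""
    deformed = vertices.copy()
    for au, alpha in zip(action_units, activations):
        for v_idx, disp in au.items():
            deformed[v_idx] += alpha * np.asarray(disp, dtype=float)
    return deformed

# Example with placeholder displacement values (not the real AU1 table data):
au_inner_brow = {17: (-10.0, 30.0, -4.0), 50: (10.0, 30.0, -4.0)}
model = np.zeros((100, 3))                    # placeholder model, illustrative size
animated = apply_action_units(model, [au_inner_brow], [0.5])   # half activation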


Adapting the Model

A model like CANDIDE can be completely described by a set of vertices, a set of surfaces*, and a set of deformation units. A vertex is described by a set of three coordinates telling the position of the vertex in a local 3-D coordinate system, and a surface is described by three integers telling which vertices to connect in the triangle that constitutes the surface. A deformation unit is either an Action Unit (as in Table 2) or a shape deformation, like changing the distance between the eyes. Shape deformations are only used when the model is fitted to a face, while Action Units are used to animate the face.

Given the vertices of a model as a vector m and a set of feature point coordinates from an image, e.g., the ones from Figure 3.16, the goal of the adaptation is to find the deformed model m* that fulfils

$\min \| m^* - f \|^2$,   [Eq. 3.26]

where f is a vector containing the feature point coordinates. The simplest way to find m* is to set each vertex's coordinates to the coordinates given by the corresponding feature point. This method has several drawbacks, though:

1. If only 2-D coordinates are given by the feature points, only two dimensions of the vertex coordinates will be changed.

2. Vertices not corresponding to coordinates in the vector f have to be interpolated.

3. If the feature point extraction process is not fully reliable, the model might be adapted in a very strange way. For example, the distance between the eyes may change during a sequence.

The solution is to allow only a limited set of deformations of the model, and to solve Eq. 3.26 under this constraint. The set of allowed deformations should include global motion (translation, scaling, rotation) as well as possible local motion, typically Action Units. Also, for the first frame in a sequence, Shape Units should be allowed, permitting the model to be adapted to the static shape of the face. In this way, m* can be expressed as

$m^* = RS\left( m + \sum_i \alpha_i a_i + \sum_i \delta_i d_i \right) + t$,   [Eq. 3.27]

where S and R are scaling and rotation matrices respectively, and t is a translation vector. The constant vectors a_i and d_i are Action Units and Shape Units respectively. To simplify, scaled parameters are sought for, i.e., the Action and Shape Units are applied to the scaled model:

$m^* = R\left( Sm + \sum_i \alpha_i a_i + \sum_i \delta_i d_i \right) + t$   [Eq. 3.28]

*) Usually, the word “faces” is used. To avoid confusion, the word “surfaces” is used here.


Figure 3.25: CANDIDE adapted to the feature data according to Figure 3.13. In the left image, the allowed deformations were scaling, translation and rotation around the z-axis. In the right image, the distance between the eyes and the distances eyes-nose and nose-mouth were also allowed to change.

For small rotations, the rotation can be approximated as follows:*

$RM = (R_x R_y R_z) M \approx (I + \rho_1 R_1)(I + \rho_2 R_2)(I + \rho_3 R_3) M \approx (\rho_1 R_1 + \rho_2 R_2 + \rho_3 R_3 + I) M$   [Eq. 3.29]

R_1, R_2 and R_3 are matrices for small rotations around the x, y and z-axes respectively. Finally, m* is approximated as

$m^* \approx \hat{m}^* = \sum_i (\sigma_i S_i) m + \sum_i \alpha_i a_i + \sum_i \delta_i d_i + \sum_i (\rho_i R_i) m + \sum_i \tau_i t_i$,   [Eq. 3.30]

where also the Action and Shape Units have been approximated as invariant to rotation. This is a quite dangerous step, but it is useful as long as the rotations stay small. Also, the rotation around the z-axis is often given by the face detection and/or the facial feature extraction algorithms, and the Action and Shape Units can be rotated accordingly.

Eq. 3.30 can then be rewritten as a matrix/vector multiplication $\hat{m}^* = Ap$, where A defines the allowed deformations/motion, i.e.,

$A = [S_1 m, S_2 m, S_3 m, a_1, \ldots, a_a, d_1, \ldots, d_d, R_1 m, R_2 m, R_3 m, t_1, t_2, t_3]$,   [Eq. 3.31]

and p is the parameter vector according to

$p = [\sigma_1, \sigma_2, \sigma_3, \alpha_1, \ldots, \alpha_a, \delta_1, \ldots, \delta_d, \rho_1, \rho_2, \rho_3, \tau_1, \tau_2, \tau_3]^T$.   [Eq. 3.32]

*) This is due to the fact that a rotation matrix around the x-axis can be approximated as

$\begin{pmatrix} 1 & 0 & 0 \\ 0 & \cos\rho & \sin\rho \\ 0 & -\sin\rho & \cos\rho \end{pmatrix} \approx \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & \rho \\ 0 & -\rho & 1 \end{pmatrix} = I + \rho \begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & -1 & 0 \end{pmatrix}$


Consequently, Eq. 3.26 can be solved as follows:

$\min \| m^* - f \|^2 \approx \min \| Ap - f \|^2$   [Eq. 3.33]

$p = (A^T A)^{-1} A^T f$   [Eq. 3.34]

The results of two such solutions are shown in Figure 3.25. Although this is a very direct method, a more practical way is to iterate in a scheme which alternates between global and local refinement. In such a scheme, a solution using only the global parameters is first searched for, i.e.,

$A' = [S_1 m, S_2 m, S_3 m, R_1 m, R_2 m, R_3 m, t_1, t_2, t_3]$,
$p' = [\sigma_1, \sigma_2, \sigma_3, \rho_1, \rho_2, \rho_3, \tau_1, \tau_2, \tau_3]^T$.   [Eq. 3.35]

Then, the local refinement is performed using

$A'' = [a_1, \ldots, a_a, d_1, \ldots, d_d]$,
$p'' = [\alpha_1, \ldots, \alpha_a, \delta_1, \ldots, \delta_d]^T$   [Eq. 3.36]

for the first frame and

$A''' = [a_1, \ldots, a_a]$,
$p''' = [\alpha_1, \ldots, \alpha_a]^T$   [Eq. 3.37]

for the following frames, thus allowing the face to reshape only initially. For each frame, the global and local adaptation are alternated in a few iterations. A more thorough discussion on tracking facial features through an image sequence can be found in, e.g., [Li93].

When the face model is adapted to the image, the face model parameters can be transmitted over the channel; i.e., p', p'', and the facial texture are transmitted for the first frame, and p', p''', and (possibly) a residual image for the following. Alternatively, [δ_1, …, δ_d] are converted to a set of FDP points and p''' to a set of FAPs (see Section 1.5) to conform to the MPEG-4 standard. As will be shown in the next chapter, there are, however, reasons to believe that the set of Action Units {a_i} gives a more compact representation than the FAPs.
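As an illustration of the least-squares solution and the alternating global/local refinement described above, the following is a minimal numerical sketch; the column contents, the residual-based alternation and all names are assumptions for illustration, not the actual fitting code used for CANDIDE.

import numpy as np

# Sketch of the constrained adaptation (Eqs. 3.33 - 3.37): the allowed
# deformation/motion vectors are stacked as columns of A, and the parameter
# vector p is found in the least-squares sense. The residual-based alternation
# between global and local parameters is an illustrative assumption.

def solve_lsq(A, f):
    # p = (A^T A)^{-1} A^T f, computed via lstsq for numerical stability
    p, *_ = np.linalg.lstsq(A, f, rcond=None)
    return p

def adapt_model(A_global, A_local, f, iterations=3):
    """A_global: columns S_i m, R_i m, t_i; A_local: columns a_i (and d_i
    for the first frame); f: stacked feature point coordinates."""
    p_g = np.zeros(A_global.shape[1])
    p_l = np.zeros(A_local.shape[1])
    for _ in range(iterations):
        p_g = solve_lsq(A_global, f - A_local @ p_l)   # global refinement
        p_l = solve_lsq(A_local, f - A_global @ p_g)   # local refinement
    return p_g, p_l

# Usage with random placeholder data (3 global + 2 local deformation vectors):
f = np.random.rand(30)
A_global = np.random.rand(30, 3)
A_local = np.random.rand(30, 2)
p_global, p_local = adapt_model(A_global, A_local, f)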

3.7 CONCLUSION

In this chapter, the feature extraction problem has been presented as a multidimensional optimization problem. A method that works in near real-time has been implemented, showing that it is possible to use such methods for the initial feature extraction in a model-based coding scheme.


Chapter 4. FACE MODEL PARAMETER CODING

In this chapter, the three different kinds of face model parameters that need to be transmitted in a model-based coding scheme – shape, texture, and animation – are discussed. A scheme for efficient coding of local motion parameters will be described.

4.1 FACE MODEL PARAMETERS

Three kinds of parameters, see Figure 4.1, need to be transmitted in a model-based coding scheme; shape, texture, and animation parameters:

• Shape parameters need to be transmitted in the first frame only, to tell the 3-D shape of the head. Specifically, these parameters can be 3-D coordinates of facial feature points (FDP points, see Section 1.5, “MPEG-4 Face Animation”), vertices of a face model, or a complete 3-D mesh. The amount of data to be transmitted is in most cases so small that a simple representation will suffice. However, if a complex mesh is used, there are algorithms for compression of 3-D meshes developed within the VIDAS project [Vid] and included in MPEG-4 version 2 [MPV98b].

Figure 4.1: Face parameter coding. (Face model parameters are divided into shape parameter coding – FDP points, vertex coordinates, model deformation parameters, ...; animation parameter coding – FAPs, Action Units, ...; and texture coding – FDP texture, Eigenface coefficients, ....)


• The texture is an image to be mapped onto the face model. In its uncompressed form, the texture will by far constitute the largest part of the transmitted data, and compression is certainly needed. The most obvious thing to do is to use a waveform coder for the first frame and then only send differential images, or no images at all, for the following frames. In MPEG-4, the general scaleable wavelet still image coder [MPV98a] will be used for the facial texture of the first frame, and no updating of texture data will be done. A specialized and more efficient method is described in [Str98].

• The animation parameters consist of global and local motion parameters. The global motion is the rotation and translation of the face. The amount of data to be transmitted is very small, but it needs to be transmitted every frame. Consequently, some compression scheme may be used, like predictive coding. In MPEG-4, only translation is regarded as a global motion parameter, telling the position of the video object. Rotations are treated as local motion parameters. In our notation, local motion consists of the mimics and facial expressions. Since the local motion is rich in variation and applies to a large part of the face model, there will often be a need for compression. In MPEG-4 the local motion parameters are represented by Facial Animation Parameters (FAPs), and compression of those will be discussed next. In the previous chapter, a set of Action Units {a_i} was used as local motion parameters. In this chapter, we will show how to choose this set for efficient transmission.

4.2 MPEG-4 CODING OF LOCAL MOTION PARAMETERS

As described in Section 1.5, “MPEG-4 Face Animation”, face animation in MPEG-4 is carried out using Facial Animation Parameters (FAPs), where each (low-level) FAP denotes a movement of one facial feature point (see Figure 1.7 on page 10) in one out of three dimensions. There are a total of 66 low-level FAPs in the FAP set. If each FAP is coded with 6 bits, then approximately 400 bits are generated for a single frame. To drive a face animation, where a temporal resolution of 15 - 30 Hz is usually adopted, a transmission bandwidth of 6 - 12 kbit/s is required.

In this thesis, we only consider very low bit-rate applications, such as a mobile terminal using a cellular network (with a capacity of typically 8 kbit/s) where speech should be transmitted on the same channel. In this case, compression of the FAPs is obviously needed, and several techniques for this purpose have been suggested during the development of MPEG-4.

The first step is a masking technique. This approach is based on the observation that not all FAPs are engaged in a natural talking head sequence, or, even if they are, they may not all be extracted by the image analysis program. By first transmitting a binary mask, only FAP coefficients for active FAPs need to be transmitted. In addition, techniques for doing interpolation between the FAPs are used.

To compress the masked FAPs, a predictive coder that takes advantage of the temporal correlation of the FAPs is adopted in MPEG-4. In this coding scheme, the decoded parameter of the previous frame is used to predict the corresponding parameter of the current frame. The prediction error is then coded using an adaptive arithmetic coder.
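As a rough illustration of the masking and prediction just described – not the normative MPEG-4 coder – the following sketch transmits only the FAPs marked active and codes each one as a quantized difference against the previously decoded frame; the step size is an assumed parameter and the arithmetic entropy-coding stage is omitted.

import numpy as np

# Sketch of FAP masking and predictive coding: only FAPs marked active in the
# binary mask are transmitted, each as a quantized difference from the
# previously decoded value. Entropy coding is omitted; step size is assumed.

def encode_frame(faps, mask, previous_decoded, step=1.0):
    symbols = []
    decoded = previous_decoded.copy()
    for i in np.flatnonzero(mask):
        error = faps[i] - previous_decoded[i]
        q = int(round(error / step))                  # quantized prediction error
        symbols.append(q)
        decoded[i] = previous_decoded[i] + q * step   # keep encoder/decoder in sync
    return symbols, decoded

def decode_frame(symbols, mask, previous_decoded, step=1.0):
    decoded = previous_decoded.copy()
    for q, i in zip(symbols, np.flatnonzero(mask)):
        decoded[i] = previous_decoded[i] + q * step
    return decoded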


Combining these techniques, the bit-rate for the FAPs can be reduced to 2 kbit/s with a reasonable reconstruction quality.

To further reduce the bit-rate, a DCT-based compression technique is used as an alternative to the predictive/arithmetic coder. By grouping several frames and performing DCT in the temporal direction, high compression efficiency can be accomplished. A drawback with this approach is the large (inter-frame) coding delay, which makes it unsuitable for interactive applications.

None of the above-mentioned approaches perform intra-frame coding, i.e., there is no attempt to decorrelate the FAP coefficients within a frame. In this thesis we suggest a new coding scheme, which not only gives low coding delay but also has high compression efficiency. This scheme is inspired by early work on model-based coding where Action Units (see below) are used to organise low level animation parameters.

4.3 FACIAL MIMIC, ACTION UNITS AND BASIS FUNCTIONS

Although a single FAP specifies the movement of a particular point, the FAPs corresponding to a single frame are not independent of each other. In fact, the movements of key features are closely correlated; they are physically constrained by facial muscles and skin. According to the study of mimics [Hjo69, Ekm77], a facial expression is a high level description of facial motions, and it can be decomposed into Action Units. An Action Unit stands for a small change in the facial expression, dependent on a small conscious activation of muscles. The Action Unit is the lowest level of motion that is possible for a human face to perform, and will be helpful in the grouping of FAPs. For example, activating the chin raising muscle corresponds to one Action Unit but to several FAPs. Since it is not possible to activate only a part of the muscle, it seems natural to use one parameter only to describe the activation of one Action Unit.

Unfortunately, Action Units can not be used directly since they are model-dependent; the Action Units defined in [Hjo69] or [Ekm77] are verbal and pictorial descriptions of facial motion. What can be done, though, is to implement the Action Units for a specific face model, e.g., as in Table 3 for the CANDIDE model [Ryd87], where an Action Unit is associated with vertex displacements.

Table 3: AU 17 Chin raiser

Vertex     X      Y      Z
  9        0    -30      0
  8        0    -20      0
  7        0    -15      0
 40        0    -15    -10

Alternatively, we can use the Action Unit descriptions to group FAPs activated by the same muscle and estimate the FAP coefficient values for a fully activated Action Unit. We can express this as f_i = a_j b_ij, where the estimated constant b_ij tells how much the i:th FAP is activated by the j:th Action Unit. The resulting vector b_j = [b_1j, b_2j, …, b_Mj]^T, where M is the number of low-level FAPs, describes how to translate one Action Unit to a set of FAPs. We will call such a


vector a Facial Action Basis Function (FBF). Our hope is that a small set of FBFs will be enough for a good description of the possible facial motion, i.e., we hope that facial motion is contained in a subspace of the 66-dimensional space spanned by the low-level FAPs. The FBFs should then constitute a basis for this subspace, thus the name Facial Action Basis Function.

The simplest way to create the FBFs is by direct translation from an implementation of Action Units and from pictorial examples. That is, the constraints regarding movement of points specified by an Action Unit are used to quantitatively characterise the relative movement of the FAPs. The following is an example to show how such a translation is done. Table 3 shows an Action Unit associated with the CANDIDE model. Table 4 shows the corresponding FBF expressed in FAPs.*

Let us now assemble the FAP-coefficients of frame k into a vector f_k = [f_1^k, f_2^k, …, f_M^k]^T. The superposition principle [LiRF93] then says that any non-rigid motion can be synthesised by a weighted linear combination of basis functions {b_j}, j = 1, …, N, in our case

$f_k = \sum_{j=1}^{N} a_{jk} b_j$,   [Eq. 4.1]

where a_jk is the coefficient associated with the j:th FBF in frame k.† The point of the superposition principle is that linear combination actually is a good way to combine two facial actions, i.e., if several Action Units are activated, the displacements of facial feature points (approximately) equal the sum of the displacements due to the individual Action Units.

Table 4: FBF 9 Chin Raiser

FBF    FAPs                  FAP units
 9     18                    512 MNS
       5, 52                 -64 MNS
       10 - 11, 57 - 58      -32 MNS

It should be noted that it is easier to extract higher-level descriptors like FBFs than FAPs (describing the motion of only one point) from an image sequence. This is because extraction and tracking of single points tend to be unstable, while Action Units or FBFs can be computed from, for example, optical flow data from a large part of the image, giving more reliable data (see, e.g., [LiRF93]).

*) The FAP unit MNS means mouth-nose distance (divided by 1024), and is a way to make the FAPs model-independent. Several such FAP units are defined in MPEG-4. In contrast, the Action Units implemented in CANDIDE are measured in pixels and will therefore be dependent on the shape and size of the face.

†) If N = M and all the FBFs are linearly independent, this is the same as saying that N linearly independent vectors span the N-dimensional space.


In a system which analyses and compresses a face image sequence, the FBF-coefficients should consequently be computed during the image analysis step, and not during the compression step.

Improved Facial Action Basis Functions

As described above, the FBFs are created by simply converting Action Units. This ad-hoc process does in no way guarantee that we find an optimal set of FBFs. A straightforward way to create FBFs for compression is to collect a database of statistics on facial motion and do a Principal Component Analysis (PCA). However, this would probably destroy the semantic and anatomical meaning of the FBFs, and also require a lot of data. The dependency on a large training set is disturbing, since available test data is quite sparse.

A slightly different way is to split the M-dimensional FAP-space into subspaces. It is clear that FBFs having the same semantic meaning should belong to the same subspace, and by performing a PCA in each subspace, and then picking the most significant components, a new set of FBFs is created (a sketch of such a per-subspace PCA is given below, after Table 5). Since each PCA is done in a low-dimensional space, it does not require as much data as in a high-dimensional space to be an equally good estimate in terms of variance.

Table 5: The new set of FBFs

FBF   Name                      AUs           FAPs
 1    Outer brow raiser         2             31 - 38
 2    Brow lowerer              4             31 - 38
 3    Eyes closed               43            19 - 22
 4    Blink                                   19 - 22
 5    Look left                               23 - 24
 6    Look down                               25 - 26
 7    Mouth cap                 10, 15, 17    4 - 13, 18, 51 - 60
 8    Mouth open                16, 25        4 - 13, 18, 51 - 60
 9    Jaw drop                  26, 27        3
10    Lip stretcher             20            4 - 13, 39 - 42, 51 - 60
11    Asymm. lip stretcher      As FBF 10
12    Lip corner pull           12            4 - 13, 39 - 42, 51 - 60
13    Asymm. lip corner pull    As FBF 12
14    Head down                               48
15    Head turn left                          49
16    Head roll right                         50

In fact, the new basis functions are very similar to the original FBFs, but better tuned to human facial motion, or, at least, to the motion in the training data. In this way, sixteen FBFs are created, briefly described in Table 5. The column “AUs” tells to which Action Units each FBF corresponds, and the column “FAPs” tells which MPEG-4 FAPs each FBF influences.
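The following is a minimal sketch, under assumed data shapes, of the per-subspace PCA outlined above: FAPs are grouped into predefined index sets, a PCA is performed within each group on training FAP vectors, and the leading components are re-embedded as candidate FBFs in the full M-dimensional space. The grouping, index numbering and variable names are illustrative assumptions, not the actual training setup.

import numpy as np

# Sketch of per-subspace PCA for creating FBFs. 'training' holds FAP vectors
# (frames x M); 'groups' lists the FAP indices forming each semantic subspace.
# Index sets and the number of kept components are illustrative assumptions.

def subspace_fbfs(training, groups, components_per_group=1):
    M = training.shape[1]
    fbfs = []
    for idx in groups:
        sub = training[:, idx]                       # frames x |subspace|
        sub = sub - sub.mean(axis=0)                 # remove the mean motion
        _, _, vt = np.linalg.svd(sub, full_matrices=False)
        for comp in vt[:components_per_group]:       # most significant directions
            b = np.zeros(M)
            b[idx] = comp                            # embed back into FAP space
            fbfs.append(b)
    return np.column_stack(fbfs)                     # reconstruction matrix B

# Example with made-up groups (e.g. brow-related FAPs, eyelid-related FAPs):
training = np.random.rand(200, 66)
groups = [list(range(30, 38)), list(range(18, 22))]
B = subspace_fbfs(training, groups)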


Note that not all FAPs are influenced by the FBFs. The reason is that the FBFs are designed for low bit-rate applications only, and that they are aimed at real human faces. Consequently there are, for example, no FBFs representing displacement in depth of the eyeballs (FAPs 27 & 28), since this is regarded as a very uncommon event in a videophone situation. It should be pointed out that many FBFs (1 - 3, 6 - 10 and 12) are left/right symmetrical. By skipping the asymmetrical FBFs (4, 11, 13) in the encoding process, additional compression can be achieved at the cost of having a symmetrically animated face. This does not generally decrease the visual quality very much.

4.4 THE ENCODING – DECODING PROCESS

As mentioned above, the FBF-coefficients should ideally be computed during the image analysis step. However, in some situations the image sequence may not be available, and the FBF-coefficients are then computed from the FAP-coefficients as described below. The temporal coding and the decoding are naturally the same in either case.

Coding: Computation of FBF-coefficients

When the basis functions {b_j} and the FAP-vector f_k are available, the least square method can be used to compute the FBF-coefficients {a_jk}. Eq. 4.1 can be written on matrix format,

$f_k = B a_k$,   [Eq. 4.2]

where a_k = [a_1k, a_2k, …, a_Nk]^T and B = [b_1, b_2, …, b_N]. The dimension of the reconstruction matrix B will be M × N, where M is the number of (low-level) FAPs and N is the number of FBFs. Due to the clustering property of FAPs, the matrix B is sparse, which will simplify the computation. The FBF-coefficients can then for each frame be computed as

$a_k = B_{LS} f_k$,   [Eq. 4.3]

where

$B_{LS} = (B^T B)^{-1} B^T$.   [Eq. 4.4]

Since the basis functions are available, B_LS can be computed off-line. As mentioned, a FAP-mask, telling which FAPs are used, is also transmitted in MPEG-4. If the number of used FAPs equals m < M, only m of the rows in B are useful. Consequently, the rows of B corresponding to masked FAPs should be removed. FBFs influencing unused FAPs only will then be meaningless, and the corresponding (all-zero) columns should be removed as well. Note that this procedure is perfectly reconstructible in the decoder, since it depends on B and the FAP-mask only. Unfortunately, this means that B_LS must be recalculated each time the FAP-mask is changed (the alternative, to store a version of B_LS for each possible FAP-mask, is unrealistic). Preferably, a faster and more stable algorithm than the matrix inversion in Eq. 4.4 is used, e.g., singular value decomposition.
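A minimal sketch of this computation, assuming B is available as a dense numpy array and the FAP-mask as a boolean vector; the function names are illustrative, and the pseudo-inverse stands in for the SVD-based solution suggested above.

import numpy as np

# Sketch of Eq. 4.2 - 4.4 with FAP-mask handling: masked FAP rows are removed,
# FBF columns that become all-zero are dropped, and the least-squares solution
# is obtained via the pseudo-inverse (SVD-based). Names are illustrative.

def build_bls(B, fap_mask):
    B_used = B[fap_mask, :]                          # keep rows of active FAPs
    active_fbfs = np.any(B_used != 0, axis=0)        # drop all-zero columns
    B_used = B_used[:, active_fbfs]
    return np.linalg.pinv(B_used), active_fbfs       # B_LS and which FBFs remain

def compute_fbf_coefficients(B, fap_mask, faps):
    B_ls, active_fbfs = build_bls(B, fap_mask)
    a = B_ls @ faps[fap_mask]                        # a_k = B_LS f_k on active FAPs
    return a, active_fbfs

# Example with made-up sizes: 66 FAPs, 16 FBFs, half the FAPs active.
B = np.random.rand(66, 16)
mask = np.arange(66) < 33
a_k, used_fbfs = compute_fbf_coefficients(B, mask, np.random.rand(66))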


Optionally, an FBF-mask could be transmitted instead of, or as a complement to, the FAP-mask. All FAPs influenced by the transmitted FBFs will then be animated by the decoder, thus doing FAP-to-FAP interpolation. For instance, if only the left side FAPs are available at the encoder, the right side FAPs will be interpolated automatically and simultaneously with the decoding process, without increasing the complexity of the decoding. (In the experiments described in Section 4.5 below, only a FAP-mask was transmitted, though.)

Temporal Coding and Decoding

To achieve compression, we have to quantise the FBF-coefficients and follow up by entropy coding. To increase compression efficiency, the FBF-coefficients should also be encoded to remove temporal correlation. The same temporal coding schemes as used on the FAPs can be applied to the FBFs, i.e., either predictive or DCT-coding followed by quantization and entropy coding (arithmetic or Huffman coding). A block schematic diagram of the encoder using predictive temporal coding is shown in Figure 4.2.

The decoder first performs the temporal decoding and inverse quantization. The reconstruction matrix is then modified according to the received FAP-mask, by setting the masked rows to zero. The vector of incoming FBF-coefficients is then simply multiplied with the modified reconstruction matrix B, resulting in a vector of FAP-coefficients. A block schematic diagram of the decoder using predictive temporal coding is shown in Figure 4.3.

Figure 4.2: A block schematic diagram of the encoder using predictive temporal coding. (Blocks: FAP-coeffs. → least square fit; FAP-mask → matrix modification; prediction loop with quantization, inverse quantization and memory; entropy coding producing the encoded FBF-coeffs.)


Figure 4.3: A block schematic diagram of the decoder using predictive temporal coding. (Blocks: entropy decoding of the encoded FBF-coeffs., inverse quantization, prediction loop with memory, matrix modification from the FAP-mask, and matrix multiplication producing the FAP-coeffs.)
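As a rough counterpart to the block diagrams in Figures 4.2 and 4.3, the following sketch runs the same kind of prediction/quantization loop on FBF-coefficient vectors; the step size is an assumed parameter, and the entropy-coding stage (Huffman or arithmetic) as well as the FAP-mask handling are left out.

import numpy as np

# Sketch of predictive temporal coding of FBF coefficients: each frame's
# coefficients are predicted by the previously decoded frame, the prediction
# error is uniformly quantized, and the decoder mirrors the loop.
# Entropy coding is omitted; the step size is an illustrative assumption.

def encode_sequence(fbf_frames, step=0.05):
    prev = np.zeros_like(fbf_frames[0])
    symbols = []
    for a in fbf_frames:
        q = np.round((a - prev) / step).astype(int)  # quantized prediction error
        symbols.append(q)
        prev = prev + q * step                       # decoder-side reconstruction
    return symbols

def decode_sequence(symbols, n_coeffs, step=0.05):
    prev = np.zeros(n_coeffs)
    frames = []
    for q in symbols:
        prev = prev + q * step
        frames.append(prev.copy())
    return frames

# In the decoder, the FAPs are then recovered by multiplying each decoded
# FBF-coefficient vector with the (mask-modified) reconstruction matrix B.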

4.5 EXPERIMENTAL RESULTS

Experiments have been performed to compress FAP-sequences each containing several hundred frames. A training set totalling 2000 frames, based on natural data from the MPEG-4 test data set, was used. The FBF-coefficients have been computed using least-square fitting by a singular value decomposition routine from LAPACK [And90]. The coefficients have then been uniformly quantized (with variable quantization step to vary the bit-rate), predictively coded and Huffman-coded. One FAP-mask, but no FBF-mask, has been transmitted in each sequence. For fair comparison, the FBF-coder has been incorporated in the MPEG-4 scheme, thus transmitting the same overhead (start-codes, etc.) as the MPEG-4 coders.

The FBF-coder has been compared to the predictive/arithmetic coder used in MPEG-4 for real-time FAP-compression. Also, the FBF-coder has been compared to FAPs coded with DCT. It should be noted, though, that these techniques do not have the same application (real-time vs. non real-time). However, it is interesting to compare them since the DCT-method is the most efficient way of compressing the FAPs in MPEG-4. The results are shown in Figure 4.4 in terms of PSNR vs. bit-rate. As can be seen, the FBF-method competes well with the DCT-method, and is superior to the predictive/arithmetic coder.
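For reference, a rate/distortion point like those in Figure 4.4 could be measured along the following lines, assuming the original and reconstructed FAP sequences are available as arrays; the peak value used in the PSNR is an assumption, since the exact normalisation is not stated here.

import numpy as np

# Sketch of measuring a rate/distortion point: PSNR between original and
# reconstructed FAP sequences vs. average bits per frame. The peak value is
# an assumed normalisation, not necessarily the one used in the experiments.

def psnr(original, reconstructed, peak):
    mse = np.mean((np.asarray(original) - np.asarray(reconstructed)) ** 2)
    return float("inf") if mse == 0 else 10 * np.log10(peak ** 2 / mse)

def average_rate(bits_per_frame):
    return np.mean(bits_per_frame)   # average bits/frame over the sequence

# Example with made-up data: 300 frames, 66 FAPs.
orig = np.random.rand(300, 66)
recon = orig + 0.01 * np.random.randn(300, 66)
print(psnr(orig, recon, peak=1.0), "dB")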

4.6 FUTURE IMPROVEMENTS

There are several ways to improve this scheme. For example, the FBFs should be analysed with respect to subjective quality of the animation so that different quantization steps could be used on different FBFs. A subjective quality analysis could also be used to influence the least square fitting. If some FAPs are much more important than others, a weighted least square fitting should be done.


Figure 4.4: Rate/distortion (PSNR in dB vs. bits/frame) for FAPs compressed using predictive/arithmetic coding (crosses), DCT (straight line), and FBFs. Four test sequences are shown: aleta.fap (300 frames), emotions.fap (1050 frames), Marco30.fap (1542 frames), and expressions.fap (625 frames).

Further, to add the possibility for “nearly lossless” coding, even more FBFs should be added, to make them span the 66-dimensional space of low-level FAPs. This would turn the coder into an ordinary 66-dimensional transform coder, with the least-square fitting replaced by a change of basis. However, since the FAP mask redefines the subspace to be analysed in each intra frame, a different transform would be needed for each such subspace. It is not realistic to calculate a transform for each possible value of the FAP mask, and therefore a way to dynamically calculate the extension of the FBFs should be investigated. Also, the Huffman coding used on the predictively coded FBF-coefficients could be replaced by the adaptive arithmetic coder used for compressing FAPs in MPEG-4. This would decrease the complexity of the decoder, since the arithmetic decoder is already present in any decoder compliant with MPEG-4 Face Animation. If real-time operation is not required, the DCT-coding could be performed on the FBF-coefficients as well.


4.7 CONCLUSION

In this chapter a method for representing and compressing facial animation parameters has been described. The method gives efficient compression and no inter-frame delay, and is therefore well suited for real-time applications such as a video-phone system. Also, the coefficients that should be transmitted can be extracted during the image analysis step, thus adding no complexity to the encoder. The coding scheme currently supports lossy coding only, but can in the future be extended to perform nearly lossless coding as well. Different ways of improving the coding performance should also be investigated. In conclusion, we argue that the animation parameters in a model-based video system are possible to transmit at bit-rates around 25 - 50 bits/frame, i.e., around 500 - 1000 bit/s.


Chapter 5. CONCLUSIONS

Here the conclusions drawn during the work described in the thesis are presented, along with a discussion on how to build a complete model-based coding system. Some directions of future work are also discussed.

5.1 CONCLUSIONS REGARDING INVESTIGATED TOOLS

A number of tools for use in different parts of a model-based coding system have been investigated. The major conclusions will be discussed here.

Face Detection

Algorithms based on colour discrimination have been implemented, and have been shown to be very fast, but not very precise. However, used as a preprocessing step, excluding large parts of an image and thus making the search space smaller for a more precise method, the colour discriminator is very useful. Statistical template matching – based on face spaces – of deformed objects is more robust and precise, but also much more computationally demanding. Using different approaches to lower the computing time (colour discriminators and a non-greedy heuristic search method), it is possible to detect a face in near real-time.

Facial Feature Extraction

The algorithm suggested by J. Daugman for detecting irises has been investigated and has been shown to deliver precise results with short computation time. The method can, however, easily be fooled by other, typically synthetic, circular objects in the image.

Deformable templates, using dynamic programming for calculation of the optimal deformation, are a popular and useful tool for facial feature extraction. The approach is not very intelligent, though, and can deform to fit all kinds of objects, not at all resembling the object sought for. The deformation of templates has therefore been combined with the scheme for statistical template matching, thus avoiding the high computational complexity of a search in the face space. A more general version of the deformable template, a deformable graph, has been


implemented with promising results, extracting the most important facial features in a quarter of a second on a desktop computer.

Face Model Parameter Coding

Only the compression of local motion parameters (facial animation parameters) has been studied here. A method for facial texture compression has been presented elsewhere, and compression of global motion parameters and face shape is of less interest. Regarding local motion parameters, the conclusion is that by choosing a good basis for facial action, thus removing the correlation between the motion of different feature points, it is possible to achieve high compression and reach bit-rates lower than one kbit/s.

5.2 THE CONSTRUCTION OF A MODEL-BASED CODING SYSTEM

Using the techniques described in this thesis, a rather simple model-based coder can be constructed. What is missing is the tracking, synthesis and texture coding parts. All of these have been treated by others and shown to work reasonably well. The proposed system should then consist of the following components (illustrated in Figure 5.1):

• A colour discriminator quickly discarding the largest part of the image as not containing human skin (see Chapter 2).

• A face detector, working on the image area left by the colour discriminator, using simulated annealing and statistical pattern matching to find the location, rotation and approximate shape of the face (see Chapter 2).

• A facial feature extractor using dynamic programming to find the optimal deformation of a deformable template using edge detectors, iris detectors, and feature space (DFFS) matches as attractors (see Chapter 3).

• A model adaptation scheme using a limited set of allowed deformations to adapt the wireframe model to the extracted features (see Chapter 3). The allowed shape deformations (Shape Units) should preferably be calculated from a large database of face shapes (by principal component analysis). As Action Units, the Facial Action Basis Functions (FBFs) should be used, thus enabling efficient transmission (see Chapter 4).

• Entropy coding of the FBFs; predictive coding and Huffman coding are used here (see Chapter 4).

Figure 5.1: A simple model-based coding system. (Pipeline: image sequence → colour discriminator → face detection → feature extraction → model adaptation → entropy coding.)



To this scheme, a tracking module should be added, to avoid the execution of the face detector and feature extractor for each frame. Also, the facial texture compression scheme from [Str98] should be integrated, resulting in the scheme in Figure 5.2. The tracking module should be active in each frame (except the first), and the face detection and feature extraction modules should be active in the first frame, and possibly on each n:th frame to support the tracking. To allow analysis-by-synthesis coding, a synthesis module naturally needs to be added as well.

Figure 5.2: An extended model-based coding system. (Modules: image sequence → colour discriminator → face detection → feature extraction, complemented by tracking, model adaptation, texture coding and entropy coding.)

Applicability

The colour discriminator, the model adaptation and the entropy coding are non-critical parts in terms of computational complexity, and can most certainly be performed in real-time on most of today's hardware. The face detection and the feature extraction can be run in near real-time, but as stated above, they should not operate on each frame. Even if the current performance needs to be improved somewhat, the goal seems to be within reach.

In conclusion, a complete model-based communication system should be possible to implement in the near future, based on these techniques, with the addition of tracking, synthesis, and texture compression. It should also be noted that several decoders/synthesizers for MPEG-4 Face Animation are already under construction, ensuring that the synthesis part will soon be easily available.

5.3 FUTURE WORK

In Chapter 4, several ways of improving the compression of animation parameters are mentioned, and there is still an endless list of things to do in the detection and feature extraction areas. However, some topics are closer at hand, as direct continuations of this work:


• Different search methods for the face detection problem should be tried out (simplex, optical flow based refinement, combinations of different methods). Also, the search could be guided by clues from the difference image.

• The different distance measures (Section 3.4) should be evaluated.

• In the scheme for feature extraction described in Section 3.5, the pattern matching templates do not reshape. Alternatively, a complete face model, where only parts of the texture are used, could be used already in the face detection step, thus integrating the face detection, feature extraction and model adaptation. For example, the model could be applied without texture on the least interesting parts (forehead, cheeks). On the more interesting parts, a low-order DFFS could be measured, and on the most critical parts (eyes, mouth), a full Mahalanobis distance measurement could be performed.

• The feature extraction could be more tightly coupled to the face detection, letting the face detection results and confidence value determine the deformability of the graph.

• Graphs with cycles should be studied. Such graphs are probably better suited for adaptation to the human face, but will increase the computation time. However, even if the Viterbi algorithm should be three times as complex (a reasonable value), this would increase the total computing time by only approximately 40 % in our implementation.

• Both the face detection and the feature extraction should work on a coarse-to-fine image scale hierarchy (the VIDAS face detection algorithm already does).

• The full system, as shown in Figure 5.2, should be implemented.

• A test bench should be constructed, including the total system and with easily changeable parts, thus enabling experimentation and comparison of different algorithms. This should also be used to collect a large database of face model parameters, thus allowing the computation of principal deformations (Shape Units), improved Facial Action Basis Functions, and eigenfaces for texture compression.


REFERENCES

[Ahl98]

J. Ahlberg and H. Li "Representing and Compressing MPEG-4 Facial Animation Parameters using Facial Action Basis Functions" To appear in IEEE Trans. on Circuits and Systems for Video Technology in the first quarter of 1999.

[Aiz87]

K. Aizawa, H. Harashima, and T. Saito “A model-based image coding system - construction of a 3-D model of a person’s face” Proc. Int. Picture Coding Symposium (PCS’87), Stockholm, Sweden, 1987, paper 3.11.

[Aiz93]

K. Aizawa, C. S. Choi, H. Harashima, and T. S. Huang “Human Facial Motion Analysis and Synthesis with Applications to Model-Based Coding” In Motion Analysis and Image Sequence Processing, Editors M. I. Sezan and R. L. Lagendijk, Kluwer Academic Publishers, 1993.

[Ald92]

Aldus TIFF Developer’s Toolkit, Revision 6.0 Aldus Corp., Seattle, WA, U.S.A., 1992.

[And90]

E. Anderson et al. LAPACK: A portable linear algebra library for high-performance computers Computer Science Dept., Technical Report CS-90105, University of Tennessee, Knoxville, U.S.A., 1990.

[Bel62]

R. E. Bellman Applied Dynamic Programming Princeton University Press, 1962.

[Bet94]

M. Betke and N. C. Makris Fast Object Recognition in Noisy Images Using Simulated Annealing A.I. Memo No. 1510, Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, U.S.A., 1994.

[Com89]

Compuserve Graphics Interchange Format: Version 89a Compuserve Inc., Columbus, OH, U.S.A., 1989.


[Coo98]

T. F. Cootes, G. J. Edwards, and C. J. Taylor "Active Appearance Models" Proc. European Conf. on Computer Vision (ECCV’98), Vol. 2, pp. 484 - 498, Editors H. Burkhardt and B. Neumann, Springer-Verlag, 1998.

[Cos96]

N. P. Costen, I. G. Craw, G. J. Robertson, and S. Akamatsu “Automatic face recognition: What representation?” Proc. European Conf. on Computer Vision (ECCV’96), Vol. I, No. 1064 in “Lecture Notes on Computer Science”, Editors B. Buxton and R. Cipolla, Springer-Verlag, pp. 504 - 513, 1996.

[Cra87]

I. Craw, H. Ellis, and J. Lishman “Automatic extraction of face features” Pattern Recognition Letters, Vol. 5, pp. 183 - 187, 1987.

[Dau93]

J. Daugman “High confidence visual recognition of persons by a test of statistical independence” IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 15, No. 11, pp. 1148 - 1161, 1993.

[Ekm77]

P. Ekman and W. V. Friesen Facial Action Coding System Consulting Psychologists Press, 1977.

[For83]

R. Forchheimer and O. Fahlander “Low bit-rate coding through animation” Proc. Int. Picture Coding Symposium (PCS’83), Davis, CA, U.S.A, 1983, pp. 113 - 114.

[For84]

R. Forchheimer, O. Fahlander, and T. Kronander “A semantic approach to the transmission of face images” Proc. Int. Picture Coding Symposium (PCS’84), Cesson-Sevigne, France, 1984, paper 10.6.

[Gei95]

D. Geiger, A. Gupta, L. A. Costa, and J. Vlontzos “Dynamic Programming for Detecting, Tracking, and Matching Deformable Contours” IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 17, No. 3, pp. 294 - 302, March 1995.

[Har89]

H. Harashima, K. Aizawa, and T. Saito “Model-based analysis-synthesis coding of videotelephone images conception and basic study of intelligent image coding” Trans. of IEICE, Vol. E72, No. 5, pp. 452 - 258, 1989.


[Hel95]

C. W. Helstrom Elements of signal detection and estimation Englewood Cliffs, NJ, USA, 1995.

[Her85]

F. Herbert Chapter House Dune New English Library, 1985.

[Hjo69]

C.-H. Hjortsjö Människans ansikte och det mimiska språket (In Swedish, “Man’s Face and the Mimic Language”), Studentlitteratur, Lund, Sweden, 1969.

[Jen96]

F. V. Jensen An introduction to Bayesian Networks Springer-Verlag, 1996.

[Jol86]

I. T. Joliffe Principal Component Analysis Springer-Verlag, New York, 1986.

[Ker97]

C. Kervrann, F. Davoine, P. Perez, R. Forchheimer, and C. Labit ”Generalized Likelihood Ratio-Based Face Detection and Extraction of Mouth Features” Pattern Recognition Letters, Vol. 18, No. 9, September 1997.

[Li93]

H. Li Low Bitrate Image Sequence Coding Linköping Studies in Science and Technology, Dissertations No. 318, Dept. of Electrical Engineering, Linköpings universitet, Sweden, 1993.

[LiF93]

H. Li and R. Forchheimer “Location of Face Using Color Cues” Proc. Int. Picture Coding Symposium (PCS’93), March 1993.

[LiLF94]

H. Li, A. Lundmark, and R. Forchheimer “Image sequence coding at very low bit rates: A review” IEEE Trans. on Image Processing, Vol. 3, pp 589 - 609, September 1994.

[Lin83]

S. Lin and D. J. Costello “Maximum Likelihood Decoding of Convolutional Codes” Error Control Coding - Fundamentals and Applications, Chapter 11, pp. 315 - 349, Prentice-Hall, Inc., New Jersey, 1983.


[LiRF93]

H. Li, P. Roivainen, and R. Forchheimer “3-D Motion Estimation in Model-Based Facial Image Coding” IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 15, No. 6, pp. 545 - 555, June 1993.

[Loe55]

M. Loève Probability Theory Van Nostrand, Princeton, NJ, U.S.A., 1955.

[M2V]

M2VTS Multi Modal Verification for Teleservices and Security applications European ACTS project, contract AC102.

[Mai97]

D. Maio and D. Maltoni "Fast Face Location in Complex Backgrounds" Proc. NATO ASI on Face Recognition: From Theory to Applications, Stirling, Scotland, June 1997, pp. 568 - 577.

[Mog97]

B. Moghaddam and A. Pentland “Probabilistic Visual Learning for Object Representation” IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 19, No. 7, pp. 696 - 710, July 1997.

[MPS98]

MPEG Working Group on Systems Coding of Audio-Visual Objects: Final Draft of International Standard, Part 1 (Systems) ISO 14496-1, Doc. No. N2501 of JTC1 /SC29 /WG11 Atlantic City meeting, October 1998.

[MPV98a]

MPEG Working Group on Visual Coding of Audio-Visual Objects: Final Draft of International Standard, Part 2 (Visual) ISO 14496-2, Doc. No. N2502 of JTC1 /SC29 /WG11 Atlantic City meeting, October 1998.

[MPV98b]

MPEG Working Group on Visual Visual Working Draft, Version 2 Rev 5.0 ISO 14496-2, Doc. No. N2473 of JTC1 /SC29 /WG11 Atlantic City meeting, October 1998.

[Noh96]

R. Nohre Recognition of Deformed Patterns by Dynamic Programming Report LiTH-ISY-R-1823, Dept. of Electrical Engineering, Linköpings universitet, Sweden, 1996.


[Omu69]

J. K. Omura “On the Viterbi Decoding Algorithm” IEEE Trans. on Information Theory, Vol. 15, pp. 177 - 179, January 1969.

[Par82]

F. Parke “Parameterized models for face animation” IEEE Computer Graphics Applications Magazine, Vol. 12, pp. 61 - 68, November 1982.

[Pea95]

D. E. Pearson “Development in Model-Based Video Coding” Proc. of the IEEE, Vol. 83, No. 6, pp. 892 - 906, June 1995.

[Pen93]

W. B. Pennebaker and J. L. Mitchell JPEG Still Image Data Compression Standard Van Nostrand Reinhold, New York, 1993.

[Row96]

H. Rowley, S. Baluja, and T. Kanade “Neural Network-Based Face Detection” Computer Vision and Pattern Recognition, 1996.

[Ryd87]

M. Rydfalk, CANDIDE, a parameterized face Internal report LiTH-ISY-I-0866, Dept. of Electrical Engineering, Linköpings universitet, Sweden, 1987.

[Sca]

SCALAR Scaleable Architectures with Hardware Extensions for Low Bitrate Variable Bandwidth Real-time Videocommunication European ACTS project, contract AC077.

[Sha48]

C. E. Shannon "A mathematical theory of communication" Bell Systems Technical Journal, Vol. 27, pp. 379 - 423, 623 - 656, 1948.

[Sob96]

K. Sobottka and I. Pitas “Extraction of facial regions and features using color and shape information” Proc. 13th IAPR, 1996, pp. 421 - 425.

[Son99]

M. Sonka, V. Hlavac, R. Boyle Image Processing, Analysis, and Machine Vision PWS Publishing at Brooks/Cole Publishing Company, 1999.


[Str97]

J. Ström, F. Davoine, J. Ahlberg, H. Li, and R. Forchheimer “Very Low Bit Rate Facial Texture Coding” Proc. Int. Workshop on Synthetic-Natural Hybrid Coding and 3-D Imaging (IWSNHC3DI’97), Rhodes, Greece, September 1997, pp. 237 - 240.

[Str98]

J. Ström Facial Texture Compression for Model-Based Coding Linköping Studies in Science and Technology, Thesis No. 716 (Licentiate Thesis), Dept. of Electrical Engineering, Linköpings universitet, Sweden, 1998.

[Sun94]

K.-K. Sung and T. Poggio Example-based Learning for View-based Human Face Detection A.I. Memo No. 1521, Artificial Intelligence Laboratory, Massachusetts Institute of Technology, MA, U.S.A., 1994.

[Sun97]

Q. B. Sun, C. P. Lam, and J. K. Wu "A Practical Automatic Face Recognition System" Proc. NATO ASI on Face Recognition: From Theory to Applications, Stirling, Scotland, June 1997, pp. 568 - 577.

[Swa91]

M. Swain and D. Ballard “Color indexing” Int. Journal of Computer Vision, Vol. 7, No. 1, 1991, pp. 11 - 32.

[Tha96]

N. A. Thacker and T. F. Cootes “Vision Through Optimization” BMVC Tutorial Notes, Edinburgh, Scotland, September 1996.

[Tur91]

M. Turk and A. Pentland “Eigenfaces for Recognition” Journal of Cognitive Neuroscience, Vol. 3, No. 1, 1991.

[Vid]

VIDAS Video Assisted with Audio Representation and Coding European ACTS project, contract AC057.

[Vie97]

E. Viennet and F. Fogelmann-Soulié “Connectionist Methods for Human Face Processing” Proc. NATO ASI on Face Recognition: From Theory to Applications, Stirling, Scotland, June 1997, pp. 124 - 156.

[Vil98]

V. Vilaplana, F. Marqués, P. Salembier, and L. Garrido "Region Based Segmentation and Tracking of Human Faces" Proc. European Signal Processing Conf. (EUSIPCO’98), pp. 311 - 314, 1998.


[Vit67]

A. J. Viterbi “Error Bounds for Convolutional Codes and an Asymptotically Optimum Decoding Algorithm” IEEE Trans. on Information Theory, Vol. 13, pp. 260 - 269, April 1967.

[Wec98]

H. Wechsler, P. J. Phillips, V. Bruce, F. Fogelman Soulié, and T. S. Huang (editors) Face Recognition - From Theory to Applications NATO ASI Series F: Computer and Systems Sciences, Vol. 163, ISBN 3-540-64410-5, Springer-Verlag, Berlin Heidelberg, 1998.

[Wel87]

W. J. Welsh “Model-based coding of moving images at very low bit rates” Proc. Int. Picture Coding Symposium (PCS’87), Stockholm, Sweden, 1987, paper 3.9.

[Wib96]

N. Wiberg Codes and Decoding on General Graphs Linköping Studies in Science and Technology, Dissertations No. 440, Dept. of Electrical Engineering, Linköpings universitet, Sweden, 1996.


Licentiate in Technology Dissertations Image Coding Group, Linköping University

T. Kronander, Coding of Image Sequences with Consideration to the Human Observer, Linköping Studies in Science and Technology, Thesis No. 29, 1986.
A. Källdahl, Simultaneous Estimation of Motion and Shape from a Sequence of Feature Points Projections, Linköping Studies in Science and Technology, Thesis No. 194, 1989.
P. Roivainen, Motion Estimation in Model Based Coding of Human Faces, Linköping Studies in Science and Technology, Thesis No. 225, 1990.
R. Nohre, Object Recognition Based on Data Compression, Linköping Studies in Science and Technology, Thesis No. 267, 1991.
M. Novak, Attractor Coding of Images, Linköping Studies in Science and Technology, Thesis No. 382, 1993.
J. Ström, Facial Texture Compression for Model-Based Coding, Linköping Studies in Science and Technology, Thesis No. 716, 1998.
