Very Low Bitrate Video Communication
A Principal Component Analysis Approach

Ulrik Söderström

Department of Applied Physics and Electronics, Umeå University, Umeå 2008

Digital Media Lab
Department of Applied Physics and Electronics
Umeå University
SE-901 87 Umeå, Sweden

Thesis for the degree of Doctor of Technology (teknologie doktorsexamen) in Applied Electronics at Umeå University
ISSN 1652-6295:11
ISBN 978-91-7264-644-5
Author's email: [email protected]
Typeset by Ulrik Söderström in LaTeX 2ε
Printed by Arkitektkopia, Umeå University, Umeå, September 2008

To Leif. You are the best.

Acknowledgement

I want to thank Professor Haibo Li for all the advice and guidance he has provided me with over the years. Without his help and support I would not have been able to pursue my research. Thanks to everyone who has been associated with Digital Media Lab, both in the past and present, and has helped me in one way or another. Special gratitude goes to Zhengrong Yao, who helped me several times in the beginning of my research career. Thanks to Sara, Ville, Peter, Shafiq and Nazanin. Our breaks and lunches together have made my everyday life feel better many times. A big thank you to all the staff at TFE. An extra thank you to Annemaj, Mona-Lisa and Valentina, who always have time and a helping piece of advice. Thanks to Leif and Sven, who built a piece of equipment for me for my Master's thesis project in the spring of 2001. The "helmet" was only supposed to be used for my Master's thesis, but it has followed me throughout my entire research career. Thanks to everyone I have collaborated with during my time as a researcher, especially Greger Wikstrand and Anders Broberg. Thanks to Tomas Andersson Wij. Now I know that I am worth so much more. Thanks to my friends, who are there for me in both good and bad times. Thanks to my family, who have supported and helped me through many things in life. I will never forget that. Thanks for everything, Elin and Smilla.

Abstract

A large amount of the information in conversations comes from non-verbal cues such as facial expressions and body gestures. These cues are lost when we do not communicate face-to-face. But face-to-face communication does not have to happen in person. With video communication we can at least deliver information about the facial mimic and some gestures. This thesis is about video communication over distances; communication that can be available over networks with low capacity, since the bitrate needed for video communication is low. A visual image needs high quality and resolution to be semantically meaningful for communication. Delivering such video over networks requires that the video is compressed. The standard way to compress video images, used by H.264 and MPEG-4, is to divide the image into blocks and represent each block with mathematical waveforms, usually frequency features. These mathematical waveforms are quite good at representing any kind of video since they do not resemble anything in particular; they are just frequency features. But since they are completely general they cannot compress video enough to enable use over networks with limited capacity, such as GSM and GPRS. Another issue is that such codecs have high complexity, because redundancy is removed through positional shifts of the blocks. High complexity and bitrate mean that a device has to consume a large amount of energy for encoding, decoding and transmission of such video, and energy is a very important factor for battery-driven devices. These drawbacks of standard video coding mean that it is not possible to deliver video anywhere and anytime when it is compressed with such codecs. To resolve these issues we have developed a totally new type of video coding. Instead of using mathematical waveforms for representation we use faces to represent faces. This makes the compression much more efficient than if waveforms are used, even though the face model is person-dependent. By building a model of the changes in the face, the facial mimic, this model can be used to encode the images. The model consists of representative facial images and we use a powerful mathematical tool to extract this model, namely principal component analysis (PCA).
This coding has very low complexity since encoding and decoding consist only of multiplication operations. The faces are treated as single encoding entities and all operations are performed on full images; no block processing is needed. These features mean that PCA coding can deliver high quality video at very low bitrates with low complexity for encoding and decoding. With the use of asymmetrical PCA (aPCA) it is possible to use only semantically important areas for encoding while decoding full frames or a different part of the frames. We show that a codec based on PCA can compress facial video to a bitrate below 5 kbps and still provide high quality. This bitrate can be delivered on a GSM network. We also show the possibility of extending PCA coding to the encoding of high definition video.

List of publications

Peer review papers

Ulrik Söderström and Haibo Li, Very low bitrate full-frame facial video coding based on principal component analysis, Signal and Image Processing Conference (SIP'05), Honolulu, August 2005

Ulrik Söderström and Haibo Li, Full-frame video coding for facial video sequences based on principal component analysis, Proc. of Irish Machine Vision and Image Processing Conference 2005 (IMVIP 2005), pp. 25-32, Belfast, August 2005, Paper I

Ulrik Söderström and Haibo Li, Representation bound for human facial mimic with the aid of principal component analysis, under review, submitted December 2007, Paper II

Ulrik Söderström and Haibo Li, Eigenspace compression for very low bitrate transmission of facial video, IASTED International Conference on Signal Processing, Pattern Recognition and Applications (SPPRA'07), Innsbruck, February 2007, Paper III

Hung-Son Le, Ulrik Söderström and Haibo Li, Ultra low bit-rate video communication, video coding = facial recognition, Proc. of 25th Picture Coding Symposium, Beijing, April 2006, Paper IV

Ulrik Söderström and Haibo Li, Asymmetrical principal component analysis for video coding, Electronics Letters, Volume 44 (4), February 2008, pp. 276-277

Ulrik Söderström and Haibo Li, Asymmetrical Principal Component Analysis Theory and Its Applications to Facial Video Coding, submitted, August 2008, Paper V

Ulrik Söderström and Haibo Li, Side view driven facial video coding, submitted, September 2008, Paper VI

Ulrik Söderström and Haibo Li, High definition wearable video communication, submitted, September 2008, Paper VII

Other papers

Ulrik Söderström and Haibo Li, Emotion recognition and estimation from tracked lip features, Proc. of Swedish Symposium for Automated Image Analysis (SSBA'04), pp. 182-185, Uppsala, March 2004

Ulrik Söderström and Haibo Li, Emotion recognition and estimation from tracked lip features, Technical report DML-TR-2004:05

Ulrik Söderström and Haibo Li, Customizing lip video into animation for wireless emotional communication, Technical report DML-TR-2004:06

Greger Wikstrand and Ulrik Söderström, Internet card play with video conferencing, Proc. of Swedish Symposium for Automated Image Analysis (SSBA'06), pp. 93-96, Umeå, March 2006

Ulrik Söderström, Very low bitrate facial video coding based on principal component analysis, Licentiate thesis, September 2006

Ulrik Söderström and Haibo Li, Principal component video coding for simple decoding on mobile devices, Proc. of Swedish Symposium for Automated Image Analysis (SSBA'07), pp. 149-152, Linköping, March 2007

Ulrik Söderström and Haibo Li, Asymmetrical principal component analysis for encoding and decoding of video sequences, Proc. of Swedish Symposium for Automated Image Analysis (SSBA'08), Lund, March 2008

Contents

Acknowledgement

Abstract

Publications

List of Figures

Abbreviations and mathematical notations

1 Introduction and motivation
  1.1 Motivation
  1.2 Vision
  1.3 Research goals
  1.4 Research strategy
  1.5 Research process

2 Related work and technical solutions used
  2.1 Related Work
    2.1.1 Video compression
      2.1.1.1 Discrete Cosine Transform
      2.1.1.2 Block matching
      2.1.1.3 Chroma subsampling
    2.1.2 Facial representation with standard techniques
    2.1.3 Other implementations of PCA for video coding
    2.1.4 Scalable video coding
  2.2 Technical solutions
    2.2.1 Hands-free video equipment
      2.2.1.1 Modeling efficiency with the helmet
    2.2.2 Basic emotions
    2.2.3 Video sequences
    2.2.4 Locally linear embedding
    2.2.5 High definition (HD) video
    2.2.6 The face space and personal mimic space
    2.2.7 Quality evaluation

3 Principal component analysis video coding
  3.1 Principal component analysis
    3.1.1 Singular Value Decomposition
  3.2 Principal component analysis video coding
    3.2.1 Comparison with H.264
  3.3 Encoding and decoding time
  3.4 Theoretical bounds
    3.4.1 Distortion bound
    3.4.2 Rate-Distortion bound
    3.4.3 Comparison of the distortion bounds
  3.5 Eigenimage compression
    3.5.1 Quantization - uniform or pdf-optimized?
    3.5.2 Compression of the mean image
    3.5.3 Loss of orthogonality
    3.5.4 Compression methods
  3.6 Eigenspace re-use
    3.6.1 Sensitivity to lighting and positional shift

4 Ultra low bitrate video coding
  4.1 LLE smoothing

5 Asymmetrical Principal Component Analysis video coding
  5.1 Asymmetrical principal component analysis video coding
    5.1.1 Reduction of complexity for the encoder
    5.1.2 Reduction of complexity for the decoder
    5.1.3 Variance of areas and reconstruction change for asymmetrical principal component analysis video coding
    5.1.4 Experiments with asymmetrical principal component analysis video coding
      5.1.4.1 Case 1: Encoding with mouth, decoding with entire frame
      5.1.4.2 Case 2: Encoding with mouth and eyes, decoding with entire frame
      5.1.4.3 Case 3: Encoding with extracted features, decoding with entire frame
      5.1.4.4 Case 4: Find all edges for encoding, decoding with the entire frame
    5.1.5 Side view driven video coding
    5.1.6 Profile driven video coding

6 High definition wearable video communication
  6.1 High definition wearable video equipment
  6.2 Wearable video communication

7 Contributions, conclusions and future work
  7.1 Contributions and conclusions
  7.2 Future work

Bibliography

List of Figures

2.1 DCT coefficients.
2.2 Encoding and decoding with DCT.
2.3 A motion field generated from matching of frames A and B.
2.4 YUV 4:1:1 subsampling.
2.5 Assembled YUV frame.
2.6 Hands-free video equipment.
2.7 Example frames from video recorded without hands-free equipment.
2.8 Difference between using the hands-free helmet and using a fixed camera.
2.9 The six basic emotions.
2.10 Example frames from three video sequences.
2.11 Example frame from a HD video sequence.
2.12 The tilted mirror.
2.13 Frame from a video sequence shot in front of the mirror.
2.14 The first three dimensions of the personal mimic space for two facial mimics.
3.1 Comparison between a codec based on PCA and a H.264 codec at very low bitrates.
3.2 Example frames from H.264 encoding at 2.4 kbps.
3.3 Reverse water-filling for independent Gaussian principal components.
3.4 Mean rate-distortion bound for all 10 video sequences of facial mimic.
3.5 Comparison of the different coding schemes.
4.1 Ultra low bitrate video coding scheme.
4.2 Personal mimic face images mapped into the embedding space described by the first two coordinates of LLE.
4.3 The reconstructed frames shown in a two-dimensional LLE space.
4.4 The reconstructed frames shown in a two-dimensional LLE space with smoothing.
5.1 Top left: P=0, M=15; top right: P=10, M=15; bottom: P=15, M=15.
5.2 Variance image of the individual pixels.
5.3 Individual pixel PSNR.
5.4 The entire video frame and the foreground If used in Case 1.
5.5 Foreground with the eyes and the mouth.
5.6 Area decided with edge detection and dilation.
5.7 Area decided with edge detection on all frames.
5.8 Example frames of profiles relating to the side view.
5.9 The mean profile.
5.10 The first three Eigenprofiles (scaled for visualization).
6.1 Frame reconstructed with aPCA.
6.2 Frame encoded with H.264.

Abbreviations and mathematical notations

Abbreviations

PCA    Principal component analysis
aPCA   Asymmetrical principal component analysis
PSNR   Peak signal to noise ratio
mse    Mean square error
SVD    Singular value decomposition
HD     High definition
LLE    Locally linear embedding
DCT    Discrete cosine transform
DP     Dynamic programming

Mathematical notations

Φ      Eigenspace
φ      Eigenvector, principal component
I      Original data as vectors
C      Original data as images
I0     Mean of data
G      A different video sequence than I
If     Foreground image
Is     Side view
Ifr    Frontal view
Xpr    Side view profile
{α}    Projection coefficients
N      Total number of principal components in a sequence
M      The number of principal components used for decoding
P      The number of principal components used for background reconstruction
K      Total number of pixels in a frame (YUV subsampled)
λj     Eigenvalues
bij    Eigenvectors
Φp     Pseudo Eigenspace
Φpbg   Pseudo Eigenspace for the background
h      Number of pixels in horizontal direction
v      Number of pixels in vertical direction
Di     Displacement in block-based motion estimation

1 Introduction and motivation

1.1 Motivation

The motivation for this work dates back to the beginning of 2003, when I started my employment as a Ph.D. student. Visual communication can provide a great improvement in communication, but it is not available in many cases (in 2003 it was rarely available). Most distance communication consists of audio only, and we wanted to extend audio communication to audio-visual communication since a large part of the information in communication comes from non-verbal cues. Video requires a minimum quality to improve communication; with poor video quality, or when it is used incorrectly, it will actually degrade the communication quality. So, the motivation was to provide video communication with high quality for everyday distance communication. This communication often takes place over low capacity networks, meaning that the video needs to be compressed to very low bitrates to be used on these networks. The video compression standards that are available today (and were available in 2003) cannot provide video with sufficient quality at these bitrates. The need for a new compression scheme is obvious.

For any kind of data it is preferable to use as low a bitrate as possible for transmission. For communication data there are especially high demands on transmission time; communication requires online functionality. A video codec that uses low bitrates is much less sensitive to delay and other errors than video that needs a higher bitrate. The cost of transmission is also lowered when fewer bits are needed. This cost might be the price the user has to pay for a service, but it might also be the transmission cost measured in bits. Both of these can be lowered if less data needs to be transmitted. So the original motivation for this work is to create a video codec that can function at low bitrates, at low cost and with low sensitivity to errors. Over the course of time we also wanted to use high definition (HD) resolution for the video; always maintaining an extremely low bitrate for this kind of video is, however, beyond the scope of this thesis work. It is possible to encode HD video to very low bitrates, but this will be examined in the future.

1.2 Vision

My vision for the future is that high quality video can be provided anywhere and anytime regardless of the network and environment.

1.3 Research goals

This work addresses a very fundamental problem: video compression, and more specifically video compression at very low bitrates. The techniques that are available are not sufficiently effective to encode high quality video at very low bitrates and there is a need for a new technique. We propose that this technique is principal component analysis. We have specified more concrete goals for low bitrate video coding:

• The bitrate for the video transmission should be lower than 5 kbps.
• The objective quality of the reconstructed video should be approximately 35 dB (measured in PSNR).
• The resolution of the video should be approximately 320x240 pixels (CIF resolution).
• The video codec should be able to encode and decode video in real-time; a requirement for online usage.
• The video codec should be able to compress facial video sequences, not arbitrary video sequences.

These are the goals we have for most of our work, but we also want to encode HD video, so the resolution goal is not valid for all research projects. The bitrate and objective quality goals are also altered for HD video.
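The objective quality goal above is stated in terms of PSNR, which is computed from the mean square error between the original and reconstructed frames. As a point of reference only (the work in this thesis was implemented in Matlab; the sketch below uses Python/NumPy and is not the thesis code), PSNR for 8-bit video frames can be computed as follows:

```python
import numpy as np

def psnr(original: np.ndarray, reconstructed: np.ndarray) -> float:
    """Peak signal-to-noise ratio in dB for 8-bit frames (peak value 255)."""
    mse = np.mean((original.astype(np.float64) - reconstructed.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return 10.0 * np.log10(255.0 ** 2 / mse)
```

Reaching the 35 dB goal corresponds to a mean square error of roughly 20 per pixel for 8-bit data.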

1.4 Research strategy

The goals we have set for video compression are quite challenging to reach. To succeed we need a good strategy for achieving these goals, and we have used the following strategies.

1. Instead of using waveforms to represent images we use facial images for representation.
2. We decided to model the changes within a face; the facial mimic.
3. We make use of hands-free video equipment to remove the global motion in the video.

4. Our first implementation was made to see that the idea actually works. Before continuing further we wanted to see the theoretical potential of the coding.
5. We wanted to focus on the main problem: facial mimic modelling and facial encoding with this model. We do not use advanced image analysis techniques, like warping for facial alignment, on the videos. We use the hands-free video equipment to circumvent several issues and concentrate on the main issue.
6. We decided that we would build a model of the mimic that contains all the possible facial changes directly. It is also possible to evaluate the performance of the model continuously and update it.
7. We are not interested in creating an efficient system for the coding but rather in evaluating the idea. We have therefore implemented most of the research in Matlab; an environment which is good for matrix (image) operations but quite slow.

1.5 Research process

All research presented in this thesis has evolved through discussions with my supervisor, Prof. Haibo Li. Our first implementations dealt with facial animations, since we thought that we could not achieve real video transmission at such low bitrates [1,2]. After we found that natural video is superior to animations we tried to create low bitrate video instead of low bitrate facial representation through animations. Haibo Li had tested low bitrate representation of facial video when he was a student, back in 1993 [3]. We continued to develop his idea since it built on real video. We found that the American psychologist Paul Ekman had shown that facial expressions can be modelled in a very efficient way [4,5]. We wanted to incorporate this modelling in a coding scheme and chose to make use of principal component analysis [6] for the modelling, since it is the most efficient modelling technique available. We developed a coding scheme based on principal components: we extract a model of facial mimic through principal component analysis and then use this model for encoding and decoding of video (Paper I). When we evaluated the implementation of principal component video coding we found that we had reached our goals for encoding and decoding. The coding scheme can encode facial video with a resolution of 240x176 pixels in real-time, and the video can also be reconstructed in real-time with a quality higher than 35 dB. When we implemented the coding scheme we made use of uniform quantization. We wanted to know exactly how low a bitrate could be achieved and what quality could be reached when an error-free representation is used. So we examined boundaries for coding through optimal bit allotment and boundaries for an unlimited representation (Paper II).

Together with Greger Wikstrand I examined the use of video for conversational multimedia [7]. The results were disappointing at first, but they implied that video has such high impact that it must be used wisely; otherwise video might actually degrade the user experience. Low bitrate video coding has a major role to play here, since low bitrates can be used to improve the quality of visual communication. A problem that needed to be solved is the practical usage of the principal components; they are needed by both the encoder and decoder to enable video coding. They can be transmitted between the encoder and decoder, but this requires a high bitrate since the components are images. We have examined how to update the principal components and how to transmit new components (Paper III). Together with Hung-Son Le we investigated how it is possible to reuse images through face recognition. In this work we show how we can use locally linear embedding (LLE) to provide decoded video with a more natural appearance (Paper IV). To avoid transmission of new or updated Eigenspaces we examined how the Eigenspaces can be used again. There are two difficulties in achieving this:

1. The facial features are positioned at different pixel positions in new video frames compared to the Eigenspace.
2. The facial features have different pixel intensities due to different lighting and shading of the face in the videos.

The lighting problem is not as severe as the positional problem; it can be handled by normalizing the light. The positional problem can be solved either by aligning the new video frame so it matches the existing Eigenspace or by aligning the Eigenspace so that it matches the new video frame. We have evaluated alignment of video frames to match existing Eigenspaces through affine transformation [8]. Affine transformation is not enough to overcome the positional differences between videos. We used PCA for encoding of an arbitrary video [9]. The idea was to make use of the simple encoding and decoding that is associated with PCA. Simple decoding can save energy for a device which is powered by batteries, and a simple coding scheme can be used without device-specific implementations. The very low bitrates are lost with this implementation because of the need to transmit new Eigenimages for each clip; almost the entire bitrate consists of Eigenimage transmission. Even though the idea was to create high-quality video, the results show that this kind of video can be used as a low-quality preview. A reconstruction with few Eigenimages does not give a high-quality version of a video, but the motion in the video is apparent. The content of the video is visible to a user at a very low bitrate cost. We wanted to improve the modelling of the facial mimic and at the same time reduce the complexity of encoding and decoding. To do this we introduced a coding scheme that can use a part of the frame for encoding and decode the entire frame or a different part of the frame.

This scheme is called asymmetrical principal component analysis (aPCA) and it is described in Paper V. aPCA allows faster updating of Eigenspaces since it reduces the amount of data that needs to be updated. We have also shown how it is possible to use the side view of a face for encoding and decode the frontal view of this face. This is a totally new way of treating video coding since the part that is decoded is not even used for encoding (Paper VI). In this work we also use only the profile of the side view for encoding but still decode the entire frontal view. In the end, all these contributions result in a video compression scheme that works in real-time for facial video sequences and delivers high quality at extremely low bitrates. We have also extended the coding scheme to high-definition video material. It is still facial video but the resolution is much higher, e.g. 1080i (1080 interlaced lines). The coding scheme then works at a bitrate which is low considering the HD content but very high compared to the other work we have produced (Paper VII).

2 Related work and technical solutions used

2.1 Related Work

This work is about video compression that can be decoded in layers, so the related work spans video coding and scalable video coding. General video coding with standard methods is described in section 2.1.1 and video compression based on PCA is described in section 3.2. Scalable video coding is discussed in section 2.1.4. Harashima et al. [10] provide a differentiation of video coding into generations; this division is shown in Table 2.1. Coding based on the discrete cosine transform (DCT) is regarded as 1st generation video coding, where the main idea behind the compression is to remove redundancy. H.264 uses several other techniques but still compresses the video by removing redundancy, both in the information itself and in information which is redundant for human observers. The 4th generation video compression techniques work by treating parts of the video as individual objects, e.g., background, house, car or face. Compression of each individual object can then be specialized with techniques that work well for compression of the different object types. Torres and Delp [11] explain how this coding can function and they also explain how PCA can be used for compression of facial images in video sequences. Encoding of faces with PCA can be regarded as 4th generation video coding.

Coding generation   Approach                         Technique
0th generation      Direct waveform coding           PCM
1st generation      Redundancy removal               DPCM, DCT, DWT, VQ
2nd generation      Coding by structure              Image segmentation
3rd generation      Analysis and synthesis           Model-based coding
4th generation      Recognition and reconstruction   Knowledge-based coding
5th generation      Intelligent coding               Semantic coding

Table 2.1: Classification of image and video coding.

2.1.1 Video compression

The purpose of video compression is to reduce the quantity of information that is needed to represent a sequence of images. Video compression can be lossless or lossy, where lossy compression of course provides a larger data reduction than lossless compression. Lossy video compression achieves good visual results by working on the premise that much of the information available before compression is redundant for a human observer and can be removed without loss in visual appearance. Lossy compression can be perceived just as well by a human observer as lossless compression. Not many video codecs are lossless; there is no point in retaining all information, since some information has no semantic meaning and does not improve the perceptual quality. Most standard video codecs rely on transform coding and motion estimation to encode video. The reigning transform technique is the discrete cosine transform (DCT) (section 2.1.1.1). Motion estimation between frames that are adjacent or close in time is performed through block matching (section 2.1.1.2). Images in a video sequence are encoded differently; there are intracoded frames and intercoded frames. An intracoded frame is only dependent on itself; it is coded based only on its content. An intercoded frame is dependent on a previous, and perhaps subsequent, intracoded frame. An intracoded frame is compressed as an image, which it is. Intercoded frames encode the differences from the previous frame. Since frames which are adjacent in time usually share large similarities in appearance, it is very efficient to store only one frame and the differences between this frame and the others. Only the first frame in a sequence is encoded purely with DCT. For the following frames only the changes between the current and first frame are encoded. The number of frames between intracoded frames is called the group of pictures (GOP). A large GOP size means fewer intracoded frames and a lower bitrate. Standard video codecs include the MPEG standards and the H.26x codecs, where MPEG-4/AVC is the same as H.264 [12,13,14,15,16]. The main parts of these standards (DCT, motion estimation and chroma subsampling) are described in the following sections. There are two major differences between the way we use PCA and traditional video coding with the discrete cosine transform (DCT):

1. PCA is used for encoding of full frames while DCT employs block-based processing. DCT can be used for full frame encoding and PCA can be used for block-based encoding as well.
2. PCA is signal-dependent while DCT is independent of the input signal. This means that PCA requires different bases depending on the data it should model, but DCT models every kind of data with the same bases.

2.1.1.1 Discrete Cosine Transform

The discrete cosine transform (DCT) describes a sequence of data as a sum of cosine functions with different frequencies. Cosine functions are more efficient for signal approximation than sine functions, and they also have desirable boundary conditions when it comes to differential equations.

DCT is similar to the Fourier transform but uses only real numbers and cosine functions; the DFT has complex numbers and sine functions as well as cosine functions. DCT expresses the signal in terms of sinusoids with different frequencies and amplitudes. With a DCT it is possible to evaluate the signal even at places where there was no input from the beginning; the DCT provides an extension of the signal. Two-dimensional DCTs, which are used for image compression, can be computed as separable products of two one-dimensional transforms, one in the horizontal direction and one in the vertical direction, but the transform can also be performed in a single step. The coefficients for the two-dimensional DCT are shown in Figure 2.1. Each step from left to right and top to bottom is an increase in frequency by 1/2 cycle. DCT works on blocks; the input image is divided into 8x8 blocks. Each block is then transformed to a linear combination of the 64 frequency squares; DCT is, just as PCA, a linear transform. This transform can be inverted without error. Compression through DCT comes from quantization of the frequencies, with different quantization steps for each frequency square. In this way it is possible to weight the importance of the different frequencies; low frequencies are always more important than high frequencies for image content. The encoding and decoding steps of DCT are shown in Figure 2.2. The DCT bases are completely general; DCT can be used to encode any kind of signal. The transform coefficients are therefore not dependent on the signal, but the energy compaction is degraded compared to a signal-dependent transform like PCA.

Figure 2.1: DCT coefficients.

Figure 2.2: Encoding and decoding with DCT.
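To make the block transform concrete, the following is a minimal sketch of an orthonormal 2-D DCT applied to a single 8x8 block, followed by a uniform quantization step (Python/NumPy, for illustration only; it is not one of the codec implementations discussed here, and the single quantization step size is a hypothetical value, whereas real codecs use a per-frequency quantization table as described above):

```python
import numpy as np

def dct_matrix(n: int = 8) -> np.ndarray:
    """Orthonormal DCT-II basis matrix C; C @ x is the 1-D DCT of x."""
    k = np.arange(n)
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)
    return c

def dct2(block: np.ndarray) -> np.ndarray:
    """Separable 2-D DCT: transform the columns, then the rows."""
    c = dct_matrix(block.shape[0])
    return c @ block @ c.T

def idct2(coeffs: np.ndarray) -> np.ndarray:
    """Inverse 2-D DCT (C is orthogonal, so its transpose is its inverse)."""
    c = dct_matrix(coeffs.shape[0])
    return c.T @ coeffs @ c

block = np.random.randint(0, 256, (8, 8)).astype(np.float64)  # one 8x8 image block
coeffs = dct2(block)                 # 64 frequency coefficients
step = 16.0                          # hypothetical uniform quantization step
quantized = np.round(coeffs / step) * step
reconstructed = idct2(quantized)     # lossless only if no quantization is applied
```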

2.1.1.2 Block matching

Block matching is a technique for motion estimation. The motion between two frames is calculated and motion vectors that describe the motion are stored. This generates a motion field for the displacements between the frames. How a motion field relates two frames to each other is visualized in Figure 2.3.

Figure 2.3: A motion field generated from matching of frames A and B.

The term block matching comes from the fact that motion vectors are calculated for blocks, usually blocks of size 8x8. Block matching is performed based on pixel intensities; the matching score between two blocks is the sum of pixel differences between the blocks. This sum is usually the squared sum:

cost_{AB} = \sum_{i=1}^{S} \sum_{j=1}^{S} (blockA_{ij} - blockB_{ij})^2    (2.1)

where S is the size of the blocks A and B. It is possible to represent frame B with the pixel intensities from frame A and the motion vectors between them. The changes in each block are also stored to enable interblock changes; without these changes a block cannot change appearance between frames, it can only change position. Block matching provides high compression since adjacent frames often share large similarities. When there is a large change between frames it is not useful to employ block matching.
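A minimal sketch of an exhaustive block-matching search using the cost of equation (2.1) is given below (Python/NumPy, illustration only; the 8x8 block size follows the text above, while the ±7 pixel search range is an assumption, not a value taken from a particular codec):

```python
import numpy as np

def block_cost(block_a: np.ndarray, block_b: np.ndarray) -> float:
    """Matching score of equation (2.1): the squared sum of pixel differences."""
    diff = block_a.astype(np.float64) - block_b.astype(np.float64)
    return float(np.sum(diff ** 2))

def best_motion_vector(frame_a, frame_b, top, left, size=8, search=7):
    """Find the displacement (dy, dx) so that the block at (top, left) in
    frame_b is best represented by a block in frame_a (exhaustive search)."""
    block_b = frame_b[top:top + size, left:left + size]
    h, w = frame_a.shape
    best_cost, best_vec = np.inf, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            if 0 <= y and y + size <= h and 0 <= x and x + size <= w:
                cost = block_cost(frame_a[y:y + size, x:x + size], block_b)
                if cost < best_cost:
                    best_cost, best_vec = cost, (dy, dx)
    return best_vec, best_cost
```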


Figure 2.4: YUV 4:1:1 subsampling

2.1.1.3 Chroma subsampling

A step which is used in any video compression scheme is chroma subsampling. The original video frames are represented in the RGB color space. This color space can be converted without loss into the YUV color space. The conversion is performed with equation (2.2):

Y = 0.299R + 0.587G + 0.114B
U = -0.147R - 0.289G + 0.436B
V = 0.615R - 0.515G - 0.100B    (2.2)

Conversion between the YUV and RGB color spaces is possible in both directions and without loss of any information. The YUV color space has a very important feature: it is based on the perception of the human eye. The YUV color space has three components, where Y is based on luminance while U and V are based on chrominance. The human eye is much less sensitive to changes in grayscale (luminance) than in color (chrominance) [17], so U and V can be compressed without any visible effect for a human observer. This procedure is called subsampling since only a few values of U and V are retained. U and V are first low pass filtered and then subsampled. There are many different ways to perform chroma subsampling and we have chosen to use YUV 4:1:1 subsampling. With this subsampling method only every fourth value of U and V is stored together with all Y values. The horizontal color resolution is reduced with this method but the vertical chroma resolution is not affected. The pixel information that is retained is organized into a single frame (Figure 2.4). In this way the frame size is reduced to half the original size, almost without any loss in quality visible to a human observer. The decrease in quality is still measurable with peak signal-to-noise ratio. Since we use YUV 4:1:1 subsampling, the size in the horizontal direction is reduced. A video with a horizontal resolution of 240 pixels actually has 720 values in the horizontal direction, since a three-component color space is used. After chroma subsampling the number of values is reduced to 360 per row.
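A minimal sketch of the color conversion in equation (2.2) followed by 4:1:1 subsampling is shown below (Python/NumPy, for illustration only; the low pass filtering of U and V mentioned above is omitted for brevity):

```python
import numpy as np

def rgb_to_yuv(rgb: np.ndarray) -> np.ndarray:
    """Apply equation (2.2) to an H x W x 3 RGB frame."""
    m = np.array([[ 0.299,  0.587,  0.114],
                  [-0.147, -0.289,  0.436],
                  [ 0.615, -0.515, -0.100]])
    return rgb.astype(np.float64) @ m.T

def subsample_411(yuv: np.ndarray):
    """YUV 4:1:1: keep all Y values but only every fourth U and V value in the
    horizontal direction; the vertical chroma resolution is untouched."""
    y = yuv[:, :, 0]
    u = yuv[:, :, 1][:, ::4]
    v = yuv[:, :, 2][:, ::4]
    return y, u, v

frame = np.random.randint(0, 256, (176, 240, 3))   # H x W x RGB stand-in frame
y, u, v = subsample_411(rgb_to_yuv(frame))
# Per row: 240 Y + 60 U + 60 V = 360 values instead of 3 * 240 = 720,
# i.e. half the original size, as described above.
```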

Figure 2.5: Assembled YUV frame

The organization of the Y, U and V components has no importance for modeling of the mimic with PCA or compression of the model with PCA. It does, however, have a large impact on compression of the Eigenimages with JPEG and H.264. The compression for these methods is affected by the assembling of the pixels. JPEG and H.264 are more efficient at compressing similar areas; similarity is measured as difference in pixel intensity. By organizing all Y pixels adjacent to each other the change in intensity is minimized. The same is valid for U and V, so the compression of the images is made as efficient as possible. An example of how the pixels are organized is shown in Figure 2.5. The subsampled color information can be used to create a full YUV frame by adding more U and V components. From this frame a reconstructed video frame in RGB format can be attained.

2.1.2 Facial representation with standard techniques

Images of faces can be represented as a combination of bases. Most modern video codecs are based on the discrete cosine transform (DCT) and block matching. Video coding based on DCT and block matching is described in section 2.1.1.1 and section 2.1.1.2. This kind of video coding does, however, require bitrates higher than most low bandwidth networks can provide, at least when the quality of the video transmission is high. The bitrate for video encoding through DCT and block matching is highly dependent on the number of frames between each frame encoded only through DCT, the Group Of Pictures (GOP) size. Video of a face shares large similarities between all frames, so the GOP size can be kept very high and the bitrate low. Still, DCT and block matching require several DCT coefficients to encode the first frame in a GOP and several possible movements of the blocks between the frames. Consequently, the best codec available today does not provide high quality video at very low bitrates even if the video is suitable for high compression. A comparison of H.264 and PCA video coding is found in section 3.2.1. Representation of facial images through DCT coefficients is clearly not sufficient when very low bitrates are wanted, so there must be other ways to represent facial images. One way is to represent the images as a collection of features from an alphabet. The idea can easily be visualized by the letters of the Latin alphabet: 26 letters, or features, are sufficient to model all the words in the English language. By building an alphabet of video features it should be possible to model all video frames as a combination of these features. A technique that uses such an alphabet is Matching Pursuit (MP) [18].

The encoder divides the original video image into features from the alphabet, and a very low bitrate is achieved by transmitting only information about which features are used to the decoder, which uses the features to reconstruct the video frame. This technique uses the same idea for representation of facial images as DCT; the image is represented as a combination of features. The difference lies in what these features are. Images of a face can also be represented in other ways than by a combination of features. Several techniques make use of a wireframe to model faces. The wireframe has the same shape as a face, and to make it look more natural it can be texture-mapped with a real image of a face. To make the face move and change appearance between video frames it is enough to transmit the changes in the wireframe; consequently these techniques achieve a facial representation at very low bitrates. Techniques that make use of a wireframe to model facial images are, for example, MPEG-4 facial animation [19] and model based coding [20,21]. These techniques reach very low bitrates while retaining high spatial resolution and framerate. A method that relies on a statistical model of the shapes and gray-levels of a face is the Active Appearance Model (AAM) [22]. AAMs are statistical models of the shape of an object. The models are iteratively deformed so that they fit the object. For facial coding a model of the facial features is mapped onto a facial image. The model cannot vary in any possible way; it is constrained by the changes which occur within a training set. Depending on the statistics that are used with the training set it is possible to allow the model to vary more or less than the training set. To make the fitting more robust the model is combined with models of the appearance around all the points in the model. These models work on the statistical variation of gradients or lines. AAM always starts with an initial estimate of the model and ends when the fit is considered good enough, often decided with an error threshold. All of these techniques have drawbacks that are critical for efficient usage in visual communication. Pighin et al. provide a good explanation of why high visual quality is important and why video is superior to animations [23]. The face simply exhibits so many tiny creases and wrinkles that it is impossible to model them with animations or low spatial resolution. Therefore any technique based on animation or texture-mapping to a model is not sufficient. Some approaches have focused on retaining the spatial quality of the video frames at the expense of frame rate. Wang and Cohen presented a solution where high quality images are used for teleconferencing over low bandwidth networks with a framerate of one frame every 2-3 seconds [24]. The idea of using a low framerate is, however, not acceptable, since both high framerate and high spatial resolution are important for many visual tasks. According to Lee and Eleftheriadis different facial parts have different encoding needs [25]. The sense of having eye contact is sensitive to low resolution while lip synchronization is sensitive to low framerate. Therefore it is not sufficient to provide either high framerate or high spatial resolution; both are important. Any technique that wants to provide video at very low bitrates must be able to provide video with high spatial resolution, high framerate and a natural-looking appearance.

2.1.3 Other implementations of PCA for video coding

There are some previous implementations that use principal component analysis for video coding. These implementations aim at encoding facial sequences at very low bitrates with high quality. Torres et al. have published several articles about using PCA for encoding of the facial parts of video frames [11,26,27,28,29]. Their implementation is conveyed within the MPEG-7 standard [30], where the video frame is divided into objects which can be encoded separately. They encode the facial parts of the frames through PCA and the rest of the objects with other techniques. Their implementation of PCA is called Adaptive PCA and they use it in combination with DCT coding. The first image in a sequence is coded with DCT intracoding. Each following frame is projected onto an Eigenspace for encoding. If the reconstruction from the Eigenspace satisfies a certain quality, this is the only coding of the facial image that is used. If the reconstruction result is not good enough the frame is intercoded with DCT using the first frame as reference. The DCT encoded image is added to the Eigenspace so that the representation of the images is improved. Crowley et al. name their implementation orthonormal basis coding [31,32,33,34], but they use PCA since it is based on a combination of orthonormal bases. They make use of a threshold method to extract the bases that are used for encoding and the results are achieved for video sequences of 400 frames. They do not use a full PCA calculation of the bases but instead a threshold method. The normalized cross-correlation is computed between the first image and a new frame. When the cross-correlation is below a threshold the new image is added to the Eigenspace. The cross-correlation is then calculated between this image and the subsequent images. When this cross-correlation drops below the threshold the new image is added to the bases, and so on. To do this in a more efficient way they compare the cross-correlation between the first image and all the other images. Very similar images are placed in the same set as the first frame. Reasonably similar images are placed in a new set and images which are not similar are placed in a third set. The third set is then compared to an image in the second set and either grouped as very similar, reasonably similar or completely different. This procedure continues until all images are grouped in sets and a representative image from each set is chosen as a basis for the Eigenspace.
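The common core of these PCA-based schemes is to project a frame onto an Eigenspace, transmit the projection coefficients, and reconstruct the frame from them; the Adaptive PCA scheme additionally checks whether the reconstruction quality is good enough. The sketch below illustrates that core in Python/NumPy; it is an illustration under assumed parameters (the number of components and the quality threshold are hypothetical), not any of the published implementations.

```python
import numpy as np

def build_eigenspace(frames: np.ndarray, n_components: int):
    """frames: one vectorized frame per row. Returns the mean image and the
    first n_components principal components (rows of V from the SVD)."""
    mean = frames.mean(axis=0)
    _, _, vh = np.linalg.svd(frames - mean, full_matrices=False)
    return mean, vh[:n_components]

def encode(frame, mean, eigenspace):
    return eigenspace @ (frame - mean)        # projection coefficients

def decode(coeffs, mean, eigenspace):
    return mean + eigenspace.T @ coeffs       # reconstructed frame

def good_enough(frame, mean, eigenspace, psnr_threshold=32.0):
    """Quality check in the spirit of Adaptive PCA (threshold is hypothetical)."""
    rec = decode(encode(frame, mean, eigenspace), mean, eigenspace)
    mse = np.mean((frame - rec) ** 2)
    return 10.0 * np.log10(255.0 ** 2 / mse) > psnr_threshold
```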

2.1.4 Scalable video coding

Scalable video coding (SVC) enables encoding and transmission of parts of a video sequence divided into partial bitstreams. The partial bitstreams are designed to provide video with lower temporal resolution, spatial resolution and/or visual quality when fewer bitstreams than possible are used. As more bitstreams are added, the quality and resolutions can be improved. This enables heterogeneous decoders to receive as many bitstreams as the network allows or the device can make use of. Scalability refers to a meaningful removal of certain parts of the video. Scalable video can provide a solution to the use of heterogeneous networks and devices, since the video can be optimized for transmission with several different parameters.

The goal for SVC is to encode video into several layers which can be decoded with the same complexity and reconstruction quality as if the video were encoded into a single layer, without using any more bits. This is not really possible, so a complexity, quality and bitrate similar to one-layer encoded video is a more realistic goal. Degradation of the video can be needed because of restrictions in bitrate, video format, power usage or large transmission losses in the network. The encoder must encode the video into layers for a decoder to be able to decode the video in a scalable manner. The decoder can only choose from the layers that the encoder has created; it cannot select an arbitrary reconstruction by itself. Scalable video encoding has been standardized for quite some time; it is part of several older standards [13,14]. But the scalable features of these standards are seldom used, since scalability is achieved only with large losses in complexity and quality, even if the bitrate for scalable video is similar to single-layer encoded video. The current state-of-the-art standard for video coding, H.264, has scalable extensions, meaning that video can be layer encoded with this technique. Schwartz et al. provide an extensive introduction to scalable video within the H.264 standard [35]. They compile a list of requirements for scalable video:

• Similar coding efficiency for layered encoding as for single-layer encoding. This should be valid for each subset of the scalable bit stream.
• Small increase in decoding complexity compared to decoding single-layer encoded video.
• Possibility of scaling in quality, resolution and framerate.
• Support of a backward compatible base layer; the layer-encoded video shall function with single-layer decoders.
• Simple bitstream adaptations of the encoded video. It should be possible to design the bitstream based on the layers that the decoder wants to receive.

A scalable video compression scheme should meet all of these requirements to be able to provide scalable video over different networks and to devices with different needs and/or functionality.

2.2 Technical solutions

2.2.1 Hands-free video equipment

The fact that it is mostly the facial mimic that changes when a person communicates verbally can be exploited in compression algorithms to reach very low bitrates. By removing the global motion of the head we can concentrate on modeling the local motion, i.e., the facial mimic.

Consequently, we have used hands-free video equipment both to allow the user to use both hands and to make the encoding more efficient. We have used two different systems (Figure 2.6); they will from now on be referred to as the helmet and the backpack. The helmet emulates a hands-free video system with automatic normalization. Video sequences recorded with the helmet consist of the user's face or head and shoulders. To emulate such a solution for video communication we use a construction helmet and a web camera. A light metal arm is attached to the helmet and the web camera is attached to the metal arm. The camera is positioned so that it films the face of the person wearing the helmet. The helmet ensures that the facial features are positioned at approximately the same pixel positions from the beginning. The backpack consists of a backpack, an aluminium arm and a mounting for a camera at the tip of the arm. This equipment is built by the company Easyrig AB [36]. A major difference between the helmet and the backpack is that for the helmet the camera follows the head motion and ensures that the face is filmed at a near frontal view. The backpack follows the motion of the wearer's back, and normalization of the facial features is not automatic. The face can be moved independently of the camera and the face is not always filmed at a near frontal view. But the backpack allows the user to have free hands and to be able to move quickly without placing the camera on a tripod or on the ground.

2.2.1.1 Modeling efficiency with the helmet

To show how the hands-free equipment improves the modeling and coding efficiency we have performed a test with the helmet. We have recorded a video sequence where the user is displaying Ekman's six basic emotions, but it is recorded without the use of the hands-free helmet. The video sequence is recorded with a fixed camera, and when the user moves and rotates, the position and angle of the person's head change in the video. Example frames are shown in Figure 2.7. The contents of the video sequences are not exactly the same, so they cannot be compared objectively, e.g., with PSNR. We use the bound from section 3.4.1 for comparison. This bound describes the efficiency of a facial mimic model extracted from a video sequence by stating the highest quality that can be achieved by using a certain number of model dimensions. The bound is still measured objectively, but the different video sequences can be compared. It is clear from the comparison (Figure 2.8) that the theoretical quality of sequences recorded with the hands-free video equipment is much higher than that of the sequences recorded without it.

2.2.2 Basic emotions

The creation of a model for facial mimic requires that it is possible to create a space which is spanned by some representative facial expressions. This idea has been addressed within the psychological community. According to the American psychologist Paul Ekman, all facial emotions consist of blended versions of only six basic emotions [4,5].

Figure 2.6: Hands-free video equipment: (a) helmet; (b) backpack.

Figure 2.7: Example frames from video recorded without hands-free equipment.

Figure 2.8: Difference between using the hands-free helmet and using a fixed camera. – Hands-free helmet - - Fixed camera

Figure 2.9: The six basic emotions: (a) happiness, (b) sadness, (c) surprise, (d) fear, (e) anger, (f) disgust.

They are visualized in Figure 2.9. Ekman later added a neutral expression to his definition, so that 7 emotions are needed. The neutral expression is, however, contained in the transitions between the other emotions, so there is no need to model it separately. By modelling the 6 basic emotions you actually model all possible facial emotions.

2.2.3 Video sequences

Several experiments in this work are based on the same 10 video sequences recorded with the helmet (Section 2.2.1). Each of these video sequences shows the face or head-and-shoulders of a person. The person is displaying the six basic emotions which Ekman has proposed. Between the emotions the person in the video returns to a neutral facial expression. Each video sequence is approximately 30 seconds long and a new emotion is displayed every fifth second. Each emotion is displayed for 2-4 seconds, consisting of a change from a neutral expression to a basic emotion and a return to a neutral expression. The spatial resolution of the video sequences is 240x176 pixels and the framerate is 15 fps. The values for framerate and resolution are chosen to approximately match the settings used in 3G video telephony in 2003. Two different backgrounds are used for the video sequences. Eight of the sequences are shot against a white background and two of the sequences are shot against a non-homogeneous background. Example frames from some of the video sequences are shown in Figure 2.10.


Figure 2.10: Example frames from three video sequences.

These sequences are used for calculation of theoretical bounds, evaluation of PCA video coding and Eigenspace updating. They are also used for asymmetrical principal component analysis. Four of these sequences show the same person recorded at different time instances and they are used for examination of Eigenspace re-use. We also make use of video sequences that show one person while he is communicating verbally. These video sequences are 2 minutes long. The facial mimic in these sequences is less expressive than the mimic in the previously described video sequences, but they have the same framerate and spatial resolution (15 fps, 240x176 pixels). We furthermore use 5 video sequences recorded with the backpack which have a high definition (HD) resolution of 1440x1080 pixels (Figure 2.11). These video sequences have a framerate of 25 fps and show one person while he is communicating or displaying the basic emotions. For side view and profile encoding we have recorded video sequences without any hands-free video equipment. These video sequences are shot in front of a mirror tilted 90 degrees (Figure 2.12), which allows us to record both the frontal and side view in the same video with a correspondence between the two views (Figure 2.13). To verify the usability of the hands-free video helmet (section 2.2.1) we use two video sequences recorded without any hands-free equipment. These sequences show a person displaying the basic emotions but where the head is free to move independently of the camera (example frames can be found in section 2.2.1.1). The sequences have the same framerate and resolution as the video sequences recorded with the helmet; 15 fps and 240x176 pixels.

2.2.4 Locally linear embedding

When sequential data is evaluated frame by frame the individual results may be good while the sequence as a whole is not satisfactory; unnatural scene transitions can occur. For video the result will be frames with high individual quality but a sequence that is jerky and unnatural. To generate natural-looking video the transitions between the frames also have to be considered, as a cost for smoothness.


Figure 2.11: Example frame from an HD video sequence.


Figure 2.12: The tilted mirror.

Figure 2.13: Frame from a video sequence shot in front of the mirror.


A dimension reduction tool can measure the similarity between two adjacent frames. The non-linear dimension reduction tool called locally linear embedding (LLE) [37] is well suited for this task. LLE is an unsupervised learning algorithm that computes low-dimensional embeddings of high-dimensional data. The low-dimensional embedding preserves the neighborhood structure of the data and LLE can learn the global structure of nonlinear manifolds. We use LLE to embed the changes in the face, i.e., the facial mimic, in a very low-dimensional space. This space should reflect the structure of the facial mimic. The approach is to project two frames into an LLE space and use the distance there to measure the similarity between the frames:

$$ S(\hat{I}_{i-1}(x), J_k(x)) = D\bigl(L(\hat{I}_{i-1}(x)), L(J_k(x))\bigr) \qquad (2.3) $$

where S() is a cost function, L() is the LLE projection operation and D() is a distance measure in the locally linear embedding space. I is an input frame and J is an image from the gallery. By enforcing smooth transitions we reduce the unnatural transitions between the frames.
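As a rough illustration of equation 2.3, the following Python sketch embeds a gallery of vectorized face images with scikit-learn's LocallyLinearEmbedding and measures the distance between two frames in that space. The gallery file, the neighborhood size and the use of scikit-learn are assumptions made for the example, not choices taken from the thesis.

import numpy as np
from sklearn.manifold import LocallyLinearEmbedding

# gallery: (num_images, num_pixels) array of vectorized face images (assumed to exist)
gallery = np.load("gallery.npy")

lle = LocallyLinearEmbedding(n_neighbors=10, n_components=2)
lle.fit(gallery)                                   # learn the low-dimensional mimic structure

def smoothness_cost(frame_a, frame_b):
    """D(L(a), L(b)): Euclidean distance between two frames in the LLE space."""
    a = lle.transform(frame_a.reshape(1, -1))
    b = lle.transform(frame_b.reshape(1, -1))
    return float(np.linalg.norm(a - b))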

2.2.5 High definition (HD) video

High-definition (HD) video refers to any video system of higher resolution than the standard-definition video that is used in regular TV broadcasts and DVD-movies. The display resolutions for HD video are called 720p (1280x720), 1080i and 1080p (both 1920x1080). A resolution of 1440x1080 is called 1080 anamorphic. The difference between i and p is that interlaced (i) uses interlaced fields with half the stated resolution while progressive (p) uses frames with full resolution. As of today the TV-transmissions that are labelled HDTV use either 720p or 1080i; in Sweden it is mostly 1080i. The video that we use as HD video has a resolution of 1440x1080. It is originally recorded as interlaced video with 50 interlace fields per second but it is transformed into progressive video with 25 frames per second.

2.2.6 The face space and personal mimic space

A space which contains all possible human faces can be called a face space. It is high-dimensional and faces are distributed within the space. This space is mostly used for facial recognition, where the goal is to find differences between faces. PCA is an important technique for the creation of this space since PCA reduces the dimensionality of a space but still retains the distance, the difference, between the contents of the space. A face space extracted through PCA is compact enough to enable real-time usage of the space and it is still possible to find differences between the faces within the space. The Euclidean distance between the faces in the face space represents the difference in appearance that exists between the faces, the appearance difference


between people. This space contains no changes within the face; no facial mimic. A personal mimic space contains the same person's face but with different facial expressions. This space models all the possible expressions a person can exhibit through her face. A good explanation of the face space, or personal mimic space, is given by Ohba et al. [38,39]. Just as the face space can be reduced in dimensionality through PCA, so can the personal mimic space. The separation of the different faces in the personal mimic space corresponds to the separation of the facial expressions. The space is high-dimensional and cannot be visualized directly, so the first three dimensions of two mimic spaces are shown in Figure 2.14. Each point within this space is a certain expression, and the positions contained by, and close to, the points correspond to other expressions. By modeling this space you actually model a person's facial mimic.

2.2.7 Quality evaluation

In most experiments we make use of objective quality assessment through peak signal to noise ratio (PSNR). This quality depends on the mean square error (mse) between the original and reconstructed images. The mse and PSNR are calculated according to:

$$ mse = \frac{1}{h \cdot v} \sum_{j=1}^{h \cdot v} (I_j - \hat{I}_j)^2 \qquad (2.4) $$

where h and v are the horizontal and vertical resolution of the frames, respectively. I is the original image and \hat{I} is the reconstructed image.

$$ PSNR = 20 \cdot \log_{10} \left( \frac{255}{\sqrt{mse}} \right) \qquad (2.5) $$

where 255 is the maximum value for the pixel intensity; both for RGB and YUV color spaces. A higher PSNR value means that there is a small difference in pixel intensity between the two images (original and reconstructed). It doesn't have to mean that the reconstruction is visually better, but a low objective difference usually means a visually better result. In the thesis there are two different PSNR calculations:

• PSNR calculated for the R, G and B color channels. For these calculations the PSNR is presented as a single value.

• PSNR calculated for the Y, U and V channels of the YUV color space. For these calculations the PSNR is presented as three separate values; one each for Y, U and V.
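Equations 2.4 and 2.5 translate directly into a few lines of Python/NumPy. The sketch below assumes that both frames are given as arrays of 8-bit intensities and is only meant to illustrate the measure used throughout the thesis.

import numpy as np

def psnr(original, reconstructed):
    """PSNR between two frames according to equations 2.4 and 2.5."""
    original = original.astype(np.float64)
    reconstructed = reconstructed.astype(np.float64)
    mse = np.mean((original - reconstructed) ** 2)        # 1/(h*v) * summed squared error
    return 20.0 * np.log10(255.0 / np.sqrt(mse))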


Figure 2.14: The first three dimensions of the personal mimic space for two facial mimics.


Principal component analysis video coding

3.1 Principal component analysis

Principal component analysis (PCA) [6] is a transform for vector spaces that can be used to reduce the dimensionality of a data set. It is sometimes called the most valuable result from applied linear algebra. PCA is a linear transform that creates a new coordinate system for data. The new coordinates are selected based on the variance of the data. The first coordinate spans the highest variance of the data and the second spans the highest variance which is orthogonal to the first coordinate. The third coordinate spans the highest variance which is orthogonal to all previous coordinates, and so on. The technique works by calculating the Eigenvalues of a matrix and deciding the importance of each part of the matrix, i.e., each row or column of the matrix. The coordinates can be called bases and the goal of PCA is to find the most meaningful bases for the given data set. The new bases are a linear combination of the old bases, so PCA is a linear transform. If the original data is multi-dimensional it is first transformed into one-dimensional data, i.e., into vectors. The length of each vector is equal to the product of the lengths of all dimensions. In this work we deal with images, so the original data is two-dimensional and the length of each vector is h*v; the number of elements in the images:

$$ I = \begin{bmatrix} C_{1,1} & C_{2,1} & \ldots & C_{N,1} \\ \vdots & \vdots & & \vdots \\ C_{1,h \cdot v} & C_{2,h \cdot v} & \ldots & C_{N,h \cdot v} \end{bmatrix} \qquad (3.1) $$

where I is the original data as vectors and C is the original data as images. N is the number of elements in the original data. The mean of the data is removed so that the remaining data is centered on zero. The mean I0 is calculated as:


$$ I_0 = \frac{1}{N} \sum_{j=1}^{N} I_j \qquad (3.2) $$

The mean is then subtracted from each vector in the original data, ensuring that the data is zero-centered:

$$ \acute{I}_j = (I_j - I_0) \qquad (3.3) $$

In a mathematical expression PCA is a transform of the data where the new data is a linear combination of the original data. The combination is decided by a matrix F:

$$ \breve{I} = F \acute{I} \qquad (3.4) $$

PCA is theoretically the optimal transformation of data in a least square error sense, meaning that there is no other linear representation that outperforms a representation yielded through PCA. We want to find an orthonormal matrix F where \breve{I} = F\acute{I} such that the covariance matrix of \breve{I} is diagonal. The rows of F are then the principal components of the original data I. An orthonormal set of bases is a set of bases which are orthogonal to each other and have norm 1. This orthonormal matrix can be found by calculating the covariance matrix of the mean-centered data \acute{I}. One way to find it is to use singular value decomposition (SVD) (section 3.1.1).

The covariance matrix {\acute{I}\acute{I}^T} of \acute{I} will be very large since it is a square matrix of size h*v. To reduce the computational complexity of calculating the SVD or covariance matrix it is possible to create {\acute{I}^T\acute{I}} instead. This matrix is a square matrix of size N. The Eigenvectors of this matrix can be multiplied with the original data I to create the Eigenvectors φ_j for I:

$$ \phi_j = \sum_i b_{ij} (I_i - I_0) \qquad (3.5) $$

where I is the original data and I0 is the mean of all the original data. b_{ij} are the Eigenvectors of the covariance matrix {(I_i − I_0)^T (I_j − I_0)}. The Eigenvectors with the highest Eigenvalues correspond to the principal components that have the highest variance. When PCA has been performed for a data set the Eigenvectors can be arranged based on their importance. The Eigenvectors with low Eigenvalues can be discarded without any significant loss in data accuracy. An Eigenspace Φ can consist of N Eigenvectors φ_j:

$$ \Phi = \begin{bmatrix} \phi_{1,1} & \phi_{2,1} & \ldots & \phi_{N,1} \\ \vdots & \vdots & & \vdots \\ \phi_{1,h \cdot v} & \phi_{2,h \cdot v} & \ldots & \phi_{N,h \cdot v} \end{bmatrix} \qquad (3.6) $$


Only a chosen number M of principal components in Φ, j = (1,2,...,M), with the largest Eigenvalues Λ have to be stored as the Eigenspace:

$$ \Phi = \begin{bmatrix} \phi_{1,1} & \phi_{2,1} & \ldots & \phi_{M,1} \\ \vdots & \vdots & & \vdots \\ \phi_{1,h \cdot v} & \phi_{2,h \cdot v} & \ldots & \phi_{M,h \cdot v} \end{bmatrix} \qquad (3.7) $$

Coefficients {α_j} are extracted for any vector of the same size as the data by removing the mean I0 from the data and multiplying it with the Eigenvectors:

$$ \alpha_j = (I - I_0)^T \phi_j \qquad (3.8) $$

A coefficient α_j is available for every vector φ_j in the Eigenspace Φ. The coefficients can be used with the Eigenvectors φ_j and the mean I0 to approximate the original data. If all coefficients are used the representation is error-free:

$$ I = I_0 + \sum_{j=1}^{N} \alpha_j \phi_j \qquad (3.9) $$

If only a chosen number of coefficients M are used to reconstruct the data this will incur an error in the representation:

$$ \hat{I} = I_0 + \sum_{j=1}^{M} \alpha_j \phi_j \qquad (3.10) $$

Because PCA is very efficient at compacting the important data, the number M can be kept very small while the representation error (I − \hat{I}) stays low. An important assumption used for PCA in this work is that the data is Gaussian distributed (normally distributed). The Gaussian distribution ensures that the principal components which are extracted are not only orthogonal but also independent.
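The extraction of the Eigenspace described above can be written compactly with NumPy. The sketch below is only an illustration of the procedure in equations 3.2-3.7 under the assumption that all frames fit in memory as one matrix; it uses the small N x N covariance matrix (equation 3.5) rather than the full h*v x h*v one.

import numpy as np

def build_eigenspace(I, M):
    """I: (num_pixels, N) matrix with one vectorized frame per column (equation 3.1).
    Returns the mean image, the M Eigenimages and the corresponding Eigenvalues."""
    I0 = I.mean(axis=1, keepdims=True)              # mean image, equation 3.2
    A = I - I0                                      # zero-centered data, equation 3.3
    evals, b = np.linalg.eigh(A.T @ A)              # small N x N covariance matrix
    order = np.argsort(evals)[::-1][:M]             # keep the M largest Eigenvalues
    phi = A @ b[:, order]                           # Eigenimages, equation 3.5
    phi /= np.linalg.norm(phi, axis=0)              # normalize each Eigenimage
    return I0.ravel(), phi, evals[order]            # scale Eigenvalues by 1/N for the covariance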

3.1.1 Singular Value Decomposition

Singular value decomposition (SVD) is a factorization method from linear algebra. It divides a matrix into three matrices:

$$ I = U \Sigma V^T \qquad (3.11) $$

where U and V are unitary matrices and Σ is diagonal. Together the matrices contain the Eigenvectors and Eigenvalues of the original matrix I. SVD is used in PCA to extract the Eigenvectors (principal components) and Eigenvalues of the matrix I. Only the Eigenvectors are used for further processing in PCA; the Eigenvalues are used for calculation of model efficiency.
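In NumPy the same decomposition is a single call. The snippet below is a hedged illustration with a random placeholder matrix standing in for the zero-centered data; the squared singular values are proportional to the Eigenvalues used for the bounds in section 3.4.

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((240 * 176, 50))         # placeholder zero-centered data (pixels x frames)
U, S, Vt = np.linalg.svd(A, full_matrices=False)
eigenimages = U                                   # principal components (Eigenvectors)
eigenvalues = S ** 2 / A.shape[1]                 # Eigenvalues of the covariance matrix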


3.2 Principal component analysis video coding

Video coding through principal component analysis actually consists of two parts:

1. Creation of a model for the facial mimic.

2. Encoding and decoding (reconstruction) of video frames.

PCA is only used for the first part; model extraction. Both encoding and decoding of video are done through vector multiplication. Assume that there is an original video sequence C with a given number of frames N. The video shows a person displaying the basic emotions and it is normalized regarding the pixel positions of the facial features. The most prominent information in this video is the changes in the face, i.e., the facial mimic. A model of this video which is extracted through PCA is a personal mimic space for this person. According to section 3.1 a model, i.e., an Eigenspace, is extracted from the original video frames. Before the original data is transformed to vectors it is YUV subsampled (section 2.1.1.3), so the length of each vector is halved compared to using all available RGB pixels:

$$ I = \begin{bmatrix} C_{1,1} & C_{2,1} & \ldots & C_{N,1} \\ \vdots & \vdots & & \vdots \\ C_{1,(x/2) \cdot y} & C_{2,(x/2) \cdot y} & \ldots & C_{N,(x/2) \cdot y} \end{bmatrix} \qquad (3.12) $$

where I is the original data as vectors and C is the original data as images. N is the number of elements in the original data, x is the horizontal resolution in RGB pixels and y is the vertical resolution. An Eigenspace Φ = [φ_1 φ_2 ... φ_N] is extracted from I with the mean I0 subtracted from I:

$$ \phi_j = \sum_i b_{ij} (I_i - I_0) \qquad (3.13) $$

where b_{ij} are the Eigenvectors of the covariance matrix {(I_i − I_0)^T (I_j − I_0)}. Encoding is the same as extracting projection coefficients {α_j} for each video frame that should be encoded:

$$ \alpha_j = (I - I_0)^T \phi_j \qquad (3.14) $$

where I is the video frame that is encoded. Decoding is performed by combining the mean I0, the coefficients {α_j} and the Eigenvectors φ_j:

$$ \hat{I} = I_0 + \sum_{j=1}^{M} \alpha_j \phi_j \qquad (3.15) $$


where M is a selected number of principal components used for reconstruction (M < N). The extent of the error incurred by using fewer components than possible (M < N) is examined in section 3.4. With the model it is possible to encode entire video frames to only a few coefficients {α_j} and reconstruct the frames with high quality. Only these coefficients {α_j} need to be transmitted between encoder and decoder when they both have access to the model, i.e., access to the Eigenspace Φ. This allows video coding to be performed at extremely low bitrates. Exactly how low is also discussed in section 3.4. PCA allows the decoder to scale the resulting video based on reconstruction quality. A high number M of principal components will increase the quality of the reconstructed video compared to a low number of principal components. The encoder doesn't have to do anything differently for decoding with different quality of the video. Heterogeneous decoding of the same encoding is achieved with PCA; usually the encoder has to encode the video into layers and the decoder scales the quality based on the number of layers which are used for decoding. A detailed explanation of PCA video coding is found in Paper I.
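With an Eigenspace available at both ends, per-frame encoding and decoding reduce to the two matrix-vector products of equations 3.14 and 3.15. The sketch below assumes the mean image and Eigenimages from the earlier extraction example; the bitrate comment is simple arithmetic under an assumed quantization of 8 bits per coefficient, not a figure taken from the thesis.

import numpy as np

def encode_frame(frame, I0, phi):
    """Equation 3.14: project one vectorized frame onto the Eigenspace."""
    return phi.T @ (frame - I0)           # the M coefficients are all that is transmitted

def decode_frame(alpha, I0, phi):
    """Equation 3.15: rebuild the frame from the received coefficients."""
    return I0 + phi @ alpha

# If, say, M = 10 coefficients are each sent with 8 bits at 15 fps, the payload is
# 10 * 8 * 15 = 1200 bits/s; the exact rate depends on how the coefficients are coded.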

3.2.1 Comparison with H.264

The state of the art in video coding today is a standard called H.264/AVC [15,16]. Any new video coding technique should be compared with the best available today. To show the benefits of our PCA video codec we compress the same video sequences with H.264 and compare the results. We have used ffdshow v. 20051129 to perform the H.264 encoding. The I-frame interval is set to 500, so the video sequences, which are 450 frames long, are encoded with an initial I-frame and then P-frames (IPPPP...). We use one-pass encoding based on quantization steps and we use the quantization steps to encode the video at different bitrates. It is actually not possible to reach bitrates lower than 2.4 kbps with H.264. Even when the highest quantization steps and only one I-frame are used the bitrate never goes below 2.4 kbps. The results measured in PSNR for two video sequences are shown in Figure 3.1. Some example frames for H.264 encoding at 2.4 kbps are shown in Figure 3.2. It can clearly be seen that the visual quality at this rate is far from satisfying. It is possible to discern that this is a face but nothing more. When we use coding based on PCA the result is vastly better, even at lower bitrates.

3.3 Encoding and decoding time

The encoding and decoding of this video is very simple compared to other video coding schemes. But to ensure that the coding scheme can work in real-time on mobile devices we measured the encoding and decoding time on a computer with limited computational and memory capacity (a Pentium III workstation with 256 MB RAM).


Figure 3.1: Comparison between a codec based on PCA and an H.264 codec at very low bitrates. (– PCA coding - - H.264)

Figure 3.2: Example frames from H.264 encoding at 2.4 kbps.


The average time for encoding one video frame was 10.9 ms for an Eigenspace of 10 Eigenimages and 26.6 ms for 25 Eigenimages. On average it took 13.8 ms to decode one frame for 10 Eigenimages and 34.9 ms for 25 Eigenimages. The combined encoding and decoding time corresponds to a framerate of 40 fps for 10 Eigenimages and 16 fps for 25 Eigenimages. The video sequences have a framerate of 15 fps, so it is possible to encode them with up to 25 Eigenimages in real-time. The capacity of mobile devices is today higher than the capacity of our test setup.

3.4 Theoretical bounds

The performance of PCA video coding is dependent on the facial mimic model extracted through PCA. The efficiency of this model directly decides how many dimensions are needed to reach a certain representation quality. By modelling the six basic emotions it is possible to model all possible facial emotions. It is however not so straightforward to say that you can use 6 images of the basic expressions to model all images of the facial expressions. There is a difference between the psychological representation of expressions and the representation in digital images. The representation in digital images, and video, is usually measured objectively and not subjectively. So, the result will be a certain quality for a given number of model dimensions; not a claim that all possible emotions are modelled. We have examined how compact the representation needs to be to reach a certain reconstruction quality for images approximated by the model. Kirby and Sirovich have previously stated that 100 male Caucasian faces are enough to model all possible male Caucasian faces [40,41]. This means that a space of 100 dimensions is enough to model millions of faces. All these faces do however have the same facial expression; usually a neutral expression where the teeth aren't showing. This doesn't model any facial expression and the modelling of facial mimic isn't considered. There are implementations that take the personal mimic space into consideration. These implementations use a personal mimic space for encoding of facial mimic, either still images or video sequences. The results show that models of the facial mimic can represent facial images at very low bitrates with high quality, but there is no bound that describes how good the modelling can be. Torres et al. report an average reconstruction quality of 29.6 dB when they use 16 basis images [26]. Crowley et al. show that they can produce a reconstruction quality of almost 35 dB (PSNR) when they use 15 basis images for reconstruction [31]. We have shown that when 10 basis images are used an average reconstruction quality above 31 dB can be achieved [42]. More information about these implementations is found in section 2.1.3 and section 3.2. We will describe two different bounds for facial mimic modelling. The bounds are affected by several factors so we have limited them to the following circumstances:

• A spatial resolution of 240x176 pixels.


• A color depth of 8 bits per pixel (0-255).

• RGB color space in the original video (24 bits).

• Theoretical bounds are calculated individually for each person.

• The objective quality is measured for the R, G and B channels together.

The spatial resolution is chosen to match the resolution of mobile phones. The standard with the highest quality that was used when these boundaries were calculated was called QVGA and has approximately 77000 pixels in total [43]. A resolution of 240x176 pixels is equal to 42240 pixels. The resolution can be different and the rate is still unchanged; the quality might be affected. We have calculated two bounds for the modelling efficiency; a distortion bound and a rate-distortion bound. The distortion bound shows the minimum distortion that can be achieved for a specific source when there are no restrictions on the rate. The rate-distortion bound describes the minimum rate that can be used at a given distortion. The source cannot be modelled correctly with a rate lower than the rate described by the function. The bounds are measured objectively since it is impossible to calculate bounds for something that is subjective; the achieved subjective quality is dependent on the subjects.

3.4.1 Distortion bound

The representation quality of facial mimic is affected by the number of dimensions that are used for modelling of the mimic. The mean value for each pixel position I0 also affects the modeling to a large extent. In section 3.1 we describe how data is modelled through PCA; this representation is error-free when all N principal components are used. An error in representation occurs when fewer features, or components (M < N), are used. The error is calculated as:

$$ mse(opt) = \sum_{j=M+1}^{N} \lambda_j \qquad (3.16) $$

where λ_j are the Eigenvalues of the principal components. The modeling efficiency is calculated as the sum of the error which is incurred by not using a number of Eigenvectors. How the mean square error can be calculated for a high-dimensional source when the mean has been subtracted is explained by Fukunaga [44]. This only explains how much information is removed from mean-centered data. To calculate the quality of the total representation you also need to include the mean. A mean square error bound can be calculated for the number of Eigenvectors φ_j that are used for image representation (equation 3.16) by varying the number of dimensions M of the model which are used for representation of the facial mimic. This is the distortion bound for representing the signal with the selected number of


Number of Eigenvectors φ_j:    5       10      15      20      25
PSNR [dB]:                     34.56   36.82   38.13   39.07   39.75

Table 3.1: Average PSNR values for 10 facial mimic video sequences.

Eigenvectors. If the mean square error (mse) is reduced, the representation error is reduced. Peak signal-to-noise ratio (PSNR) is calculated from the mse so that a higher value means better representation. Even though the distortion bound is calculated individually for each person, the average result for the facial mimic of 10 video sequences (6 different persons) is shown in Table 3.1. This table shows the average maximum quality that can be reached for facial mimic representation. The bound starts at 0 Eigenvectors and continues above 25 as well; we have chosen to calculate it from 5 to 25 Eigenvectors.
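Given the Eigenvalues of a mimic model, the distortion bound is a running sum. The sketch below converts equation 3.16 into a PSNR value for a range of M; how the Eigenvalues are scaled (total versus per-pixel energy) depends on how the covariance matrix was normalized, so the per-pixel division here is an assumption made for the example.

import numpy as np

def distortion_bound_psnr(eigenvalues, num_pixels, M_values=range(5, 26, 5)):
    """Equation 3.16 expressed as PSNR for several model sizes M.
    eigenvalues: all N Eigenvalues sorted in descending order."""
    bound = {}
    for M in M_values:
        mse_opt = np.sum(eigenvalues[M:]) / num_pixels   # error of the discarded components
        bound[M] = 20.0 * np.log10(255.0 / np.sqrt(mse_opt))
    return bound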

3.4.2 Rate-distortion bound

With the distortion bound we only examined the maximum quality that can be reached for modelling of facial mimic with a certain number of dimensions. For the rate-distortion bound we also calculate the minimum bitrate which is needed to reach a certain representation quality. In many low bandwidth applications (almost all) the limiting factor is bandwidth, so the bitrate becomes an important factor; often more important than the quality. The quality will be maximized based on the available bitrate, or the available bandwidth. If quantization is used one important question is how to assign the bits so that the source is modelled in the most efficient way, assuming that the bitrate is fixed. The actual assignment of bits is not discussed here; the bound describes how many bits you can use with an optimal assignment. The theoretical bound can be calculated with a rate-distortion function. Rate-distortion functions refer to the minimum rate for representing a source at a given distortion, or the lowest possible distortion at a given rate. High-dimensional sources can be assumed to consist of several variables. When PCA is used to create the model these variables are Gaussian distributed and independent of each other. The rate-distortion function for such a source is calculated through reverse "water-filling" where the rate and distortion are controlled by a variable γ [45]. The variable γ controls how many dimensions, i.e., principal components, are used for source representation; the number of dimensions relates to both the rate and the distortion. An example of reverse "water-filling" when six principal components are used is shown in Figure 3.3. Only the principal components that have a variance higher than γ are represented with bits; all others are ignored. The rate-distortion function controlled by γ is given by


Figure 3.3: Reverse water-filling for independent Gaussian principal components. Only the components that have a variance larger than γ are allocated bits in the quantization process.

Figure 3.4: Mean rate-distortion bound for all 10 video sequences of facial mimic.

$$ R(D) = \sum_{j=1}^{M} \frac{1}{2} \log \frac{\sigma_j^2}{D_j} \qquad (3.17) $$

where

$$ D_j = \begin{cases} \gamma, & \text{if } \gamma < \sigma_j^2 \\ \sigma_j^2, & \text{if } \gamma \geq \sigma_j^2 \end{cases} \qquad (3.18) $$

where γ is chosen so that

$$ \sum_{j=1}^{M} D_j = D $$

The average results for 10 video sequences are shown in Figure 3.4.
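The water level γ in equations 3.17-3.18 can be found numerically, for example by bisection. The sketch below is one straightforward way to evaluate the bound from the component variances; the base-2 logarithm (rate in bits) is an assumption, as the text does not state the base in equation 3.17.

import numpy as np

def rate_distortion(variances, D_target):
    """Reverse water-filling: find gamma so that sum(D_j) = D_target (equation 3.18)
    and return the corresponding rate from equation 3.17."""
    lo, hi = 0.0, float(np.max(variances))
    for _ in range(100):                          # bisection on the water level gamma
        gamma = 0.5 * (lo + hi)
        if np.sum(np.minimum(gamma, variances)) > D_target:
            hi = gamma
        else:
            lo = gamma
    Dj = np.minimum(gamma, variances)
    active = variances > gamma                    # only these components are allocated bits
    rate = 0.5 * np.sum(np.log2(variances[active] / Dj[active]))
    return rate, gamma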


3.4.3 Comparison of the distortion bounds

The two different bounds are calculated for the same video sequences so they are directly comparable. The distortion bound describes the representation quality which is achievable with an unlimited bitrate. The rate-distortion bound describes the quality that is achievable with a given number of bits. As the bitrate is increased the two bounds come closer, and if the rate in the rate-distortion bound is high enough they will be exactly equal. The difference between the two bounds can be measured as:

$$ M \cdot \gamma \qquad (3.19) $$

where M is the number of Eigenvectors φ_j used for reconstruction and γ is the level from equation 3.18. A more detailed description of the bounds can be found in Paper II.

3.5 Eigenimage compression

Video coding based on PCA requires that principal images of the facial mimic are available at both the encoder and the decoder side. These images take up a lot of space; originally an Eigenspace of 10 Eigenimages needs 5 MB of storage. The video sequences used in this section have a spatial resolution of 240x176 pixels. After chroma subsampling the storage need is reduced to 2.5 MB. The storage itself is not a major concern since mobile devices, or any device in fact, have a storage capacity much higher than this. But the transmission cost for sending 2.5 MB over any network can be high, and for low capacity networks it is extremely high. It would for example take more than 35 minutes to transmit 2.5 MB over a GSM network, and approximately 3 minutes over a GPRS network. To enable transmission of the Eigenimages over a low capacity network the Eigenspace needs to be compressed. The Eigenspace consists of images, so a straightforward way to compress them is to use image compression; we will use the JPEG image compression standard [46,47]. The mean image I0 is also needed for both encoding and decoding and it should also be compressed with JPEG compression. Compression of the Eigenimages φ_j and the mean image I0 will affect the reconstruction quality of video which is encoded and decoded with them. The images are first quantized and the reconstruction levels are stored. The quantized images are then JPEG encoded and the encoded images are used together with the reconstruction levels from the quantization to reconstruct the Eigenimages:

1. Quantization of the Eigenspace. The quantization values are stored in an image. The reconstruction values are stored without loss.


2. JPEG-encoding of the quantization values.

3. Inverse quantization mapping of the JPEG-encoded values with the quantization reconstruction values.
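A rough sketch of these three steps is given below, using Pillow for the JPEG step and a uniform 8-bit quantizer; both choices, as well as treating each Eigenimage as a single grayscale plane, are assumptions made for illustration.

import numpy as np
from io import BytesIO
from PIL import Image

def compress_eigenimage(phi, shape, quality=75):
    """Steps 1-2: uniform 8-bit quantization followed by JPEG encoding.
    Returns the JPEG bytes and the (losslessly stored) reconstruction range lo, hi."""
    lo, hi = float(phi.min()), float(phi.max())
    q = np.round((phi - lo) / (hi - lo) * 255).astype(np.uint8)     # quantization values
    buf = BytesIO()
    Image.fromarray(q.reshape(shape)).save(buf, format="JPEG", quality=quality)
    return buf.getvalue(), lo, hi

def decompress_eigenimage(jpeg_bytes, lo, hi):
    """Step 3: JPEG decoding followed by inverse quantization mapping."""
    q = np.asarray(Image.open(BytesIO(jpeg_bytes)), dtype=np.float64)
    return lo + q.ravel() / 255.0 * (hi - lo)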

3.5.1 Quantization - uniform or pdf-optimized?

The quantization of the Eigenspace Φ and mean image I0 can be performed in many different ways. We have found that when ≈ 7 bits are used for quantization there is no difference between uniform and pdf-optimized quantization. 8-bit quantization is used since it is a suitable number for computers (1 byte). We therefore use uniform quantization in our work, since it also reduces the data amount needed for inverse quantization.

3.5.2 Compression of the mean image

The mean image is compressed in the same way as the Eigenimages. Since a compressed mean is subtracted before the model is extracted with PCA, and the same compressed mean can be added in the decoding process, it can be compressed more heavily than the Eigenimages. The mean is compressed with a JPEG quality factor of 25, resulting in a size of 6 kB for the mean image.

3.5.3 Loss of orthogonality

The principal components in the original, uncompressed Eigenspace are orthogonal, meaning that they follow

$$ \phi_i^T \phi_j = \begin{cases} 1 & \text{for } i = j \\ 0 & \text{for } i \neq j \end{cases} \qquad (3.20) $$

The orthogonality between the principal components ensures that the information in one principal component is independent of the information in another principal component. This increases the compactness and efficiency of the model since no information is described in more than one component. Compression of the Eigenimages will result in a loss of orthogonality and model degradation. The loss of orthogonality is measured as the average of the sum of the inner products between all principal components, Σ φ_i^T φ_j. When there is perfect orthogonality this sum is zero. Even though the orthogonality between the principal components is lost in compression it can be regained. We have examined two different methods to ensure orthogonality between the Eigenimages:

• Least-square calculation of projection coefficients

• Re-orthogonalization of the Eigenspace through modified Gram-Schmidt projection


Method                 Storage need [kB]
Original               256
Quantized [8 bits]     64
JPEG-compressed        1-37

Table 3.2: Storage need for one Eigenimage.

Both of these methods ensure orthogonality between the principal components, but in different ways.
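As an illustration of the second alternative, the sketch below measures the orthogonality loss of a decompressed Eigenspace and re-orthogonalizes it with modified Gram-Schmidt. The exact normalization of the loss measure is not spelled out in the text, so the averaging over component pairs is an assumption.

import numpy as np

def orthogonality_loss(phi):
    """Average absolute inner product between distinct Eigenimages; zero when equation 3.20 holds."""
    G = phi.T @ phi
    M = G.shape[0]
    return np.sum(np.abs(G - np.diag(np.diag(G)))) / (M * (M - 1))

def modified_gram_schmidt(phi):
    """Re-orthogonalize the (decompressed) Eigenimages column by column."""
    Q = phi.astype(np.float64).copy()
    for j in range(Q.shape[1]):
        for i in range(j):
            Q[:, j] -= (Q[:, i] @ Q[:, j]) * Q[:, i]   # remove the component along Q[:, i]
        Q[:, j] /= np.linalg.norm(Q[:, j])
    return Q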

3.5.4 Compression methods

To compare the different methods we have evaluated their performance. As reference we have used the original Eigenimages and mean images to see how much the different methods degrade the reconstruction performance. Table 3.2 depicts the storage need of one Eigenimage when it is compressed. The methods also require a different number of operations for encoding and decoding; they yield different complexities. Orthogonality loss is only an important factor when the compression ratio is high. Figure 3.5 shows the orthogonality loss for the different methods together with a chosen compression level. The size of the compressed Eigenimages is depicted along the horizontal axis and the left side of the y-axis shows the reconstruction quality in PSNR. The right side of the y-axis depicts the loss of orthogonality and the vertical line in the figure symbolizes the compression ratio that we have chosen to use. It can clearly be seen in this figure that the orthogonality loss only has an effect for very high compression. There is no need to use any kind of re-orthogonalization. The best method is to use a compressed Eigenspace and mean image at both encoder and decoder without any re-orthogonalization technique. If 10 Eigenimages are used and they are compressed to 8 kB each and the mean image is compressed to 6 kB, they can be transmitted over a GSM network in ≈ 1 minute and in ≈ 6 seconds over a GPRS network. The average reconstruction quality for video encoded with the compressed Eigenspace is still above 34 dB, a reduction of ≈ 2 dB.

3.6 Eigenspace re-use

Existing Eigenspaces can be used for a new communication event instead of creating a new Eigenspace and transmitting it. Since both encoder and decoder already have access to this Eigenspace it can be used directly. There is an existing Eigenspace Φ which is extracted from video sequence I with the mean I0 . There is another video sequence G and this video is encoded with Eigenspace Φ and mean I0 . Both I and G show the same person and are


Figure 3.5: Comparison of the different coding schemes (– compressed Eigenspace at both encoder and decoder, - - compressed Eigenspace at decoder & original mean image, -· compressed Eigenspace at decoder & compressed mean image, ·· LS & GS result, – loss of orthogonality).

recorded with the helmet. But the facial image differs in viewing angle, facial size and illumination. Encoding of G with Φ and I0 is performed in the same way as if video sequence I were encoded. The projection coefficients that are extracted are denoted α_j^G since they come from projection of video sequence G:

$$ \alpha_j^G = (G - I_0)^T \phi_j \qquad (3.21) $$

Decoding is performed by combining the coefficients {α_j^G}, the Eigenvectors φ_j and the mean image I0. The projection coefficients {α_j^G} are multiplied with the Eigenvectors φ_j and the mean image I0 is added to reconstruct the image \hat{G}:

$$ \hat{G} = I_0 + \sum_{j=1}^{M} \alpha_j^G \phi_j \qquad (3.22) $$

The resulting video \hat{G} will resemble I instead of G since it is the mean image and Eigenimages extracted from I that are used for reconstruction. The result is not good since the facial features are not positioned at the same pixel positions in I and G. When the mean image I0 is subtracted from G the result is an image that actually isn't an image of a face.

3.6.1 Sensitivity to lighting and positional shift

The result is not satisfactory and it may depend on two factors:

1. Different pixel positions for the facial features in I and G.


2. Different pixel intensities for the facial features in I and G.

The effect of different lighting is negligible but it can still be adjusted for. The mean intensity of the mean image I0 is calculated. The same value is calculated for each frame in G and the difference between them can be subtracted from the new frames in G. The difference in pixel positions for the facial features between I and G needs to be addressed. We have used an affine transformation to normalize between them, i.e., align one face to the other. Affine, or any, transformation of the Eigenimages φ_j would reduce the orthogonality of the space and consequently the encoding efficiency. So, it is more convenient to transform the frames in G instead. The affine transformation is performed through multiplication with a rotation matrix A and addition of a translation vector b:

$$ \begin{bmatrix} \acute{x} \\ \acute{y} \end{bmatrix} = A \begin{bmatrix} x \\ y \end{bmatrix} + b \qquad (3.23) $$

where matrix A refers to a rotation and/or scaling and vector b refers to a translation. x, y, \acute{x} and \acute{y} are the horizontal and vertical positions in the original and transformed image respectively. Semantically important feature points from the eyes and the mouth are collected along with points from the nostrils for both I and G. The points are selected from the mean of the respective video sequence. The rotation matrix A and translation vector b are calculated from these points. The frames from video sequence G are then transformed to \acute{G}:

$$ \acute{G} = AG + b \qquad (3.24) $$

This video sequence is encoded with Eigenspace Φ and mean image I0, yielding projection coefficients α_j^{\acute{G}}:

$$ \alpha_j^{\acute{G}} = (\acute{G} - I_0)^T \phi_j \qquad (3.25) $$

These coefficients are then used with Φ to reconstruct the frames of G:

$$ \hat{\acute{G}} = I_0 + \sum_{j=1}^{M} \alpha_j^{\acute{G}} \phi_j \qquad (3.26) $$

The result is much better than the result without affine transformation but it is still not acceptable. A detailed description of this experiment is found in Paper III. Affine transformation normalization is not enough to solve the problem with different positions of the facial features. One issue might be that a global affine transformation is performed, so that all features are transformed according to a single transform matrix. There are still both visually poor frames and emotions which are mapped incorrectly between I and G. A better normalization (alignment) is needed.
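For concreteness, the alignment step of equations 3.23-3.24 can be sketched as a least-squares fit of A and b to the corresponding feature points, followed by warping each frame. The use of OpenCV's warpAffine and the exact point sets are assumptions made for the example.

import numpy as np
import cv2

def estimate_affine(points_G, points_I):
    """Least-squares estimate of A and b in equation 3.23 from corresponding
    feature points (eyes, mouth, nostrils), given as (N, 2) arrays."""
    X = np.hstack([points_G, np.ones((points_G.shape[0], 1))])     # [x y 1] per point
    params, _, _, _ = np.linalg.lstsq(X, points_I, rcond=None)
    A = params[:2].T                                               # 2x2 rotation/scaling
    b = params[2]                                                  # translation
    return A, b

def warp_frame(frame, A, b):
    """Apply equation 3.24 to one frame of G."""
    M = np.hstack([A, b.reshape(2, 1)]).astype(np.float64)
    return cv2.warpAffine(frame, M, (frame.shape[1], frame.shape[0]))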


Ultra low bitrate video coding

With ultra low bitrate we mean a bitrate close to 100 bits/s (0.1 kbps). The basic idea here is not principal component analysis but facial recognition. The encoder and decoder have a personal mimic gallery for a person, which contains several images of the person with different facial mimic; it should cover all the possible facial expressions a person can exhibit through her face. Encoding is done by matching an input frame against the personal mimic gallery based on facial recognition. The only information that needs to be transferred between the encoder and decoder is the index number of the selected frame from the gallery. The decoder simply has to retrieve the image with the correct index from the gallery and display it. The idea for the coding scheme is visualized in Figure 4.1. The personal mimic gallery may contain several images but all the images are highly correlated since it is the same person in all of them. Highly correlated images require very few bits for indexing; our experiments show that on average we need 4 bits for indexing a specific mimic. With a framerate of 25 fps this only requires 100 bits per second for transmission. The personal mimic gallery can be downloaded, or pre-installed, for a specific user and high quality can be used for the images in the gallery. The displayed images are taken from the gallery, so the image quality in the gallery directly decides the quality of the displayed frames. This coding gives high quality video at extremely low bitrates but each frame is treated individually; not as the part of a sequence which each frame actually is. The dynamic behavior of a reconstructed video sequence isn't considered.

Figure 4.1: Ultra low bitrate video coding scheme.



Figure 4.2: Personal mimic face images mapped into the embedding space described by the first two coordinates of LLE.

To study this we create two-dimensional locally linear embedding (LLE) [37] spaces for the original images and for the images in the reconstructed video. Figure 4.2 shows a two-dimensional LLE space for the original images. The point which links the two parts of the space corresponds to a neutral face. When the frames from the reconstructed video sequence are plotted in the LLE space (Figure 4.3) it is clear that a lot of jumps occur between the two emotion branches. Each jump corresponds to a jerky movement in the sequence.
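A minimal sketch of the encoder and decoder side of this scheme is given below. The matching step in the thesis is based on facial recognition (the evaluation in Table 4.1 mentions an HMM-based matcher); the nearest-neighbour search in a feature space used here is only a placeholder for that step, and the project function is assumed to exist.

import numpy as np

def encode_frame_index(frame, gallery_features, project):
    """Match the input frame against the gallery and return only the index to transmit."""
    f = project(frame)                                             # feature vector of the input frame
    return int(np.argmin(np.linalg.norm(gallery_features - f, axis=1)))

def decode_frame_index(index, gallery_images):
    """The decoder simply looks up and displays the stored gallery image."""
    return gallery_images[index]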

4.1 LLE smoothing

After smoothing in the LLE space the number of jumps is reduced significantly. Dynamic programming (DP) is used to optimize the index sequence of the reconstructed frames. Instead of using the similarity cost with smoothing for each individual image, the globally optimal matching over the whole sequence is selected. The smoothness cost is calculated from the LLE space, so the optimization is done in the LLE space. With this smoothing the resulting video will get slightly improper expressions for each individual frame, but the sequence of images will appear much more natural (Figure 4.4).
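The dynamic programming step can be sketched as a Viterbi-style search over gallery indices, where each frame pays a matching cost and each transition pays the LLE-space distance between consecutive gallery images. The trade-off weight and the cost definitions below are assumptions for the example, not values from the thesis.

import numpy as np

def smooth_indices(similarity_cost, lle_coords, smooth_weight=1.0):
    """similarity_cost: (T, G) matching cost of each frame against each gallery image.
    lle_coords: (G, d) LLE embedding of the gallery images.
    Returns the globally optimal index sequence."""
    T, G = similarity_cost.shape
    trans = smooth_weight * np.linalg.norm(
        lle_coords[:, None, :] - lle_coords[None, :, :], axis=2)   # transition (smoothness) cost
    cost = similarity_cost[0].copy()
    back = np.zeros((T, G), dtype=int)
    for t in range(1, T):
        total = cost[:, None] + trans                  # arriving at each index from every previous one
        back[t] = np.argmin(total, axis=0)
        cost = total[back[t], np.arange(G)] + similarity_cost[t]
    path = [int(np.argmin(cost))]
    for t in range(T - 1, 0, -1):                      # backtrack the optimal path
        path.append(int(back[t][path[-1]]))
    return path[::-1]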



Figure 4.3: The reconstructed frames shown in a two-dimensional LLE space.


Figure 4.4: The reconstructed frames shown in a two-dimensional LLE space with smoothing.


Table 4.1: Subjective evaluation of reconstructed sequences

              Good     Acceptable   Unacceptable
HMM           73.7%    19.6%        6.7%
HMM+LLE       80.3%    16.2%        3.5%

Since each image has a high individual quality the resulting video will still have the same high quality. Since PSNR is measured as the difference between the original and reconstructed image, an incorrect expression will give a lower PSNR even if the visual quality is the same. As an evaluation of the improvement that LLE smoothing can give we performed a subjective evaluation. A subject is shown two frames, the original and the reconstructed image, and is asked about the match between these frames regarding the facial expression. The subject can grade the match as good, acceptable or unacceptable. The result of this evaluation is shown in Table 4.1. A more detailed description of this coding is found in Paper IV.


Asymmetrical Principal Component Analysis video coding

5.1 Asymmetrical principal component analysis video coding

Video coding based on PCA offers high quality video at very low bitrates. The video coding is very good but it can be improved further. There are essentially two major issues with PCA video coding. The first regards the model efficiency. PCA extracts the most important information based on the variance of the information throughout a data set. If the facial mimic changes, the underlying pixels will have a large variance and are given high importance. But the background, or the edges between the face and the background, may also shift rapidly and exhibit large variance. These are pixels without semantically important information but with high variance. Such pixels will degrade the efficiency of a PCA model since the background may change without any change in facial expression. The model should only depend on semantically important information. The second issue regards the complexity of encoding and decoding the video. The complexity is much lower than for conventional video coding based on DCT (section 2.1.1) but it can still be reduced. All operations for encoding and decoding are linearly dependent on the number of elements (pixels) in the frames and the Eigenspace Φ. A video with high spatial resolution will require more computations than a video with low resolution. But when the frame is decoded it is a benefit to have a large spatial resolution (frame size) since this provides better visual quality. A small frame should be used for encoding to reduce the complexity and a large frame should be used for decoding to optimize the quality. This is possible to achieve through the use of pseudo principal components; components where not the entire frame is a principal component. Parts of the video frames are considered to be important; they are regarded as foreground I^f:

$$ I^f = crop(I) \qquad (5.1) $$

The Eigenspace for the foreground Φ^f = {φ_1^f φ_2^f ... φ_N^f} is constructed according to the following formula:


$$ \phi_j^f = \sum_i b_{ij}^f (I_i^f - I_0^f) \qquad (5.2) $$

where b_{ij}^f are the Eigenvectors from the foreground of the video frames I^f and I_0^f is the mean of the foreground. Encoding and decoding are performed as:

$$ \alpha_j^f = (I^f - I_0^f)^T \phi_j^f \qquad (5.3) $$

$$ \hat{I}^f = I_0^f + \sum_{j=1}^{M} \alpha_j^f \phi_j^f \qquad (5.4) $$

where {α_j^f} are coefficients extracted using information from the foreground I^f. The reconstructed frame \hat{I}^f has a smaller size and contains less information than a full size frame. A space which is spanned by components where only the foreground is orthogonal can be created. The components spanning this space are called pseudo principal components and this space has the same size as a full frame:

$$ \phi_j^p = \sum_i b_{ij}^f (I_i - I_0) \qquad (5.5) $$

From the coefficients {α_j^f} it is possible to reconstruct the entire frame:

$$ \hat{I} = I_0 + \sum_{j=1}^{M} \alpha_j^f \phi_j^p \qquad (5.6) $$

where M is the selected number of pseudo components used for reconstruction. A full frame can thus be reconstructed (Eq. 5.6) using the projection coefficients from only the foreground of the video (Eq. 5.3), so the foreground is used for encoding while the entire frame is decoded. It is easy to prove that

$$ \hat{I}^f = crop(\hat{I}) \qquad (5.7) $$

since φ_j^f = crop(φ_j^p) and I_0^f = crop(I_0). aPCA adds something very interesting to video coding based on PCA; namely spatial scalability. We have already explained that PCA offers scalability in quality for the decoder without the encoder having to do anything extra. With the use of pseudo principal components a decoder can also decide how much of the frame it wants to decode without any decision from the encoder. The encoder encodes the video with I^f and several different versions of Φ^p can be produced, giving the decoder freedom of spatial scalability. The spatial scalability doesn't simply function as a downsampling in size; the parts which are decoded can be decoded with full resolution. Reduction in spatial resolution is not a size reduction of the entire


frame since parts of the frame can be decoded. No quality is lost in the decoded parts; it is up to the decoder to choose how much and which parts of the frame it wants to decode.
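A compact sketch of the asymmetry in equations 5.1-5.6 is given below: the foreground Eigenspace is built from cropped frames, the same weights are applied to the full frames to form the pseudo principal components, the encoder projects only the foreground, and the decoder reconstructs the full frame. The crop function and the data layout are assumptions made for the example.

import numpy as np

def build_apca(I_full, crop):
    """I_full: (K, N) full frames as columns; crop maps a full vectorized frame to its foreground.
    Returns the encoder-side foreground model and the decoder-side pseudo components."""
    I_f = np.stack([crop(I_full[:, n]) for n in range(I_full.shape[1])], axis=1)
    I0, I0_f = I_full.mean(axis=1), I_f.mean(axis=1)
    A_f = I_f - I0_f[:, None]
    _, b = np.linalg.eigh(A_f.T @ A_f)
    b = b[:, ::-1]                                    # sort by descending Eigenvalue
    phi_f = A_f @ b
    norms = np.linalg.norm(phi_f, axis=0)
    phi_f /= norms                                    # foreground Eigenimages, equation 5.2
    phi_p = (I_full - I0[:, None]) @ b / norms        # pseudo principal components, equation 5.5
    return I0_f, phi_f, I0, phi_p

def apca_encode(frame_f, I0_f, phi_f, M):
    return phi_f[:, :M].T @ (frame_f - I0_f)          # equation 5.3: foreground only

def apca_decode(alpha, I0, phi_p):
    return I0 + phi_p[:, :len(alpha)] @ alpha         # equation 5.6: full frame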

5.1.1 Reduction of complexity for the encoder

The complexity of encoding is directly dependent on the spatial resolution of the frame that should be encoded. The important factor for the complexity is K·M, where K is the number of pixels and M is the chosen number of Eigenvectors. When aPCA is used, the number of pixels k in the selected area gives a factor of n = K/k in complexity reduction.

5.1.2 Reduction of complexity for the decoder

The complexity of decoding can be reduced when a part of the frame is used for both encoding and decoding. In the formulas above we only use the pseudo principal components for the full frame, φ_j^p, for decoding, but if both Φ^p and Φ^f are used for decoding the complexity can be reduced. Only a few principal components of Φ^p are used to reconstruct the entire frame. More principal components from Φ^f are used to add details to the foreground:

$$ \hat{I} = I_0 + \sum_{j=1}^{P} \alpha_j^f \phi_j^p + \sum_{j=P+1}^{M} \alpha_j^f \phi_j^f \qquad (5.8) $$

The result is reconstructed frames with slightly lower quality for the background but with the same quality for the foreground I^f as if only Φ^p was used for reconstruction. The quality of the background is decided by the parameter P: a high P-value will increase the information used for background reconstruction and increase the decoder complexity. A low P-value has the opposite effect. The reduction in complexity (compression ratio CR) is calculated as:

$$ CR = \frac{K(M + 1)}{(1 + P)K + (M - P)k} \qquad (5.9) $$

When k