Michael Ransburg, Dipl.-Ing.

CODEC-AGNOSTIC DYNAMIC AND DISTRIBUTED ADAPTATION OF SCALABLE MULTIMEDIA CONTENT

DISSERTATION

zur Erlangung des akademischen Grades Doktor der Technischen Wissenschaften

Alpen-Adria-Universität Klagenfurt, Fakultät für Technische Wissenschaften
1. Begutachter: Univ.-Prof. Dr. Hermann Hellwagner
Institut: Multimedia Communication, Institute of Information Technology, Faculty of Technical Sciences, Klagenfurt University

2. Begutachter: Univ.-Prof. Dr. Rik Van de Walle
Institut: Multimedia Lab, Department of Electronics and Information Systems, Faculty of Engineering, Ghent University - IBBT

May 2008
Ehrenwörtliche Erklärung

Ich erkläre ehrenwörtlich, dass ich die vorliegende wissenschaftliche Arbeit im Sinne des §51 Abs. 2 Z. 8 bzw. §51 Abs. 2 Z. 13 Universitätsgesetz 2002 selbstständig angefertigt und die mit ihr unmittelbar verbundenen Tätigkeiten selbst erbracht habe. Ich erkläre weiters, dass ich keine anderen als die angegebenen Hilfsmittel benutzt habe. Alle aus gedruckten, ungedruckten oder dem Internet im Wortlaut oder im wesentlichen Inhalt übernommenen Formulierungen und Konzepte sind gemäß den Regeln für wissenschaftliche Arbeiten zitiert und durch Fußnoten bzw. durch andere genaue Quellenangaben gekennzeichnet. Die während des Arbeitsvorganges gewährte Unterstützung einschließlich signifikanter Betreuungshinweise ist vollständig angegeben. Die wissenschaftliche Arbeit ist noch keiner anderen Prüfungsbehörde vorgelegt worden. Diese Arbeit wurde in gedruckter und elektronischer Form abgegeben. Ich bestätige, dass der Inhalt der digitalen Version vollständig mit dem der gedruckten Version übereinstimmt. Ich bin mir bewusst, dass eine falsche Erklärung rechtliche Folgen haben wird.
Unterschrift:
Klagenfurt, 28. Mai 2008
Contents

List of Tables
List of Figures
Danksagung
Acknowledgements
Kurzfassung
Abstract

I Introduction and overview

1 Introduction
  1.1 Context
  1.2 Outline
  1.3 Contributions

2 Adaptation of scalable media
  2.1 Introduction
  2.2 Scalable media
    2.2.1 Introduction
    2.2.2 MPEG-4 Scalable Video Codec
    2.2.3 Motion Compensated - Embedded Zero Block Coding
    2.2.4 MPEG-4 Advanced Audio Coding: Bit Slice Arithmetic Coding
    2.2.5 MPEG-4 Visual Elementary Streams
    2.2.6 Summary
  2.3 Codec-specific versus codec-agnostic adaptation
    2.3.1 Introduction
    2.3.2 Codec-specific adaptation
    2.3.3 Codec-agnostic adaptation
    2.3.4 Summary
  2.4 Stateless, stateful and application-aware adaptation mechanisms
    2.4.1 Introduction
    2.4.2 Stateless adaptation
    2.4.3 Stateful adaptation
    2.4.4 Application-aware adaptation node
    2.4.5 Stateless, stateful and application-aware adaptation: a combined walkthrough
    2.4.6 Summary

3 gBSD-driven adaptation of scalable media
  3.1 Introduction
  3.2 Application scenario
  3.3 The Digital Item Adaptation standard
  3.4 Summary

II Codec-agnostic dynamic and distributed adaptation of scalable media

4 gBSD-driven dynamic and distributed adaptation of scalable media
  4.1 Motivation and scope
  4.2 Related work
  4.3 Streaming Instructions
    4.3.1 XML Streaming Instructions
    4.3.2 Media Streaming Instructions
    4.3.3 Properties Style Sheet
    4.3.4 Summary
  4.4 Synchronized storage, processing and transport of Process Units and media fragments
    4.4.1 Introduction
    4.4.2 Evaluation of media and metadata transport mechanisms
    4.4.3 Transport formats and strategies
    4.4.4 Storage of Process Units
    4.4.5 Summary
  4.5 Using Streaming Instructions to enable dynamic and distributed adaptation
    4.5.1 Approach
    4.5.2 Example
    4.5.3 Adaptation proxy implementation
    4.5.4 Intercepting adaptation proxy implementation
    4.5.5 Summary
  4.6 Conclusions and original contributions

5 Dynamic and distributed adaptation of scalable media based on the Generic Scalability Header
  5.1 Motivation and scope
  5.2 Related work
  5.3 Syntax and semantics of the Generic Scalability Header
    5.3.1 Identifier
    5.3.2 Base Header
    5.3.3 Scalability Unit Size
    5.3.4 Bitrate Info
    5.3.5 Layer Boundaries
    5.3.6 Update Data Length
    5.3.7 Update Truncation Points
    5.3.8 Truncation Points Location
    5.3.9 Summary
  5.4 Examples of describing the scalability features of a codec with the GSH
    5.4.1 MPEG-4 Scalable Video Codec
    5.4.2 MPEG-4 Advanced Audio Coding: Bit Slice Arithmetic Coding
    5.4.3 MPEG-4 Visual Elementary Streams
    5.4.4 Summary
  5.5 Enabling different types of adaptation nodes with the GSH
    5.5.1 Stateless adaptation node
    5.5.2 Stateful adaptation node
    5.5.3 Application-aware adaptation node
    5.5.4 Summary
  5.6 Using the Generic Scalability Header to enable dynamic and distributed adaptation in an intercepting adaptation proxy
  5.7 Conclusions and original contributions

III Evaluation and discussion

6 Evaluation and comparison
  6.1 Introduction
  6.2 Fragmentation of media and metadata
    6.2.1 Test setup
    6.2.2 Results
  6.3 Compression of metadata for transport
    6.3.1 Test setup
    6.3.2 Results
  6.4 Transformation of metadata
    6.4.1 Test setup
    6.4.2 Results
  6.5 gBSD-based adaptation proxy
    6.5.1 Test setup
    6.5.2 Results
  6.6 gBSD-based and GSH-based intercepting adaptation proxies
    6.6.1 Test setup
    6.6.2 Results
    6.6.3 Discussion of results
  6.7 Conclusions and original contributions

7 Summary and conclusion

Bibliography
List of Tables

2.1 Classification of adaptation nodes for scalable media in streaming scenarios
4.1 XML Streaming Instructions properties
4.2 Semantics of different puModes
4.3 Media Streaming Instructions properties
4.4 Advantages and disadvantages of different metadata transport mechanisms
4.5 Semantics of IBMFF fields
4.6 Example values for IBMFF fields
5.1 Mapping of temporal priority to temporal resolution
5.2 Mapping of spatial priority to spatial resolution
5.3 Flags in the GSH base header
5.4 Example of layer avg bitrate and decoded layer avg bitrate values
5.5 SVC test content for GSH annotation
5.6 GSH values for SVC test content
5.7 BSAC test content for GSH mapping
5.8 VES test content for GSH mapping
5.9 GSH values for VES test content
6.1 Characteristics of test data for Streaming Instructions evaluation
6.2 Characteristics of test data for gBSD transformation evaluation
6.3 Characteristics of test data for performance comparison of different adaptation mechanisms
List of Figures

2.1 Classification of adaptation options for scalable media
2.2 Hierarchy of pictures in SVC
2.3 Hierarchy of pictures in SVC: after dropping a layer
2.4 Hierarchy of pictures in SVC including spatial / CGS scalability
2.5 SVC fully scalable bitstream representation (adopted from [1])
2.6 SVC fully scalable bitstream representation with adaptation path
2.7 EZBC fully scalable bitstream representation (adopted from [1])
2.8 BSAC scalable bitstream representation (adopted from [1])
2.9 VES scalable bitstream representation (adopted from [1])
2.10 Codec-specific adaptation
2.11 Codec-agnostic adaptation
2.12 Stateless adaptation node
2.13 Stateful adaptation node
2.14 Application-aware adaptation node
2.15 A combined walkthrough
3.1 Dynamic and distributed adaptation use case
3.2 gBSD-based adaptation approach
4.1 Processing related to XML Streaming Instructions
4.2 Examples of different puModes
4.3 Processing related to Media Streaming Instructions
4.4 Examples of different auModes
4.5 Mapping of Streaming Instructions attributes to RTP header fields
4.6 Relationship between the different file formats (adopted from [2])
4.7 Timed metadata in the IBMFF
4.8 Dynamic gBSD-based adaptation approach
4.9 DANAE adaptation architecture
4.10 IP Tables overview
4.11 Codec-agnostic gBSD-based adaptation in an intercepting adaptation proxy
4.12 Dynamic gBSD-based adaptation approach using regular expressions
5.1 SVC fully scalable bitstream representation with adaptation paths
5.2 Codec-agnostic GSH-based adaptation in an intercepting adaptation proxy
6.1 Maximum throughput of the media and XML fragmenters
6.2 Compression factors for gBSD PUs
6.3 Time needed to transform a PU which describes an SVC NALU
6.4 Time needed to transform a PU which describes an SVC/BSAC AU
6.5 Time needed to transform a PU which describes an SVC GoP
6.6 Time needed to transform a gBSD which describes 3000 SVC/BSAC AUs
6.7 gBSD-driven adaptation of 1 to 5 BSAC streams: memory utilization and CPU load
6.8 gBSD-driven adaptation of 1 to 5 QCIF SVC streams: memory utilization and CPU load
6.9 gBSD-driven adaptation of 1 to 5 QCIF EZBC streams: memory utilization and CPU load
6.10 CPU load distribution for gBSD-based adaptation
6.11 CPU load distribution for gBSD-based adaptation (optimized by avoiding gBSDtoBin)
6.12 Throughput of gBSD-based adaptation
6.13 Throughput of gBSD-based adaptation (optimized by avoiding gBSDtoBin)
6.14 Throughput of codec-specific and GSH-based adaptation
6.15 Metadata overhead of GSH-based and gBSD-based adaptation
Danksagung

Hier möchte ich allen Personen danken, die diese Arbeit unterstützt haben.

In erster Linie danke ich meinem Betreuer, Prof. Dr. Hermann Hellwagner, für die fruchtbaren Diskussionen und die technischen Ratschläge. Herzlicher Dank gebührt auch Prof. Dr. Rik Van de Walle für die Zweitbegutachtung dieser Arbeit.

Weiters danke ich meiner Familie, die mir die nötige Unterstützung gegeben hat, um diese herausfordernde Arbeit durchführen zu können. Besonderen Dank richte ich an meine Eltern, Herbert und Judith, die mich immer bestmöglich unterstützten.

Allen Kollegen aus dem DANAE EU IST-Projekt und aus der MPEG-Standardisierung gebührt mein besonderer Dank.

Spezieller Dank gebührt Sylvain Devillers für die Zusammenarbeit während der MPEG-Standardisierung und des DANAE EU IST-Projektes. Er trug zu den neuen Mechanismen zur Fragmentierung von Metadaten bei, die in dieser Arbeit präsentiert werden. Insbesondere der Properties-Style-Sheet-Mechanismus wurde von ihm ausgearbeitet.

Schlussendlich möchte ich allen Kollegen der Multimedia Communication Group im Speziellen und allgemein allen Kollegen des Instituts für Informationstechnologie danken, deren kooperative und dennoch wettbewerbliche Umgebung den Fortschritt dieser Arbeit unterstützte.
Acknowledgements

At this point I want to thank all those who supported this thesis in any way.

First, I want to thank Prof. Dr. Hermann Hellwagner, the supervisor of this thesis, for his technical advice and encouraging discussions. Prof. Dr. Rik Van de Walle acted as the second referee for this work, for which I also thank him.

Furthermore, many thanks go to my whole family, who gave me the necessary motivation during this challenging work. In particular, I would like to thank my parents, Herbert and Judith, who always supported me in the best possible way.

All colleagues from the DANAE EU IST project and from MPEG standardization deserve my particular thanks for the good cooperation and many fruitful brainstorming sessions in diverse locations.

Special thanks go to Sylvain Devillers for the cooperation both in MPEG standardization and in the DANAE EU IST project. In particular, he contributed to the novel mechanisms for the fragmentation of metadata which are presented in this thesis; the Properties Style Sheet mechanism was developed by him.

Finally, I thank all my colleagues from the Multimedia Communication Group in particular, and from the Institute of Information Technology in general, for the cooperative yet competitive environment which encourages advancement.
Kurzfassung

Heutzutage wird das Internet von unterschiedlichsten Endgeräten durch verschiedenste Arten der Netzwerkanbindung genutzt. Unabhängig von dieser großen Anzahl verschiedener Benutzungsumgebungen möchten Benutzer die Inhalte in bestmöglicher Qualität konsumieren. Designer neuer Medien-Kodierverfahren reagieren auf diese große Anzahl verschiedener Benutzungsumgebungen, indem sie eine leichte Anpassung der codierten Inhalte an verschiedenste Benutzungsumgebungen vorsehen. Diese so genannten skalierbaren Kodierverfahren, wie zum Beispiel der neue MPEG-4 Scalable Video Codec, erlauben es, den kodierten Inhalt leicht an einen gegebenen Benutzungskontext anzupassen, indem bestimmte Teile des Medien-Bitstroms einfach verworfen werden.

All diese Variablen (unterschiedliche Endgeräte, Netzwerkanbindungen, Benutzerwünsche, Medien-Kodierverfahren, Skalierbarkeitsoptionen) führen zu einer Mannigfaltigkeit von möglichen Anpassungen der Medieninhalte an den Benutzungskontext. Um dieser Komplexität entgegenzutreten, spezifiziert der MPEG-21 Digital Item Adaptation (DIA) Standard XML-Beschreibungen für die Medieninhalte, die Adaptierungsmöglichkeiten und die Benutzungsumgebung. Die relevanten Beschreibungen sind: 1) die "generic Bitstream Syntax Description" (gBSD), die eine generische Sprache spezifiziert, um die Skalierbarkeitsmöglichkeiten eines Medieninhalts zu beschreiben; 2) die "Adaptation Quality of Service Description" (AQoS), welche beschreibt, wie ein Medieninhalt adaptiert werden muss, um einer bestimmten Benutzungsumgebung zu entsprechen, z.B. wie viele Qualitäts-Skalierbarkeitslayer entfernt werden müssen, um die Bitrate auf die derzeit verfügbare Bandbreite zu reduzieren; 3) die "Usage Environment Descriptions" (UEDs), die die Benutzungsumgebung beschreiben, z.B. die verfügbare Bandbreite im Netzwerk. Da all diese Beschreibungen, also alle codec-spezifischen Informationen, gemeinsam mit dem Medieninhalt ausgeliefert werden können, wird dadurch codec-agnostische Adaptierung unterstützt. Es können also Adaptierungsknoten implementiert werden, die unabhängig vom Codec alle skalierbaren Medieninhalte adaptieren können, die durch diese DIA-Beschreibungen beschrieben sind.

Diese Arbeit erweitert den statischen, server-zentrierten, gBSD-basierten Adaptierungsansatz um dynamische und verteilte Adaptierung. Um dies zu erreichen, werden neue Mechanismen zur Fragmentierung, Speicherung und zum Transport von inhaltsbezogenen XML-Metadaten eingeführt. Einen speziellen Beitrag stellt die Einführung des Konzeptes von Samples für Metadaten dar, welche durch Streaming Instructions ermöglicht werden. Die Streaming Instructions steuern die Fragmentierung von XML-basierten Metadaten und weisen den Samples Zeitstempel zu. Dies ermöglicht die dynamische Abarbeitung und zeitliche Synchronisierung von Metadaten und den beschriebenen Medieninhalten. Weiters wird das ISO Base Media File Format erweitert, um die Speicherung solcher Metadatenströme zu ermöglichen. Schließlich wird die Anwendbarkeit des Real-Time Transport Protocols (RTP) für den Transport solcher Metadatenströme untersucht. Ein codec-agnostischer Adaptierungsknoten wird basierend auf diesen neuartigen Mechanismen implementiert und hinsichtlich der Adaptierungsperformance für verschiedene Arten von skalierbaren Medien evaluiert. Präzise Messungen mit diesen Testdaten zeigen, welche Teile des gBSD-basierten Adaptierungsmechanismus am meisten von Optimierungen profitieren können.

Zusätzlich wird ein Mechanismus spezifiziert, der auf einem neuartigen binären Header basiert und ebenso wie der gBSD-basierte Ansatz die codec-agnostische Adaptierung von Medieninhalten ermöglicht. Dieser Generic Scalability Header (GSH) wird jedem Medienpaket vorangestellt und basiert auf den Konzepten der gBSD-basierten Adaptierung. Er beinhaltet Informationen über die Bitstream-Syntax und die möglichen Adaptierungen. Er kombiniert daher die Informationen, die in MPEG-21 DIA gBSD und AQoS enthalten sind, und ermöglicht eine wesentliche Steigerung der Laufzeiteffizienz. Auch dieser Mechanismus wird für verschiedene Arten von skalierbaren Medien evaluiert, und die beiden Ansätze (GSH-basierte Adaptierung und gBSD-basierte Adaptierung) werden sowohl miteinander als auch mit einem codec-spezifischen Ansatz verglichen.

Eine abschließende Diskussion analysiert die Resultate der quantitativen und qualitativen Evaluierung der beiden Mechanismen. Insbesondere zeigen die Ergebnisse, dass für den MPEG-4 Scalable Video Codec und MPEG-4 Visual Elementary Streams der Durchsatz des GSH-basierten Ansatzes nur 1,25-mal geringer ist als der Durchsatz einer codec-spezifischen Implementierung. Weiters beträgt der Metadaten-Overhead des GSH-basierten Ansatzes nur 1 Prozent. Der gBSD-basierte Mechanismus ist rechenintensiver und bringt einen 10-mal geringeren Durchsatz und bis zu 10 Prozent Metadaten-Overhead im Vergleich zur codec-spezifischen Implementierung. Wir schlussfolgern, dass, abhängig vom Anwendungsfall, beide Mechanismen (gBSD-basierte und GSH-basierte Adaptierung) gute Alternativen zu existierenden, codec-spezifischen Implementierungen sein können. Insbesondere in Anwendungsfällen mit Medieninhalten, die mit mehreren verschiedenen skalierbaren Codecs codiert werden, kann die Flexibilität der codec-agnostischen Mechanismen ihre verringerte Performance ausgleichen.
Abstract

Today's Internet is accessible to diverse end devices through a wide variety of network types. Independently of this wide range of usage contexts, content consumers desire to retrieve content with the best possible supported quality. The designers of new media codecs react to this diversity of usage contexts by including adaptation support in the codec design. Scalable media codecs, such as the new MPEG-4 Scalable Video Codec, make it possible to retrieve different qualities of the media content by simply discarding certain media segments. All these variables (different end devices, network types, user preferences, media codec types, scalability options) lead to a multitude of needed and possible adaptation operations. In order to counter this complexity, the MPEG-21 Digital Item Adaptation (DIA) standard specifies a set of descriptions (and related processes) which describe the media content, the adaptation possibilities and the usage context in the XML domain. The relevant descriptions are: 1) the generic Bitstream Syntax Description (gBSD), which uses a generic language to describe, for instance, the parts of a media content which may be removed for scalability purposes; 2) the Adaptation Quality of Service Description (AQoS), which describes how (segments of) a media content need(s) to be adapted in order to correspond to the various usage contexts, e.g., how many quality layers need to be dropped to match the currently available network bandwidth; and 3) the Usage Environment Descriptions (UEDs), which describe the usage context, e.g., the available network bandwidth. Since all of these descriptions, i.e., all codec-specific information, are provided together with the media content, they enable codec-agnostic adaptation nodes, which support any type of scalable media that is properly described by those DIA descriptions.

This thesis extends the static, server-based, gBSD-driven adaptation mechanism towards dynamic and distributed environments. To achieve this, novel mechanisms for the fragmentation, storage and transport of content-related XML metadata are introduced. One particular contribution is the introduction of the concept of samples for metadata by employing Streaming Instructions, which steer the fragmentation of and provide timing for XML-based metadata. This enables the synchronized processing of such a metadata stream with the described media samples. Furthermore, investigations of the ISO Base Media File Format show how such metadata streams can be stored for later processing. Finally, the applicability of the Real-Time Transport Protocol (RTP) is analyzed for the transport of such metadata streams. A codec-agnostic adaptation node based on these novel mechanisms is implemented and evaluated with regard to its adaptation performance for different types of scalable media. Extensive measurements with these scalable media contents show which parts of the gBSD-based adaptation process could benefit most from optimization.

Additionally, a mechanism based on a novel binary header to enable codec-agnostic adaptation of media content is specified. This Generic Scalability Header (GSH) prefixes each media packet payload and is based on the concepts of the gBSD-based adaptation mechanism. It provides information on both the bitstream syntax and the adaptation options and therefore combines (some of) the information provided by the MPEG-21 DIA gBSD and AQoS descriptions, while enabling codec-agnostic adaptation at a considerably lower performance cost. As above, the adaptation performance of this mechanism is evaluated for several types of scalable media. Finally, both mechanisms are implemented in the same adaptation architecture and compared to each other, and additionally to a codec-specific adaptation approach, using several types of scalable media. A concluding discussion analyzes the results of the quantitative and qualitative evaluation of both mechanisms.

Most notably, the measurements show that for the MPEG-4 Scalable Video Codec and MPEG-4 Visual Elementary Streams, the GSH-based mechanism's throughput is only about 1.25 times lower than that of the codec-specific mechanism, while the metadata overhead is less than 1 percent. The gBSD-based mechanism comes at a higher cost for these codecs (about 10 times lower throughput and a maximum of 10 percent metadata overhead with compression). We conclude that, depending on the application scenario, both mechanisms can be viable alternatives to existing codec-specific adaptation approaches. In particular, in scenarios where contents encoded with diverse (and potentially changing) scalable media codecs need to be adapted, the flexibility of codec-agnostic approaches can outweigh their reduced performance.
Part I
Introduction and overview
CHAPTER 1

Introduction

1.1 Context
Today, multimedia content is accessible to diverse end devices through a wide variety of network types. Additionally, content consumers desire to retrieve content not only with the best possible supported quality, but also fulfilling their personal usage preferences. This requires offering the multimedia content to the content consumer adapted to this wide variety of usage contexts in order to maximize the content consumer's Quality of Experience (QoE). At the moment, content and service providers mostly rely on the stream-selection paradigm in order to counter this variety of usage contexts. This means that multiple variations of the same content are stored in different qualities and are separately offered for download or streaming. However, this is not optimal, since each content variation demands additional hard disk space and it is unrealistic (and inefficient) to assume that there exists a variation for each possible usage context. Instead, the content variations represent approximate reactions to the usage contexts which might be encountered. While this approach works, it is not optimal, since a content variation which does not quite fit the current usage context can still lead to a degradation in the QoE for the content consumer. In order to react to this, the designers of new media codecs attempt to include adaptation support in the codec design. These scalable media codecs support the generation of a degraded version of the original bitstream by simply removing bitstream segments. Depending on which segments are removed, the adapted version can represent a lower quality in one or more scalability dimensions. A version with a lower frame rate can be retrieved by removing segments belonging to the temporal dimension. Similarly, a version with a lower spatial resolution can be retrieved by removing segments belonging to the
spatial dimension. Finally, the quality of the resulting video may be reduced by removing segments of the quality dimension, which results in a lower visual quality of the media content. While some of these dimensions are only meaningful for visual media content, there are also scalable codecs for audio. Scalable codecs already provide good means to counter the problems described above, but they do not counter the heterogeneity introduced by the variety of scalable media codecs itself. That is, while all scalable contents have in common that a degraded version can be extracted by simply removing bitstream segments, the bitstream syntax of the encoded content is still different for each of these codecs. Therefore, mechanisms were specified to enable this adaptation in a codec-agnostic way. One such mechanism is specified by the Moving Picture Experts Group (MPEG) in its MPEG-21 Digital Item Adaptation (DIA) standard [3]. This mechanism relies on XML descriptions which describe the scalability properties of a scalable media content in a codec-agnostic way through so-called generic Bitstream Syntax Descriptions (gBSDs). MPEG specifies two different types of Bitstream Syntax Descriptions (BSDs), i.e., codec-specific BSDs and generic BSDs (gBSDs). While the concepts and mechanisms in this thesis are intended to be applicable to both types of BSDs, we focus on gBSDs to evaluate the concepts introduced in this work. gBSD-based adaptation represents a comprehensive mechanism for codec-agnostic adaptation; however, it is limited to static adaptation, i.e., the complete media content is adapted only once before it is provided to the content consumer. This is inefficient in streaming scenarios, where the usage environment may dynamically change during the streaming session.
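The principle described above can be illustrated with a small, purely hypothetical sketch: an adapter that drops bitstream segments whose temporal, spatial or quality layer exceeds a target, by inspecting only generic per-segment metadata and never parsing the codec itself. The names and structure here are invented for illustration and do not follow the actual MPEG-21 gBSD schema.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    """A bitstream segment with generic, codec-agnostic layer metadata."""
    payload: bytes
    temporal_layer: int  # 0 = base frame rate
    spatial_layer: int   # 0 = base resolution
    quality_layer: int   # 0 = base visual quality

def adapt(segments, max_temporal, max_spatial, max_quality):
    """Keep only segments at or below the target layers. Only the
    metadata is inspected; the payload bytes are treated as opaque."""
    return [s for s in segments
            if s.temporal_layer <= max_temporal
            and s.spatial_layer <= max_spatial
            and s.quality_layer <= max_quality]

# Example: a tiny stream with two temporal and two quality layers.
stream = [
    Segment(b"I-frame", 0, 0, 0),
    Segment(b"enh-q",   0, 0, 1),
    Segment(b"B-frame", 1, 0, 0),
    Segment(b"enh-q",   1, 0, 1),
]
# Halve the frame rate and drop the quality enhancement layer:
adapted = adapt(stream, max_temporal=0, max_spatial=0, max_quality=0)
```

The point of the sketch is that the same `adapt` function serves any codec whose segments are annotated this way; only the metadata producer needs codec knowledge.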
In this thesis we focus on dynamic and distributed codec-agnostic adaptation, i.e., we envision one or more codec-agnostic adaptation nodes along the delivery chain which dynamically adapt the media content to the current usage environment in streaming scenarios. We describe the advantages of such adaptation nodes over codec-specific adaptation mechanisms and quantitatively compare them with each other.
1.2 Outline
In the following we describe the structure of this thesis. In the next chapter, we introduce the adaptation of scalable audiovisual media. To this end, we first introduce different types of scalable media, i.e., three scalable video codecs and one scalable audio codec. Since this thesis focuses on codec-agnostic adaptation, we subsequently describe the difference
between traditional codec-specific adaptation and codec-agnostic adaptation. Finally, we describe different types of adaptation nodes which differ with regard to their efficiency and adaptation flexibility. This is relevant for subsequent chapters, where we show how our novel mechanisms support these different types of adaptation nodes.

Chapter 3 concludes this Part by introducing MPEG-21 DIA, i.e., a codec-agnostic adaptation mechanism which is based on the idea of transferring the adaptation into the XML domain. This mechanism is the foundation for our contribution in Chapter 4, where we extend it towards dynamic and distributed adaptation scenarios. In this context we introduce novel mechanisms for the fragmentation and transport of content-related XML metadata and its synchronization with the described media data. We describe the implementation of these mechanisms to enable two different types of adaptation nodes, i.e., a regular adaptation proxy and an intercepting adaptation proxy.

Chapter 5 further explores the concept of dynamic and distributed codec-agnostic adaptation. While conceptually still based on the mechanisms introduced in Chapters 3 and 4, we move away from the XML domain and focus on performance. To this end, we introduce a mechanism which is based on a novel binary header to enable codec-agnostic adaptation of media content. This Generic Scalability Header (GSH) prefixes each media packet payload and is based on the concepts of the gBSD-based adaptation mechanism. We introduce its syntax and semantics, provide examples, and show how it supports the different types of adaptation nodes which were introduced in Chapter 2. Finally, we describe the implementation of an intercepting adaptation proxy based on the GSH.

The final Part of this thesis evaluates the implementation of the novel mechanisms for dynamic and distributed codec-agnostic adaptation in Chapter 6.
We also compare the different approaches which were introduced in Chapters 3 and 4 with codec-specific adaptation as introduced in Chapter 2. This enables us to measure the overhead of codec-agnostic adaptation compared to codec-specific adaptation. Finally, Chapter 7 concludes this thesis.
1.3 Contributions
The major contributions of this thesis are in the area of codec-agnostic adaptation of scalable media content.
First, this thesis extends the static gBSD-based mechanism towards dynamic and distributed environments. To achieve this, novel mechanisms for the fragmentation, storage and transport of content-related XML metadata are introduced. One particular contribution is the introduction of the concept of “samples” for metadata by employing Streaming Instructions for XML-based metadata. Measurements show that the actual media adaptation process is most expensive with regard to CPU load, closely followed by the gBSD transformation process. We investigate optimizations for both tasks which help to substantially increase the throughput of a gBSD-based adaptation node. Further measurements include an evaluation of the efficiency of the Streaming Instructions mechanism and a comparison of different compression mechanisms for gBSD samples.

Two codec-agnostic adaptation nodes based on these novel mechanisms are evaluated with regard to their adaptation performance for several types of scalable media. The first of these nodes acts as a regular proxy, i.e., the client is aware of it. The second implementation acts as an intercepting proxy, i.e., the client is not aware of it. This results in a much simpler implementation, which we subsequently use for throughput measurements.

Additionally, a mechanism based on a novel binary header to enable codec-agnostic adaptation of media content is specified. This Generic Scalability Header (GSH) prefixes each media packet payload and is based on the concepts of the gBSD-based adaptation mechanism. It provides information on both the bitstream syntax and the adaptation options and therefore combines (some of) the information provided by the MPEG-21 DIA gBSD and AQoS descriptions. However, it enables codec-agnostic adaptation at a considerably lower performance cost. Again, the adaptation performance of this mechanism is evaluated for several types of scalable media using the intercepting proxy described above.
Finally, a traditional codec-specific adaptation mechanism is implemented within the intercepting proxy and compared to the two codec-agnostic adaptation mechanisms. The results compare the performance of the GSH-based mechanism, gBSD-based adaptation and codec-specific adaptation with regard to throughput and metadata overhead. Most notably, the measurements show that for the MPEG-4 Scalable Video Codec (see Section 2.2.2) and MPEG-4 Visual Elementary Streams (see Section 2.2.5) the GSH-based mechanism's throughput is only about 1.25 times lower than that of the codec-specific mechanism, while the metadata overhead is less than 1 percent. The gBSD-based mechanism comes at a higher cost for test content encoded with these codecs (about 10 times lower throughput and a maximum of 10 percent metadata overhead using compression). For the MPEG-4 Bit
Slice Arithmetic Coding scalable audio codec (see Section 2.2.4) the two codec-agnostic mechanisms perform considerably worse; we analyze the reasons for this in Section 6.6.

To summarize, this thesis introduces novel means and mechanisms for codec-agnostic adaptation of scalable media content, with a special focus on in-network adaptation. It analyzes different means and mechanisms to accomplish this and shows that codec-agnostic adaptation does not need to be expensive with regard to system and network resources.

Our contributions in the domain of gBSD-driven dynamic and distributed adaptation of scalable media originate in [4], in which we initially describe our ideas on the fragmentation and streaming of metadata and related media content. In [5] we focus on evaluating the different means for the transport of content-related metadata as described in Sections 4.4.2 and 4.4.3. Work performed in the DANAE1 project includes the contribution of our concepts as leader of the “Distributed Adaptation Framework” task. We then approached both the Multimedia Description Schemes (MDS) and Systems groups at MPEG with our concepts on gBSD-driven dynamic and distributed adaptation of scalable media. This resulted in 41 authored or co-authored MPEG input contributions, which led to 4 international standards for which we served as a co-editor [6][7][3][8]. [6] includes the novel concepts on fragmentation and streaming of metadata, which are described in Section 4.3. In [7] we contribute the novel approach to the storage of timed metadata, as described in Section 4.4.4, which is tested for conformance in [8]. Finally, [3] integrates the concepts on fragmentation and streaming of metadata with the existing DIA standard. Our work also led to two MPEG core experiments [9][10], and we contributed to the whitepaper on MPEG-21 Digital Item Adaptation [11]. Overall, our work in this area resulted in nine papers [4][5][12][13][14][15][16][17][18] published in conference proceedings.
An Elsevier journal paper is currently under review [19]. A book [20] containing a chapter on “Dynamic and Distributed Multimedia Content Adaptation based on the MPEG-21 Multimedia Framework” [21] will be available from June 2008. A complete list of contributions is available online2.
1 DANAE, http://danae.rd.francetelecom.com
2 http://www.itec.uni-klu.ac.at/~mransbur
CHAPTER 2

Adaptation of scalable media

2.1 Introduction
In this introduction we attempt to qualitatively categorize the different options for the adaptation of scalable media. See Section 2.2 for an introduction to the scalable media codecs which are used in this thesis. As shown in Figure 2.1 and detailed in Section 2.3, we see two main categories for the adaptation of scalable multimedia content: codec-agnostic adaptation and codec-specific adaptation. Independent of the chosen approach, the adaptation can be applied statically, i.e., only once at the beginning, or dynamically, i.e., continuously throughout the media consumption. Chapter 4 details this differentiation. Each approach can be classified by its adaptation complexity/flexibility, as detailed in Section 2.4. Stateless adaptation mechanisms, for example, are very simple (and thus scalable), since they do not need to store any information about what occurred previously. However, they are also very limited in their adaptation capabilities.

Figure 2.1: Classification of adaptation options for scalable media

Finally, we provide typical examples for each category, some of which are described or introduced in this work: The gBSD-based mechanism (and the closely related BSD-based mechanism) represents a static, codec-agnostic adaptation mechanism, as introduced in Chapter 3. We extend this mechanism towards dynamic adaptation by employing Streaming Instructions, as described in Chapter 4. gBSD-based adaptation is usually applied in application-aware and stateful adaptation nodes, since it depends on various types of metadata (AQoS, UED, ...) which are valid for more than one media packet and are therefore stored in a state. We also introduce a novel mechanism based on a so-called Generic Scalability Header (GSH) that represents an alternative to the gBSD-based adaptation approach and is also applicable to stateless adaptation, as described in Section 5.5. For codec-specific adaptation there are many well-known examples, such as stream switching and transcoding. Both of them are usually employed in static adaptation approaches. Some exceptions exist, such as SVC-to-AVC transcoding by simple header rewriting operations [22], which is also employed in dynamic adaptation scenarios. Note that these two approaches are not specific to scalable media. Codec-specific adaptation of scalable media can be performed either by directly inspecting the media header or by relying on information in the transport protocol header as it is provided, for example, by the Real-Time Transport Protocol (RTP) for some scalable codecs. We compare an implementation of a codec-specific adaptation approach which relies on the media header to the codec-agnostic approaches in Part III. There, we also introduce various quantitative metrics which can be used to measure the performance of the various adaptation approaches illustrated in Figure 2.1.
2.2 Scalable media

2.2.1 Introduction
As described in Section 1.1, scalable codecs make it possible to retrieve a degraded version of the encoded media by keeping only specific segments of the bitstream. This degradation can be performed in various scalability dimensions, depending on which dimensions are offered by the specific codec. In the following we analyze the scalability features of four different media codecs, which we use to evaluate our mechanisms throughout this thesis. Please note that our description of these codecs is not comprehensive. We focus on the scalability features of the codecs (and here in particular on the MPEG-4 Scalable Video Codec) and refer the reader to further reading material where appropriate.
2.2.2 MPEG-4 Scalable Video Codec

2.2.2.1 Introduction
In this section we provide a high level overview and elaborate on the scalability features of the MPEG-4 Scalable Video Codec (SVC) which has recently been standardized as an amendment to the MPEG-4 Advanced Video Codec (AVC) [23]. As such it is a block-based hybrid scalable video codec with an AVC-compatible base layer.
Figure 2.2: Hierarchy of pictures in SVC

2.2.2.2 Video Coding Layer
Similarly to other video codecs, a video encoded with SVC consists of a sequence of pictures, i.e., access units. Each access unit therefore contains all data which is necessary to decode exactly one picture. There are basically three different types of pictures. Intra-coded pictures (I pictures) do not use any information prior to their location in the bitstream, i.e., they provide random access into the stream. Note that SVC also provides switching pictures which enable random access at non-I pictures [24]. Predictively-coded pictures (P pictures) and bidirectionally-predictive-coded pictures (B pictures) exploit the temporal redundancy in a video by only encoding those pixels that differ compared to a reference picture. While P pictures only exploit the temporal redundancy to a previous reference picture, B pictures can exploit the redundancy to both previous and future reference pictures. An additional important fact is that there are strict dependency rules for the different types of pictures. Specifically, P pictures may only rely on other P pictures or I pictures for exploiting temporal redundancy. B pictures on the other hand may depend on any type of picture for exploiting temporal redundancy. The concept of B pictures depending on other B pictures enables hierarchical B pictures [25]. The leaf B pictures of this hierarchy are not needed by any other picture in the bitstream. This hierarchy of pictures is illustrated in Figure 2.2, where 4 temporal layers (T0 to T3) are shown. The hierarchy of B pictures is an important scalability property, as it enables temporal scalability. In Figure 2.2 the pictures are already divided into layers based on their dependencies. Thus, one can remove all pictures of the highest layer (T3 in Figure 2.2) without any dependency, i.e., decoding, problems. This would reduce the frame rate of the bitstream
and result in the bitstream shown in Figure 2.3, at which point the adaptation process can be repeated if needed. Note that the number of temporal layers is configurable by changing the Group of Pictures (GoP) size.

Figure 2.3: Hierarchy of pictures in SVC: after dropping a layer

In addition to the temporal dimension, SVC content can also be scaled in the spatial dimension. That is, different spatial resolutions can be embedded in the same bitstream, e.g., the Common Intermediate Format (CIF, 352x288 pixels) and 4x Common Intermediate Format (4CIF, 704x576 pixels). This is achieved by encoding pictures as multiple Video Coding Layer (VCL) units. It must be noted that a VCL unit does not necessarily correspond to an encoded picture, since the fundamental unit of processing of the VCL is actually a slice, and a picture might contain one or more slices. For further reading on the composition of encoded pictures, we refer to [26]. In our example above, the first VCL unit contains all information needed to decode the picture at CIF resolution and the second VCL unit contains the additional information needed to decode the picture at 4CIF resolution. This makes it possible to reduce the spatial resolution by simply disregarding all VCL units belonging to the 4CIF resolution. Figure 2.4 illustrates this concept. In addition to the four temporal layers (T0 to T3) there are now two spatial layers (S0 and S1) embedded in the bitstream. In addition to the intra-layer prediction between the different pictures (as described above), inter-layer prediction is used as well in such a bitstream. This means that each picture in S1 predicts from its equivalent picture in S0, i.e., the first I picture in S1 predicts from the first I picture in S0, the first B picture in S1 predicts from the first B picture in S0, and so on. The same mechanism can be used to achieve scalability in the quality dimension.
However, in this case the additional information from the second VCL unit is not used for upsampling the picture to 4CIF resolution, but rather to enhance the visual quality (i.e., reduce the number of visual artifacts) for the CIF resolution. This type of scalability is referred to as Coarse Grained Scalability (CGS).

Figure 2.4: Hierarchy of pictures in SVC including spatial / CGS scalability

All three scalability dimensions have in common that between any two switching pictures only complete layers can be dropped; e.g., one may not drop only every second B picture of the third layer in Figure 2.2. Switching pictures can be I pictures or P pictures which are specially encoded to allow layer switching. For more information the interested reader is referred to [24]. For scalability in the quality dimension, finer-grained scalability was desired. Therefore, Medium Grained Scalability (MGS) was introduced. MGS is performed in the same manner as CGS, however with the difference that the MGS layer can be changed at every access unit rather than only at switching pictures. That is, VCL units belonging to an MGS layer can be removed individually at every access unit, thus allowing, e.g., the bitrate of the SVC content to be adjusted at a finer granularity. However, in order to selectively decide which VCL unit belongs to which temporal, spatial or quality layer, this information needs to be added to the VCL unit, which is the task of the Network Abstraction Layer (NAL), as discussed in Section 2.2.2.4.
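The complete-layer rule for temporal scalability can be sketched as follows. The model is deliberately minimal, with layer ids given as plain tuples rather than parsed NALU headers:

```python
# Each tuple models one VCL unit: (temporal_id, payload).
# Dropping temporal layers means removing *all* units whose temporal id
# is at or above the chosen cut-off -- partially removing a layer is not
# allowed between two switching pictures.
def drop_temporal_layers(units, max_tid):
    return [(tid, data) for tid, data in units if tid <= max_tid]

gop = [(0, "I"), (3, "b"), (2, "B"), (3, "b"), (1, "B"),
       (3, "b"), (2, "B"), (3, "b")]          # 4 temporal layers, T0..T3

half_rate = drop_temporal_layers(gop, 2)      # drop T3: frame rate halved
print([tid for tid, _ in half_rate])          # [0, 2, 1, 2]
```

Because the leaf B pictures (T3) are referenced by no other picture, the remaining stream stays decodable, and the operation can be repeated layer by layer down to T0.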
2.2.2.3 Parameter Sets and Supplemental Enhancement Information
Parameter Sets (PSs) and Supplemental Enhancement Information (SEI) messages are non-VCL units. A PS contains information which applies to a large number of VCL units of a specific layer and which would be inefficient to encode in each VCL unit. The spatial resolution of a video segment of a specific layer is an example of information which is included in a PS. SEI messages provide supplemental data which is not necessary for the decoding process, but which may be helpful for the processing of the bitstream. Most importantly, they carry layer boundary information which indicates the highest values of temporal level, quality level and dependency id (see Section 2.2.2.4) for all VCL units of the media stream. Additionally, they contain bitrate information for each layer of the scalable stream.

2.2.2.4 Network Abstraction Layer
In order to, among other things, selectively decide which VCL units belong to which layer, the so-called Network Abstraction Layer Unit (NALU) header is added in front of each VCL unit. This prefixing of VCL units with a NALU header is the second step, after encoding the pictures, in composing an SVC stream, and the resulting units are called Network Abstraction Layer Units (NAL Units or NALUs). The NALU header is shown in Listing 2.1.

Listing 2.1: NALU Header
+---------------+---------------+---------------+---------------+
|0|1|2|3|4|5|6|7|0|1|2|3|4|5|6|7|0|1|2|3|4|5|6|7|0|1|2|3|4|5|6|7|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| F | NRI | Type  |R| I |  PRID   |N| DID | QID |  TID  |U|D|O|RR|
+---------------+---------------+---------------+---------------+
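The layout of Listing 2.1 can be unpacked with simple bit operations. The sketch below assumes the standard SVC field widths (F: 1, NRI: 2, Type: 5, R: 1, I: 1, PRID: 6, N: 1, DID: 3, QID: 4, TID: 3, U/D/O: 1 each, RR: 2 bits) and ignores the single-byte AVC-compatible headers and prefix NALUs:

```python
def parse_svc_nalu_header(hdr: bytes) -> dict:
    """Unpack the 4-byte SVC NALU header of Listing 2.1 into its fields."""
    assert len(hdr) >= 4
    b0, b1, b2, b3 = hdr[:4]
    return {
        "forbidden":   b0 >> 7,           # F (1 bit)
        "nri":         (b0 >> 5) & 0x3,   # NRI (2 bits)
        "type":        b0 & 0x1F,         # Type (5 bits)
        "prid":        b1 & 0x3F,         # PRID (6 bits), after R and I
        "did":         (b2 >> 4) & 0x7,   # DID (3 bits), after N
        "qid":         b2 & 0xF,          # QID (4 bits)
        "tid":         b3 >> 5,           # TID (3 bits)
        "discardable": (b3 >> 3) & 0x1,   # D flag (1 bit), after U
    }

# Example header bytes (invented): PRID=5, DID=1, QID=2, TID=3, D set
fields = parse_svc_nalu_header(bytes([0x74, 0x05, 0x12, 0x68]))
```

An adaptation node only needs this cheap header parse, not a full decode, to decide whether a NALU may be dropped.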
In the following we focus on those fields of the header which are important with regard to adaptation. These are the priority id (PRID), temporal id (TID), dependency id (DID), quality id (QID) and the discardable flag (D). The priority id is a 6 bit field which specifies an application-specific priority setting. Originally, the specification required that a lower id indicates a higher priority; however, this constraint was removed at a very late stage of the standardization process. Section 5.4.1 describes one of the potential reasons for this, namely a case where a lower value does not always indicate a higher priority. The priority id makes it possible to describe adaptation paths, which specify the order in which enhancement layers are truncated, as described in Section 2.2.2.5.
The dependency id is a 3 bit field which specifies the inter-layer dependency for CGS and spatial scalability. NALUs with a higher dependency id can depend on NALUs with a lower dependency id, but never the other way around. This means that if NALUs with dependency id x are dropped, then any NALUs with dependency id x or greater must be dropped as well between any two switching pictures. In order to find out whether a NALU provides CGS or spatial enhancement, the Sequence Parameter Set (SPS) for the current bitstream segment needs to be investigated. This dependency between NALU header and SPS (among others) shows that while the NALU header does provide certain information which can help adaptation, its main application is still the decoding of the bitstream, rather than its adaptation. The quality id is a 4 bit field which specifies the quality level hierarchy of an MGS NALU. Similar to above, MGS NALUs of a higher level depend on MGS NALUs of a lower level, thus the highest level(s) can be removed for quality scalability. The temporal id is a 3 bit field which specifies the temporal level hierarchy of the current NALU. The same rules as above apply, i.e., the highest level(s) are to be removed first for temporal scalability. The discardable flag is a 1 bit flag which indicates whether the current NALU is needed for decoding NAL units of the current picture. Additionally, if set, this NALU is not needed by any other NALU in subsequent pictures which have a greater dependency id than the current NALU, i.e., such NAL units can be discarded without risking the integrity of higher layers with greater dependency id. When also taking into account the dependency id definition, this means that NALUs which have the discardable flag set can always be disregarded without restriction. AVC VCL units are kept in the bitstream for backwards compatibility with AVC players. 
In case of such an AVC VCL unit, a so-called prefix NALU, which consists of a start code plus a 4 byte SVC header, is inserted in front of this AVC VCL unit in order to describe the scalability characteristics of the following AVC NALU. Optionally, each NALU is prefixed by a start code, which is represented by a three-byte constant (0x000001). This is needed for parsing NALUs from a file or extracting them from transport protocols (e.g., the MPEG-2 Transport Stream) which do not provide such a boundary on their own. Figure 2.5 shows three encoded pictures of an SVC bitstream which includes combined scalability, i.e., scalability in all dimensions. On the x-axis there are three pictures belonging
to different temporal layers as indicated by their temporal ids. Thus, for scaling this stream in the temporal domain, the rightmost picture would be removed first, as it has the highest temporal id. Furthermore, on the y-axis we see the base layer and two enhancement layers in the CGS / spatial domain, as indicated by the dependency id. Finally, on the z-axis we see two MGS enhancement NALUs for each NALU, with the quality ids 1 and 2 respectively (not visible in the figure).

Figure 2.5: SVC fully scalable bitstream representation (adopted from [1])

2.2.2.5 Adaptation paths
The priority id makes it possible to describe adaptation paths, i.e., a specification of the order in which enhancement layers are truncated. To achieve this, it is assumed that a lower id indicates a higher priority, i.e., NALUs belonging to an enhancement layer which shall be dropped first receive the highest priority id. The optimal adaptation path(s) depend on both the content characteristics and the usage environment of the end user. For example, the experienced quality of an adapted movie which contains scenes with rapid camera movements benefits from keeping the framerate high and thus dropping non-temporal enhancement layers first. Another example is a mobile device with a small, low-resolution screen. There, an adaptation path would indicate that spatial layers are to be dropped first, since the difference in spatial resolution will be barely visible on such a device. If only a single adaptation path is described, the priority id is sufficient. Figure 2.6 shows an example of an adaptation path, in which first the quality
layers and then the temporal layers are dropped, while the spatial resolution is kept at the highest level. The adaptation path does not have to end at [T0,D2,Q0] (as shown in the figure) but can continue with dropping a spatial layer and then again quality layers, etc., until only the base layer [T0,D0,Q0] is left; this was omitted in the figure for clarity. For the adaptation path indicated in this figure, there are 8 enhancement layers which can be dropped. Thus, NALUs belonging to the enhancement layer [T2,D2,Q2] (which shall be dropped first) receive the highest priority id (i.e., 8), the enhancement layer [T2,D2,Q1] receives the priority id 7, and so on. Note that any adaptation path is always restricted by the inter-layer and intra-layer dependencies as shown in Figure 2.4. For example, the next temporal layer to be dropped according to the priority id must have a lower value for the temporal id than the current layer. If there is only a single adaptation path, then considering the priority id alone is sufficient. However, as Figure 2.6 indicates, multiple adaptation paths are possible. Depending on the initial quality (which may be decided, e.g., based on the display capabilities of the end device), an adaptation path may start at [T2,D2,Q2] (as shown in the figure), but also at [T2,D1,Q2] or [T2,D0,Q2]. In this case it is no longer sufficient to consider only the priority id in order to detect the end of an adaptation path; rather, the priority id and the temporal, spatial and quality ids need to be considered together. Specifically, if the next layer to be dropped (according to the priority id) has higher values for any of the temporal or dependency ids than the current layer, then the next layer starts a new adaptation path. An example of the case of multiple adaptation paths is provided in the context of the Generic Scalability Header description in Section 5.4.1.
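The boundary rule just stated can be sketched as follows. The layer tuples and priority values are invented for illustration, and a lower priority id is assumed to mean a higher priority (so layers are dropped in order of decreasing priority id):

```python
# A layer is modeled as (prid, tid, did, qid). Layers are dropped in
# order of decreasing priority id. A new adaptation path starts when the
# next layer to drop has a *higher* temporal or dependency id than the
# current one, since it cannot be a further degradation of the same path.
def split_adaptation_paths(layers):
    ordered = sorted(layers, key=lambda l: -l[0])  # drop highest prid first
    paths, current = [], [ordered[0]]
    for prev, nxt in zip(ordered, ordered[1:]):
        if nxt[1] > prev[1] or nxt[2] > prev[2]:   # tid or did increases
            paths.append(current)
            current = []
        current.append(nxt)
    paths.append(current)
    return paths

# Two hypothetical paths: one starting at [T2,D2,Q2], one at [T2,D1,Q2]
layers = [(8, 2, 2, 2), (7, 2, 2, 1), (6, 1, 2, 1),
          (5, 2, 1, 2), (4, 2, 1, 1)]
paths = split_adaptation_paths(layers)
print(len(paths))   # 2
```

Within one path, each drop step follows the priority id alone; the temporal and dependency ids are only needed to detect where one path ends and the next begins.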
2.2.2.6 Conclusion
In this section we reviewed the SVC codec and its adaptation capabilities in the temporal, spatial and quality domains. We reviewed the VCL units which are output by the encoder and the NAL which augments the VCL units with useful header information. Additionally, we introduced PSs and SEI messages, which provide useful non-VCL data. Furthermore, we introduced the concept of adaptation paths, which specify the order in which enhancement layers shall be dropped. For further reading the reader is referred to [22].
Figure 2.6: SVC fully scalable bitstream representation with adaptation path
2.2.3 Motion Compensated - Embedded Zero Block Coding
In this section we provide a high level overview and elaborate on the scalability features of the Motion Compensated - Embedded Zero Block Coding (EZBC) video codec [27]. EZBC is based on wavelets, in contrast to the block-based hybrid SVC codec. From an adaptation point of view, however, both have very similar characteristics, that is, scalability in three dimensions, i.e., temporal, spatial and quality (as described in Section 2.2.2). Each EZBC stream starts with a general header which includes information on the complete bitstream. This is very similar to the concept of PSs in SVC. This header includes the maximum framerate, the maximum spatial resolution and other information which is valid for the complete stream and which is used for decoder initialization. After this header, there is another header which provides the sizes of all GoPs in the bitstream. Providing all these sizes at the start of the stream fulfills a similar goal as SVC's start codes for NALUs, i.e., it provides a parser with the GoP boundaries. Note that this mechanism is less robust than the SVC approach; e.g., in the case that a GoP is lost, dropped or truncated, the header containing the GoP sizes will no longer be correct. Additionally, this approach only works for the offline encoding of bitstreams, where their size is known in advance, in contrast to live encoding, where it is not. Our approach to counter this deficiency in robustness and applicability in live streaming scenarios is to adopt the start code mechanism from SVC in order to separate GoPs in the stream.
Figure 2.7: EZBC fully scalable bitstream representation (adopted from [1])

That is, GoPs are separated by start codes (similar to how NALUs are separated by start codes in SVC) instead of using the header which contains all GoP sizes. The GoP size header is followed by a sequence of GoPs. Each GoP includes a GoP header with the motion vectors for all pictures. Subsequently, the individual pictures are present in the GoP. As with SVC, each picture can be separated into several units, i.e., a base unit and several spatial enhancement units. Each unit can have quality enhancements which are realized through bit plane encoding, which is again similar to the SVC MGS approach. As such, EZBC provides adaptation possibilities similar to those of SVC, as shown in Figure 2.7, however with a different bitstream structure. For further reading the reader is referred to [27].
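The robustness argument can be illustrated with a toy model (invented GoP payloads; emulation prevention, which real codecs use to keep start-code bytes out of payloads, is not modeled): slicing by a size table fails once a GoP is truncated, while scanning for start codes still recovers the remaining boundaries.

```python
START = b"\x00\x00\x01"

def split_by_sizes(stream: bytes, sizes):
    """EZBC-style: slice GoPs using a size table recorded up front."""
    gops, pos = [], 0
    for size in sizes:
        gops.append(stream[pos:pos + size])
        pos += size
    return gops

def split_by_start_codes(stream: bytes):
    """SVC-style: recover GoP boundaries by scanning for start codes."""
    return [p for p in stream.split(START) if p]  # drop empty leading chunk

gops = [b"GOP-A", b"GOP-B!", b"GOP-C"]
sizes = [len(g) + len(START) for g in gops]
stream = b"".join(START + g for g in gops)

# Truncate the middle GoP by two bytes, as an in-network adapter might:
adapted = stream.replace(b"GOP-B!", b"GOP-")

print(len(split_by_start_codes(adapted)))   # still 3 GoPs found
# split_by_sizes(adapted, sizes) would now misalign everything after GOP-B.
```

The size table also cannot exist before the whole stream is encoded, which is why it rules out live encoding, whereas start codes are emitted incrementally.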
2.2.4 MPEG-4 Advanced Audio Coding: Bit Slice Arithmetic Coding
In this section we provide a high level overview and elaborate on the scalability features of the MPEG-4 Bit Slice Arithmetic Coding (BSAC) audio codec [28]. Since BSAC is an audio codec, it is constrained to scaling in the quality dimension. However, in this dimension it provides a type of scalability which we have not introduced yet, i.e., Fine Grained Scalability (FGS). Each BSAC frame, which is similar to a picture for a video codec, consists of a base layer and an enhancement layer in the quality dimension. This is comparable to the CGS/MGS
mechanism which we introduced in Section 2.2.2. Figure 2.8 shows the scalability options for BSAC.

Figure 2.8: BSAC scalable bitstream representation (adopted from [1])

FGS is more flexible than MGS. MGS is already very flexible, since MGS segments can be removed individually, i.e., it is not necessary to remove a complete layer. FGS builds on top of this and adds the feature that the FGS unit (i.e., the enhancement layer) can be truncated individually at each layer element. This enables a very fine granular adjustment to usage environment constraints, as each enhancement layer can consist of up to 64 layer elements. Note that this granularity could also be achieved with MGS, but at a considerable overhead due to the required NALU header for each MGS VCL unit. BSAC also includes a header, similar to the NALU header, which is rather limited with regard to provisions for adaptation. Listing 2.2 shows the start of this header, which includes the frame length and the top layer fields. The frame length field provides the length of the frame in bytes. As such it basically fulfills the same task as the start code in SVC, however with the big disadvantage that this field needs to be updated after each adaptation. The top layer field provides the number of layer elements in the enhancement layer. This is needed for calculating the layer element addresses and also needs to be updated after each adaptation. In order to compute the size of the base layer and the addresses of the layer elements of the enhancement layer, a large part of the frame itself needs to be decoded. Similarly to the NALU header, this shows the decoder-oriented design of these frames, which is often
suboptimal for adaptation purposes.

Listing 2.2: BSAC header
+---------------+---------------+---------------+---------------+
|0|1|2|3|4|5|6|7|0|1|2|3|4|5|6|7|0|1|2|3|4|5|6|7|0|1|2|3|4|5|6|7|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| h leng | s |   top layer   |              ...              |
|                        frame length                        |
+---------------+---------------+---------------+---------------+

Figure 2.9: VES scalable bitstream representation (adopted from [1])
While the scalability options of BSAC are much more limited than those of a fully scalable video codec such as SVC, the design of the BSAC headers and bitstream makes it more expensive to exploit its scalability. For further reading the reader is referred to [29].
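The bookkeeping described above (truncating the enhancement layer and then rewriting the frame length and top layer fields) can be sketched in a few lines of Python. This is an illustrative fragment, not the thesis implementation: the real BSAC header is bit-packed and the layer element offsets must be obtained by partially decoding the frame, so the byte positions used here (frame length in the first two bytes, top layer in the third) are simplifying assumptions.

```python
def truncate_bsac_frame(frame: bytes, layer_offsets, keep_layers: int) -> bytes:
    """Truncate a BSAC frame to `keep_layers` enhancement-layer elements.

    layer_offsets[i] is the byte offset at which enhancement-layer
    element i starts; deriving these offsets requires partially
    decoding the frame, as discussed above.  The field positions
    used below are illustrative, not the real bit-packed layout.
    """
    if keep_layers >= len(layer_offsets):
        return frame                       # nothing to truncate
    cut = layer_offsets[keep_layers]       # first byte to discard
    out = bytearray(frame[:cut])
    out[0:2] = cut.to_bytes(2, "big")      # rewrite frame length
    out[2] = keep_layers                   # rewrite top layer
    return bytes(out)
```

Note that both header fields have to be rewritten on every adaptation step, which is exactly the disadvantage compared to the SVC start code mechanism discussed above.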
2.2.5
MPEG-4 Visual Elementary Streams
In this section we provide a high-level overview of MPEG-4 Visual Elementary Streams (VES) [30] and elaborate on their scalability features. VES is an older and already widely deployed codec. As such, it does not provide the comprehensive scalability features of the newer AVC or SVC codecs. In fact, it is limited to temporal scalability, as shown in Figure 2.9. However, the distinction in layers shown in Figure 2.9 is not explicit as in SVC, i.e., any B picture in a VES stream can be dropped individually. This characteristic requires that B pictures never predict from other B pictures, which comes at the cost of coding efficiency.
In the same way as is (optionally) possible with SVC NALUs, pictures in VES are separated by 0x000001 start codes, as shown in Listing 2.3. The start code is followed by an identifier which indicates the type of bitstream segment described by this header. For adaptation purposes, the Video Object Planes (VOPs) are relevant; they are the equivalent of pictures in SVC. Each VOP has an identifier which states whether it is an I, P or B VOP, where the semantics are similar to those of I, P and B pictures in SVC.

Listing 2.3: VES header
+---------------+---------------+---------------+---------------+---+
|0|1|2|3|4|5|6|7|0|1|2|3|4|5|6|7|0|1|2|3|4|5|6|7|0|1|2|3|4|5|6|7|0|1|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|     start code (0x000001)     | ID (182 == VOP) | ID |
+---------------+---------------+---------------+---------------+---+
The scalability options of VES are constrained to the temporal dimension. In contrast to BSAC, no updates to the bitstream are necessary, which makes adaptation through dropping of B VOPs simple and cheap. For further reading the reader is referred to [31].
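This cheap adaptation can be illustrated with a short Python sketch that scans for start codes and removes B VOPs. It assumes a raw VES bitstream held in memory and uses only two facts from the bitstream syntax: a VOP starts with the start code 0x000001 followed by the identifier 0xB6 (182), and the two most significant bits of the following byte carry the vop_coding_type (I = 0, P = 1, B = 2).

```python
def drop_b_vops(stream: bytes) -> bytes:
    """Remove all B VOPs from an MPEG-4 VES bitstream.

    Segments are delimited by 0x000001 start codes; a VOP segment
    starts with 0x000001B6 and its vop_coding_type occupies the two
    most significant bits of the next byte (I=0, P=1, B=2).
    """
    START = b"\x00\x00\x01"
    out = bytearray()
    pos = 0
    while pos < len(stream):
        nxt = stream.find(START, pos + 3)        # next start code
        end = nxt if nxt != -1 else len(stream)
        segment = stream[pos:end]
        is_b_vop = (len(segment) > 4 and segment[3] == 0xB6
                    and (segment[4] >> 6) == 2)
        if not is_b_vop:
            out += segment                        # keep I/P VOPs, headers
        pos = end
    return bytes(out)
```

No length fields need to be rewritten afterwards, which is precisely the property that makes VES adaptation cheap.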
2.2.6
Summary
In this section we introduced the different types of scalable media codecs which we consider in this thesis. MPEG-4 SVC is the latest scalable codec and was only recently standardized. It is a block-based hybrid video codec which provides scalability in three dimensions. EZBC is an older, wavelet-based, scalable video codec which offers similar scalability dimensions to MPEG-4 SVC. Subsequently we introduced the MPEG-4 BSAC scalable audio codec, which enables fine-grained quality scalability by truncating up to 64 layer elements. Finally, we described the widespread MPEG-4 VES video codec, which is restricted to scalability in the temporal dimension.
2.3
Codec-specific versus codec-agnostic adaptation

2.3.1
Introduction
One possible classification of adaptation mechanisms for scalable media is the distinction between codec-specific and codec-agnostic adaptation mechanisms. As motivated in Section 1.1, this thesis focuses on codec-agnostic adaptation; however, in order to put it into context, codec-specific adaptation needs to be considered as well.
Figure 2.10: Codec-specific adaptation
2.3.2
Codec-specific adaptation
Codec-specific adaptation refers to an adaptation mechanism which only operates on content encoded with a specific codec. For example, SVC-specific adaptation mechanisms will consider the NALU headers and SEIs/PSs, but will not be able to adapt VES or BSAC. That is, a dedicated adaptation mechanism needs to be implemented for each content format. Figure 2.10 shows an example of the dataflow of a codec-specific, intercepting adaptation proxy which was implemented in the context of this thesis. Such a proxy is transparent to the client, which reduces its implementation complexity. As shown in the figure, we only use UDP for streaming in this implementation. In most scenarios, advanced transport protocols such as RTP, which provide QoS feedback and inter-stream synchronization mechanisms, are beneficial. We describe an adaptation proxy implementation which is based on RTP in Section 4.5.3. However, for this introduction we use this simple architecture since it enables a focused description. Moreover, it enables a simple and therefore more robust implementation, which is particularly beneficial for the throughput measurements which we conduct in Section 6.6. The packetizer reads a picture (or its equivalent for BSAC and VES) of the bitstream from the hard disk and sends it as (a) UDP packet(s) towards the client in a time-aware fashion. Then the adaptation proxy:
1. retrieves this packet,
2. parses and interprets it (interpretation of NALU, VOP, or BSAC headers),
3. modifies the packet's data (dropping of NALUs, VOPs, or truncation of BSAC enhancement layers),
4. updates the UDP header according to the modified data (checksum, length), and
5. forwards the modified packet to the client or drops the complete packet.

Finally, the client processes the adapted bitstream, e.g., decodes and renders it on a display or plays it on an audio output device.

Figure 2.11: Codec-agnostic adaptation
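Steps 1 to 5 of the codec-specific proxy can be condensed into the following Python sketch. The `adapt` callable stands in for the codec-specific parsing and modification logic (steps 2 and 3) and is a hypothetical parameter; step 4 comes for free here, since the operating system recomputes the UDP length and checksum when the payload is re-sent.

```python
import socket

def run_intercepting_proxy(sock, client_addr, adapt):
    """Minimal codec-specific intercepting adaptation proxy.

    `sock` is a bound UDP socket, `adapt` a codec-specific callable
    returning the adapted payload, or None to drop the whole packet.
    """
    while True:
        payload, _ = sock.recvfrom(65535)   # 1. retrieve the packet
        adapted = adapt(payload)            # 2./3. parse, interpret, modify
        if adapted is None:
            continue                        # 5. drop the complete packet
        # 4. the OS rewrites the UDP header (checksum, length) on send
        sock.sendto(adapted, client_addr)   # 5. forward to the client
```

The same loop structure is reused for the codec-agnostic variant below; only the `adapt` step changes.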
2.3.3
Codec-agnostic adaptation
In contrast to codec-specific adaptation mechanisms, codec-agnostic adaptation requires only a single type of adaptation node. This is achieved by amending the media packets with generic metadata describing their priority and structure. The basic dataflow is similar to the codec-specific adaptation detailed in Section 2.3.2. The adaptation proxy:
1. retrieves the next packet,
2. parses and interprets the generic metadata (MD),
3. modifies the packet's data (dropping of NALUs, VOPs, or truncation of BSAC enhancement layers) based on the generic metadata (and updates the metadata to correspond to the modified packet data),
4. updates the UDP header according to the modified data (checksum, length), and
5. forwards the modified packet to the client or drops the complete packet.
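The difference to the codec-specific proxy is confined to steps 2 and 3: the node consults generic metadata instead of codec headers. A minimal illustration in Python, using a hypothetical metadata record with a byte range and a priority per unit (the standardized gBSD representation is discussed later in this thesis):

```python
from dataclasses import dataclass

@dataclass
class UnitMD:
    offset: int    # byte offset of the unit within the payload
    length: int    # length of the unit in bytes
    priority: int  # lower value = more important

def adapt_generic(payload, md, max_priority):
    """Keep only units whose priority is within the limit and rewrite
    the metadata so that it corresponds to the modified payload.
    The node never inspects codec-specific headers."""
    out, new_md = bytearray(), []
    for unit in md:
        if unit.priority <= max_priority:
            new_md.append(UnitMD(len(out), unit.length, unit.priority))
            out += payload[unit.offset:unit.offset + unit.length]
    return bytes(out), new_md
```

The same function adapts SVC, VES or BSAC payloads alike, as long as the accompanying metadata describes their structure.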
As described above, instead of directly interpreting codec-specific headers, the adaptation node adapts based on codec-agnostic (i.e., generic) metadata (MD), as illustrated in Figure 2.11. Note that we do not detail the different types of codec-agnostic headers which already exist; for this the reader is referred to the related work presented in Chapters 4 and 5.
2.3.4
Summary
In this section we introduced the difference between codec-agnostic and codec-specific adaptation. Codec-specific adaptation directly uses the information provided by the codec; however, this means that the adaptation node needs to be aware of the characteristics of each codec and thus needs a separate implementation for each codec. Codec-agnostic adaptation, on the other hand, does not have this limitation, but requires additional codec-agnostic information (which causes overhead) on which the adaptation mechanism is based.
2.4
Stateless, stateful and application-aware adaptation mechanisms

2.4.1
Introduction
In Section 2.3 we reviewed the different types of adaptation nodes with regard to their awareness of the codec. In this section we differentiate the types of adaptation nodes by their complexity and adaptation flexibility (i.e., stateless, stateful and application-aware), which directly influences their performance but also their capabilities. While we will use a codec-specific adaptation node for SVC to describe the concepts, it is noted that they can be applied to any type of adaptation node.
2.4.2
Stateless adaptation

2.4.2.1
Introduction
Stateless adaptation nodes have no information about what occurred previously. Such adaptation nodes have the advantage of being very performant, but are also limited in their adaptation capabilities. They make use of either the SVC discardable flag or the priority id. Since these adaptation nodes are stateless, they can only adapt what is currently in their buffer. Optimal buffer management for such stateless nodes is a considerable challenge (which we do not cover in this thesis): starting to drop packets too early results in a waste of bandwidth
Figure 2.12: Stateless adaptation node (Upstream Network → Adaptation Node → Downstream Network)
while starting to drop packets too late might result in the need to drop random packets (because there are no more packets with the discardable flag set in the buffer). Using the priority id (PRID, or simply P) resolves this problem, since this field provides up to 64 levels of priority. This is already a good improvement in adaptation capabilities over the approach involving the discardable flag. However, as with the previous approach, it relies on a predefined adaptation path as described in Section 2.2.2.5.

2.4.2.2
Application scenario
Since this kind of adaptation node is stateless and thus very runtime-efficient, its deployment in the network is very flexible. In fact, it can be placed anywhere in the network without restrictions, as illustrated in Figure 2.12. It basically acts as a media-aware network element (e.g., an intercepting adaptation proxy) which considers the priority id and the discardable flag for intelligent buffer management.

2.4.2.3
Walkthrough
As noted above, this stateless adaptation node is assumed to have a buffer of limited size. If there is too much traffic going through the adaptation node, the buffer will overflow. Without being content-aware (e.g., SVC-aware), random packets are dropped when the buffer overflows. This can result in dropping a packet which, e.g., belongs to an I picture, which has severe consequences for the quality of the service, since a complete GoP may become unusable. By employing a content-aware adaptation node this can be avoided at minimal processing overhead. An abstract walkthrough (the actual steps are implementation-specific) of the processing of such an adaptation node looks as follows.
1. A new packet is retrieved.
2. The packet is put into the buffer.
3. An adaptation decision taking engine (ADTE) is called to see if there is a need to thin the buffer.
4. The ADTE returns the amount of data by which the buffer needs to be thinned (ideally 0). This adaptation decision only depends on the current state of the buffer.
5. The adaptation node removes packets from the buffer until the data reduction requirement from the ADTE is fulfilled. The removal is based on:
   a. The discardable flag (drop packets which have it set).
   b. The priority id (drop packets with a low priority).
   c. Both the discardable flag and the priority id (in which case there needs to be a weighting between the two).

The usefulness of this adaptation node completely depends on the content and/or service provider, which need to set the discardable flag and the priority id in a meaningful way. While the discardable flag does not provide much flexibility - it can either be set, e.g., for B frames, or not - the priority id enables a much more fine-grained dropping policy (2^6 = 64 possible values).
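A possible realization of the removal step (5) is sketched below; the packet representation is hypothetical, and a production node would additionally need the weighting between the two fields mentioned in item c.

```python
def thin_buffer(buffer, bytes_to_free):
    """Stateless buffer thinning (step 5 of the walkthrough).

    `buffer` holds packets as (data, discardable, priority_id) tuples.
    Discardable packets are dropped first, then packets in order of
    descending priority id (a higher id meaning less important here),
    until the requested amount of data has been freed.
    """
    # order drop candidates: discardable first, then least important
    candidates = sorted(range(len(buffer)),
                        key=lambda i: (not buffer[i][1], -buffer[i][2]))
    freed, to_drop = 0, set()
    for i in candidates:
        if freed >= bytes_to_free:
            break
        to_drop.add(i)
        freed += len(buffer[i][0])
    return [p for i, p in enumerate(buffer) if i not in to_drop]
```

Because the decision uses only the buffer contents, the function needs no per-stream state, which is what keeps this node type so runtime-efficient.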
2.4.3
Stateful adaptation

2.4.3.1
Introduction
Utilization of the temporal id (TID, or simply T), dependency id (DID, or simply D) and quality id (QID, or simply Q) enhances the capabilities of the adaptation node considerably. The adaptation node is no longer constrained to a fixed adaptation path, but can now dynamically choose among three different adaptation dimensions. However, in order to facilitate this flexibility, the adaptation node needs
• more context than just the indication that there is a network problem, and
• a state in order to remember previous adaptation decisions and to buffer scalability SEIs.
A state is needed to remember previous adaptation decisions (i.e., the current (P)TDQ limit). This cost comes with the benefit of a much smoother adaptation, since no longer only the current buffer is considered. The scalability SEIs are needed as they provide the mapping from a T, D or Q value to a specific frame rate, spatial resolution or bitrate, respectively. The parameter sets are needed to determine in which dimension (CGS or spatial) a NALU with a higher dependency id enhances the bitstream. More context is needed because the indication that there is a network problem does not help in deciding which adaptation path to follow (temporal, spatial, quality or a combination). Such additional context information could, for example, be the (maximum) display resolution of the terminal(s) connected to the downstream network. By knowing this context, the adaptation node could decide that in case of problems, adaptation along the spatial axis should be performed first because the display resolution of the terminal(s) is below the current resolution of the bitstream. Similarly, a decision could be to reduce the temporal resolution because it is known that the terminal(s) do not have the processing power to support the current frame rate of the bitstream. Much of this context information is only available on the application layer, which leads to the next type of adaptation node (see Section 2.4.4), i.e., application-aware adaptation nodes.

2.4.3.2
Application scenario
The stateful adaptation node of this section can again be implemented as an intercepting adaptation proxy which acts as a gateway to a network with a specific set of end devices of which some capabilities are known, as illustrated in Figure 2.13. In contrast to the stateless adaptation node, it has more knowledge about the content and thus more adaptation possibilities. For example, such a proxy could connect the Internet to a UMTS network.

2.4.3.3
Walkthrough
As noted above, this stateful adaptation node has - in addition to the capabilities of the previous adaptation node - a state for each SVC stream and aggregated knowledge of the usage environment. This information can be used to perform a smoother and more precise adaptation. An abstract walkthrough (the actual steps are implementation-specific) of the processing of such an adaptation node looks as follows. Note that this is an extension of the previous walkthrough.
Figure 2.13: Stateful adaptation node (Upstream Network → Adaptation Node → Access Network with known Usage Environment)
1. A new packet of a specific SVC stream is retrieved.
2. The packet is put into the buffer.
3. If the packet is a (scalability) SEI message, it is stored in the state of this SVC stream. If there was a previous SEI message of the same type, the older one is overwritten.
4. The ADTE is called to take an adaptation decision which ultimately determines if and how to thin the buffer.
5. The ADTE returns the amount of data by which the buffer needs to be thinned and/or a new (P)TDQ limit, i.e., if any of the priority id, temporal id, dependency id or quality id values of a NALU is higher than the limit, then the corresponding NALU shall be dropped.
   • This adaptation decision depends on the current state of the buffer and on the current (P)TDQ limit / usage environment.
   • The ADTE might decide to lower the TDQ limit (one or more of the three values). It does this by matching the usage environment with the current TDQ limit. For this it needs to know the semantics of each TDQ value and therefore it needs to consult the current scalability SEI message in the state. For example, the usage environment indicates that devices connected to the downstream network have a maximum screen resolution of QCIF. The adaptation node then sees that the current limit for spatial layers is 5. It then consults the scalability SEI message, which states that 5 means a resolution of CIF. Therefore the adaptation node decides to limit the spatial layers to 4, thereby dropping all CIF NALUs, which are not needed anyway.
6. The adaptation node removes packets from the buffer until the data reduction / (P)TDQ requirement from the ADTE is fulfilled. The removal is based on:
   a. The discardable flag (drop packets which have it set).
   b. The priority id (drop packets with a low priority).
   c. The current TDQ limit.
   d. All (or a combination) of the above.
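The two decisive operations of this walkthrough, mapping the usage environment to a new limit via the buffered scalability SEI (step 5) and testing a NALU against the (P)TDQ limit (step 6), can be sketched as follows. Both helpers are illustrative; the data structures are assumptions, with the resolution table standing in for the information carried by the scalability SEI message.

```python
def choose_spatial_limit(resolution_by_did, max_display, current_limit):
    """Pick the highest dependency id whose resolution still fits the
    display capabilities of the downstream network (e.g., lower the
    limit from 5 (CIF) to 4 (QCIF) for QCIF-only terminals)."""
    best = 0
    for did, (w, h) in resolution_by_did.items():
        if did <= current_limit and w <= max_display[0] and h <= max_display[1]:
            best = max(best, did)
    return best

def within_limit(nalu, limit):
    """A NALU is kept only if none of its priority_id (P), temporal_id
    (T), dependency_id (D) and quality_id (Q) values exceeds the
    current (P)TDQ limit."""
    return all(nalu[k] <= limit[k] for k in ("P", "T", "D", "Q"))
```

With these helpers, step 6 reduces to filtering the buffer with `within_limit` after the ADTE has fixed the new limit.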
2.4.4
Application-aware adaptation node

2.4.4.1
Introduction
These kinds of adaptation nodes are powerful devices which are aware of application-layer information. They are not very runtime-efficient, but very capable with regard to adaptation possibilities. They are not only aware of all the information in the NALU header, but also buffer SEI messages which provide additional information such as the bitrate, resolution and frame rate of each layer. In addition to being stream-aware, they are also aware of the individual terminals and their capabilities. Together with application-layer context such as the available bandwidth, processing power, remaining battery power, etc., this enables very fine-tuned adaptation to the individual contexts of the end devices.

2.4.4.2
Application scenario
Due to the low runtime efficiency (but high adaptation capability) of this adaptation node, an example application is a set-top box for a multi-person household, which enables different members of the family to watch the same live stream individually adapted to their preferences and equipment. This is illustrated in Figure 2.14; an extended application scenario is provided in Section 3.2.

2.4.4.3
Walkthrough
This application-aware adaptation node has, in addition to the previous adaptation nodes, a state for each end device and knowledge of the usage environment of each end user. This information can be used to perform an adaptation which is specific to each end device. In order to accomplish this, the adaptation node must act as a proxy, which replicates the
Figure 2.14: Application-aware adaptation node (Upstream Network → Adaptation Node → individual streams → Consumer Access Network)
incoming SVC stream for each connected device and receives the individual usage environment of each of them. In contrast to the intercepting proxy, a client needs to be aware of this proxy in order to provide it with its usage environment description. An abstract walkthrough (the actual steps are implementation-specific) of the processing of such an adaptation node looks as follows. Note that this is an extension of the previous walkthroughs. This walkthrough is mostly identical to the walkthrough in Section 2.4.3. The important difference (which is highlighted in the amended walkthrough below) is that the adaptation is performed individually for each stream, based on the usage environment specific to the consumption of the stream.
1. A new packet of a specific SVC stream is retrieved.
2. The packet is put into the buffer.
3. If the packet is a (scalability) SEI message, it is stored in the state of this SVC stream. If there was a previous SEI message of the same type, the older one is overwritten.
4. The packet is copied into a dedicated buffer for each end device.

The following actions are performed for each end-device buffer:
1. The ADTE is called to take an adaptation decision which ultimately determines if and how to thin the buffer.
2. The ADTE returns the amount of data by which the buffer needs to be thinned and/or a new (P)TDQ limit.
   • This adaptation decision depends on the current state of the buffer and on the current (P)TDQ limit / usage environment.
   • The ADTE might decide to lower the TDQ limit (one of the three values). It does this by matching the usage environment of the current end device with the current TDQ limit. For this it needs to know the semantics of each TDQ value and therefore it needs to consult the current scalability SEI message in the state. For example, the usage environment indicates that the device connected to the downstream network has a maximum screen resolution of QCIF. The adaptation node then notes that the current limit for spatial layers is 5. It then consults the scalability SEI message, which states that 5 means a resolution of CIF. Therefore the adaptation node decides to limit the spatial layers to 4, thereby dropping all CIF NALUs, which are not needed anyway.
3. The adaptation node removes packets from the buffer until the data reduction / (P)TDQ requirement from the ADTE is fulfilled. The removal is based on:
   a. The discardable flag (drop packets which have it set).
   b. The priority id (drop packets with a low priority).
   c. The current TDQ limit.
   d. All (or a combination) of the above.

Note that this walkthrough assumes that the upstream network delivers the SVC stream at the maximum available quality. This is inefficient; additional bandwidth can be saved by having the adaptation node signal the maximum quality currently needed by all connected end devices to the upstream network.
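The per-device replication in step 4 can be sketched as follows; `adapt` is a hypothetical callable representing the per-buffer walkthrough above, returning the adapted copy or None if the packet is dropped for that device.

```python
def replicate_and_adapt(packet, device_limits, adapt):
    """Application-aware replication (step 4 of the walkthrough).

    The incoming packet is copied once per end device, and each copy
    is adapted against that device's individual (P)TDQ limit.
    `adapt(packet, limit)` encapsulates the per-buffer adaptation.
    """
    per_device = {}
    for device, limit in device_limits.items():
        adapted = adapt(packet, limit)
        if adapted is not None:
            per_device[device] = adapted   # dedicated buffer per device
    return per_device
```

This is where the high adaptation capability and the low runtime efficiency of the application-aware node originate: every packet is processed once per connected device.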
2.4.5
Stateless, stateful and application-aware adaptation: a combined walkthrough
The different types are complementary with regard to their capabilities, i.e., the stateful adaptation node is capable of performing everything which can be done by the stateless adaptation node. Combinations of the above adaptation nodes are not only possible but also beneficial, as described in the following scenario.
In this scenario we combine the functionalities of all three adaptation nodes in a single media delivery scenario. We assume an IPTV service provider which delivers its (live) content over non-dedicated networks (e.g., the Internet). The live content stream originates at the service provider access network. It then crosses several autonomous systems (networks) as it is routed towards the intended consumers. In order to efficiently transport this live stream for multiple consumers (and since IP-layer multicast is still not widely available in the Internet), the service provider relies on an overlay network. In this overlay network the single stream is delivered to the access network of the consumers, where it is replicated and adapted to the usage environment (e.g., terminal capabilities) of each end device. This approach helps to save considerable bandwidth compared to traditional unicast delivery without adaptation, in particular if there are several consumers in the same access network. The complete example walkthrough is illustrated in Figure 2.15. The live content stream originates at the Server, encoded with the maximum useful quality. This maximum useful quality can be based on knowledge of the service provider’s customer base, contracts with service level agreements, and so on. As the content stream leaves the Service Provider Access Network it is rate-adapted to the network conditions of the following autonomous system(s) using the Stateless Adaptation Node as described in Section 2.4.2. Several of these adaptation nodes might be employed along the content delivery chain. In particular if there are high-delay autonomous systems along the delivery chain, this helps to quickly react to any network problems which might occur. Since not all consumers share the same consumer access network, the single stream needs to be replicated and routed differently at one stage. This is performed by the Application-aware Adaptation Node as described in Section 2.4.4. 
This adaptation node can already adapt the replicated streams to the common usage environment of each downstream network. Note that if no such adaptation nodes were in place, the replication would need to happen at the server already, effectively resulting in two streams leaving the server. This would come at the price of additional bandwidth, but with the benefit of less complexity in the network. As the two replicated streams are transmitted towards their consumers, they might go through additional replication steps (the layout of the overlay network is a planning problem which is not further explored in this thesis). The streams might also pass Known Networks, i.e., networks with known usage environments, such as UMTS. The Stateful Adaptation Node described in Section 2.4.3 can be employed there to adapt the content stream. As the content stream arrives at the Consumer Access Network, an additional replication / adaptation step (using the
application-aware adaptation node introduced in Section 2.4.4) can be performed if several end devices would like to consume the content through the consumer access network. Thanks to the diverse adaptation nodes which are employed, the consumer can consume the content with a very high Quality of Experience, even though the Service Provider does not employ dedicated networks for delivery (i.e., no resource reservation is performed). In addition, thanks to the overlay multicast network, the Service Provider can offer its service at highly reduced bandwidth requirements (but at the cost of additional adaptation nodes in the network). Advanced media delivery scenarios such as the one above are not available in today's systems. However, they are the focus of several research projects. In particular, the Ambient Networks EU IST project1 has researched and validated such an overlay network for advanced media delivery scenarios [32][33].
2.4.6
Summary
In this section we introduced and compared the different types of adaptation nodes which are enabled by the different types of SVC metadata. The reviewed types of adaptation nodes are listed in Table 2.1. The different types are complementary with regard to their capabilities, e.g., the stateful adaptation node is capable of performing everything which can be done by the stateless adaptation node. Combinations of the above adaptation nodes are not only possible but also beneficial, as shown in Section 2.4.5 in a combined content delivery scenario which employs all the described adaptation nodes.
1 Ambient Networks, http://www.ambient-networks.org
Figure 2.15: A combined walkthrough (Server → Service Provider AN → Stateless AN (2.4.2) → Autonomous System → Stateless AN (2.4.2) → Autonomous System → ... → Appl.-aware AN (2.4.4) → Networks A and B → Stateful AN (2.4.3) → Known Network → Appl.-aware AN (2.4.4) → Consumer AN → End Devices)
Table 2.1: Classification of adaptation nodes for scalable media in streaming scenarios

Stateless (2.4.2)
  Runtime performance - High: relies on precomputed adaptation paths and has no state.
  Adaptation capability - Low: no knowledge of the actual usage environment; only the buffer state is known.

Stateful (2.4.3)
  Runtime performance - Medium: needs additional resources to store a) the state of a stream (SSEIs, (P)TDQ limits, usage environment) and b) to match the current usage environment to a specific adaptation path, i.e., reduction of T, D or Q.
  Adaptation capability - Medium: thanks to (aggregated) knowledge of the usage environment and means to map this usage environment to a specific adaptation path (T, D or Q), it is no longer bound to a single precomputed adaptation path (priority id). Moreover, it can apply a much smoother adaptation, since it is able to remember previous adaptation decisions.

Application-aware (2.4.4)
  Runtime performance - Low: needs to store a state not only for the aggregated usage environment, but for each individual device.
  Adaptation capability - High: can replicate and adapt a stream for each individual device, thus enabling a very fine-tuned adaptation.
CHAPTER 3
gBSD-driven adaptation of scalable media

3.1
Introduction
In this section we introduce the gBSD-driven adaptation approach which is standardized in DIA by MPEG [3]. The DIA approach to enabling codec-agnostic adaptation is conceptually very pleasing, since it follows a modular approach, fulfilling the different requirements with specific tools. While this modular approach may not result in the most runtime-efficient solution, it provides a conceptual foundation for the work in this thesis. We first describe an application scenario which motivates the need for DIA. Subsequently, we describe the relevant features of DIA in the context of this application scenario and provide examples. This introduction by use case enables a focused presentation of DIA, while references are given for further reading.
3.2
Application scenario
In the following, a multimedia delivery and consumption scenario is introduced by means of which the scope and role of the MPEG-21 DIA standard can be illustrated. The scenario and architecture described here demand dynamic and distributed content adaptation techniques, some of which are at the core of the rest of this work. Assume an Internet Service Provider (ISP) offers a new feature which provides customers with live streams of events. Two subscribers use this service to watch a Formula 1 car race. Both persons are in the same room and use their High Definition Television (HDTV) screen to watch this program. After some time, one of them has to leave the room. Since she wants
to continue watching the race, she picks up her tablet PC and instructs it to duplicate the session from the HDTV screen. The Formula 1 program appears on the tablet PC and she can continue watching the race in another room. At a later time, the other person has to go to work. The race is almost over and he does not want to miss the finish. He therefore picks up his UMTS-enabled PDA and instructs the HDTV screen to transfer the session so that he can watch the finish of the race on his way to work. At some point during his commute, the available network bandwidth is reduced so significantly that it prevents him from enjoying the race at a satisfactory quality level. As a consequence, the streaming system replaces the video stream with a slide show, in order to maintain good audio quality and present key moments of the race. The architecture which the ISP relies upon in order to address this scenario is explained below and illustrated in Figure 3.1. The set-top box (STB) in the subscribers' home needs to facilitate content adaptation and therefore has to be context-aware. When the subscribers start watching the Formula 1 race on their HDTV screen, the STB needs to be aware of the terminal capabilities of the HDTV screen, including its display resolution and its media decoding capabilities. If the requested stream has not already been made available to the STB, the STB forwards the request, including the associated terminal capabilities, to the ISP. The ISP uses the terminal capabilities to adapt the media and starts to stream the selected program (i.e., the adapted media stream) to the STB. The STB forwards this adapted stream to the HDTV screen, where it is displayed.
Since the available network bandwidth on the UMTS-enabled PDA depends on various factors such as the location and speed of the subscriber, the live stream also needs to be adapted on-the-fly to actual network conditions in order to always deliver the best possible quality to the consumers. The ISP performs media adaptation according to the users’ network characteristics, which can result in degrading the audio and video quality or even in dynamically switching media types (from one video codec to another or from video to a slideshow, for instance). When, as described above, one of the customers requests to continue watching the selected program on the tablet PC, the device requests the same channel from the STB and provides its terminal capabilities within the request. The STB analyzes the provided device capabilities in order to find out whether the quality of the channel which it receives from the ISP is appropriate, or if it needs to request the media streams in different quality. In
CHAPTER 3. GBSD-DRIVEN ADAPTATION OF SCALABLE MEDIA
Figure 3.1: Dynamic and distributed adaptation use case
fact, in this case, the current quality level is too high given the display resolution and the decoding capabilities of the tablet PC. Therefore, the STB replicates the stream which it already receives in best quality from the ISP and adapts this replicated stream according to the terminal capabilities of the tablet PC. By means of this setup, the ISP saves bandwidth through application-layer multicast: the stream replication and possible adaptation processes are performed at the customers' premises without putting load on the ISP's equipment. Additionally, the architecture enables the customers to transfer or duplicate the session to any device in their vicinity. Furthermore, the adaptation techniques employed in this architecture enable the ISP to provide its content to its subscribers independently of the end-device and network types.
3.3 The Digital Item Adaptation standard
For the above scenario to work, interoperability is required between all of the involved devices:
• The end devices (HDTV screen, PDA, tablet PC) need to report their terminal capabilities in a format which is interpretable by both the STB and the ISP's server.
• The STB needs to be able to interpret and adapt the media content which is sent by the server. Ideally, it should be able to perform these actions in a generic manner, independently of the actual media encoders used by the server.
MPEG-21 DIA provides normative description formats which enable interoperability in both cases above. Device interoperability (i.e., the first item above) is achieved through a unified description model that covers information about user characteristics, terminal capabilities, network conditions, and natural environment properties. This context information is generally referred to as the Usage Environment Description (UED). Two examples of UEDs are shown in Listings 3.1 and 3.2. Listing 3.1 shows the power characteristics of a device, including the remaining battery capacity. An adaptation server which is aware of this property could, for example, decide to switch to a less battery-demanding encoding format in case the battery consumption is too high to allow playing back the complete media stream. Listing 3.2 presents another example of a UED, which provides the currently available bandwidth. An adaptation server can use this information to adapt the media to fit the available bandwidth, e.g., by dropping B VOPs of a VES.
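Although the original listings are not reproduced here, the way a server consumes such context information can be sketched in a few lines of Python. The XML element and attribute names below are illustrative stand-ins, not the exact MPEG-21 DIA schema:

```python
# Hedged sketch: extracting the available bandwidth from a UED-like XML
# fragment. Element and attribute names are illustrative only, not the
# exact MPEG-21 DIA UED schema.
import xml.etree.ElementTree as ET

UED = """
<UsageEnvironment>
  <Network>
    <AvailableBandwidth average="256000" maximum="512000"/>
  </Network>
</UsageEnvironment>
"""

def available_bandwidth(ued_xml: str) -> int:
    """Return the average available bandwidth in bit/s, or 0 if absent."""
    root = ET.fromstring(ued_xml)
    bw = root.find("./Network/AvailableBandwidth")
    return int(bw.get("average", "0")) if bw is not None else 0

print(available_bandwidth(UED))  # 256000
```

An adaptation server would compare such a value against the bitrate of the stream and, if necessary, trigger an adaptation step such as dropping B VOPs.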
Coding format independence (i.e., the second item above) is accomplished by means of generic Bitstream Syntax Descriptions (gBSDs), Adaptation Quality of Service (AQoS) specifications, and Universal Constraints Descriptions (UCDs). The concept of coding-format-independent multimedia content adaptation relies on the characteristics of scalable coding formats, which enable the generation of a degraded version of an original media bitstream by means of simple remove operations followed by minor update operations, e.g., removal of spatial layers of a video and updates of certain header fields as described in Section 2.2.
A gBSD is an XML document which describes a (scalable) bitstream, enabling its adaptation in a codec-agnostic way. Only the high-level bitstream structure is described, i.e., how it is organized in terms of packets, headers, or layers. The level of detail of this description depends on the scalability characteristics of the bitstream and the application requirements. Listing 3.3 shows a gBSD for a VES. Since a VES only provides scalability in the temporal domain, the corresponding gBSD is quite simple. Each VOP of the VES is described by a gBSDUnit, which provides its start and length in bytes. Additionally, the marker attribute indicates which temporal layer (T) this VOP belongs to, i.e., either T0 for P and I VOPs or Ti, i≥1, for B VOPs. The higher the temporal layer, the earlier the corresponding B VOP is dropped.
In the course of content adaptation, the gBSD of a media bitstream is transformed first, followed by the generation of the adapted bitstream from the original one, guided by the modified gBSD. The method for the transformation of the gBSD is not specified in the DIA standard [3]. Current implementations mostly use XSLT, but also STX, for the transformation process. Extensible Stylesheet Language Transformations (XSLT) [34] is a declarative, template-based transformation language for XML documents.
An XSLT processor needs two inputs: an XSLT style sheet that contains the transformation rules expressed in XML, and the input XML document represented as a DOM1 tree. In addition, a set of parameters and parameter values can be passed to the XSLT processor to steer the transformation. The XSLT processor traverses the DOM tree and applies the changes according to the transformation rules defined in the style sheet. Streaming Transformations for XML (STX) [35] uses style sheets with XSLT-like notation to perform the transformation of XML documents. Instead of a DOM representation
Document Object Model, http://www.w3.org/DOM/
of the XML document, the event-based SAX2 approach is used. Structural events are extracted from the input document and passed to the STX processor, which filters or alters the events according to the STX style sheet. In contrast to XSLT, the event-based STX approach does not require keeping the complete document in memory; however, this advantage comes at the cost of generating the SAX events. Additionally, this causes a lack of context information compared to DOM; therefore, STX supports the buffering of events, which makes it as powerful as XSLT. Both XSLT and STX support the codec-agnostic adaptation approach by providing a generic transformation process that is controlled by a codec-specific style sheet which can be provided together with the media content.
A very simple XSLT style sheet which corresponds to the gBSD in Listing 3.3 is shown in Listing 3.4. If this style sheet is applied to the gBSD, it transforms it by either copying the current gBSDUnit (matchAll template) or removing it in case the marker attribute indicates a B VOP (removeBVOPs template).
An AQoS description provides means to select optimal parameter settings for media content adaptation engines in order to satisfy constraints imposed by terminals and/or networks while maximizing QoS. In other words, it provides the parameters for transforming the aforementioned gBSD, e.g., to the style sheet which transforms it. In our example, such a parameter would set the TemporalLimit parameter of the style sheet to an appropriate limit in order to react to the current usage environment. Finally, UCDs restrict the solution space provided by the AQoS description through limitation and optimization constraints. In this thesis we focus on the actual adaptation of the media. For information on the decision-taking process, including examples of AQoS and UCD descriptions, please refer to [36].
Listing 3.1: UED: power characteristics
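The effect of such a style sheet can be sketched as follows. For illustration, the removeBVOPs logic is mimicked in plain Python over an ElementTree rather than invoking a real XSLT processor, and the gBSD element names are simplified (no namespaces):

```python
# Sketch of the B-VOP-dropping transformation described above. The gBSD
# is simplified (plain element names, no namespaces) and the XSLT logic
# is mimicked with ElementTree instead of a real XSLT processor.
import xml.etree.ElementTree as ET

GBSD = """
<gBSD>
  <gBSDUnit start="0"    length="4000" marker="T0"/>
  <gBSDUnit start="4000" length="1200" marker="T1"/>
  <gBSDUnit start="5200" length="1100" marker="T2"/>
  <gBSDUnit start="6300" length="3900" marker="T0"/>
</gBSD>
"""

def drop_temporal_layers(gbsd_xml: str, temporal_limit: int) -> ET.Element:
    """Remove every gBSDUnit whose temporal layer exceeds temporal_limit
    (T0 = I/P VOPs, T1 and above = B VOPs)."""
    root = ET.fromstring(gbsd_xml)
    for unit in list(root):
        layer = int(unit.get("marker").lstrip("T"))
        if layer > temporal_limit:
            root.remove(unit)   # corresponds to the removeBVOPs template
    return root

adapted = drop_temporal_layers(GBSD, temporal_limit=0)
print(len(adapted))  # 2: only the two T0 VOPs survive
```

Here `temporal_limit` plays the role of the TemporalLimit style-sheet parameter: with a limit of 0, all B VOPs are removed; with a limit of 1, only the T2 units would be dropped.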
Simple API for XML, http://www.saxproject.org/
Listing 3.2: UED: network condition
Listing 3.3: generic Bitstream Syntax Description for a VES
<!-- ... new GoP ... -->
Listing 3.4: XSLT style sheet for B VOP dropping
<!-- Match all: default template -->
<!-- Test and remove B-VOPs -->
Figure 3.2 depicts a gBSD-based adaptation process, taking place for instance in the STB of the above scenario. The adaptation comprises an adaptation decision-taking process (performed by an Adaptation Decision-Taking Engine, ADTE) resulting in an adaptation decision, which guides the gBSD transformation. As shown in the figure, the complete gBSD is processed at a specific point in time (t0 in the figure). The transformed gBSD then steers the bitstream generation process [37]. The transformation is performed, e.g., by using standardized XML transformation languages such as XSLT. The gBSD transformation may drop elements within the gBSD that describe, e.g., enhancement layers, leaving only those elements that describe an adapted version of the original bitstream; minor update operations of, e.g., header or parameter fields to reflect the changes to the bitstream may have to be performed subsequently.
The bitstream generation process (gBSDtoBin) parses the transformed gBSD and generates the adapted media bitstream by using the bitstream offsets and parameter values of the remaining gBSD elements. It also optionally generates an updated gBSD with its start and length values aligned to the adapted media bitstream. Only the bitstream segments described by the remaining gBSD elements are copied to the output bitstream, whereas all other segments are skipped. The output bitstream (and optionally its XML metadata, which may be encoded/compressed beforehand in order to reduce its size) is then provided to a media consumer, e.g., an end device or a network node that performs further adaptation steps. Since all of the DIA descriptions (including the transformation instructions for the gBSD, e.g., an XSLT style sheet) are provided together with the media bitstream, adaptation nodes can be codec-agnostic: they can support any type of scalable media which is properly described by such DIA descriptions.
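The copy step performed by gBSDtoBin can be sketched as follows, under the simplifying assumptions that start and length are absolute byte offsets and that namespaces and header updates are omitted:

```python
# Sketch of the gBSDtoBin bitstream generation step: copy only the byte
# ranges named by the remaining gBSDUnit elements of a transformed gBSD.
# Assumes absolute byte offsets and simplified, namespace-free element
# names; real gBSDtoBin additionally rewrites updated header fields.
import xml.etree.ElementTree as ET

def gbsd_to_bin(transformed_gbsd: str, original: bytes) -> bytes:
    """Generate the adapted bitstream guided by the transformed gBSD."""
    root = ET.fromstring(transformed_gbsd)
    out = bytearray()
    for unit in root.iter("gBSDUnit"):
        start = int(unit.get("start"))
        length = int(unit.get("length"))
        out += original[start:start + length]   # copy described segment
    return bytes(out)                           # all other segments skipped

bitstream = bytes(range(10))
gbsd = ('<gBSD><gBSDUnit start="0" length="3"/>'
        '<gBSDUnit start="7" length="2"/></gBSD>')
print(gbsd_to_bin(gbsd, bitstream))  # only bytes 0-2 and 7-8 are kept
```

Dropping a gBSDUnit during the preceding transformation step thus directly removes the corresponding segment from the generated bitstream.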
However, this kind of adaptation has some drawbacks in streaming scenarios, which will be discussed in the following chapters. In particular, a metadata-based solution is presented that allows the processing and delivery of media and metadata in a piece-wise fashion. For further information on the generic gBSD-based adaptation approach, the reader is kindly referred to [38][12][39][40].
3.4 Summary
In this section we introduced, supported by a use case and examples, the gBSD-based adaptation approach. We presented the architecture of this approach, which serves as a
Figure 3.2: gBSD-based adaptation approach
foundation for the gBSD-related work in this thesis.
Part II
Codec-agnostic dynamic and distributed adaptation of scalable media
CHAPTER 4

gBSD-driven dynamic and distributed adaptation of scalable media

4.1 Motivation and scope
The gBSD-based adaptation approach as introduced in Chapter 3 shows that XML-based metadata for describing advanced multimedia content is gaining ever more importance. One purpose of such metadata is to improve access to such content from anywhere and at any time. In the past, two main categories of this kind of metadata have become apparent [41]. The first category of metadata aims to describe the semantics of the content, e.g., by means of keywords, violence ratings, or genre classifications. Metadata standards supporting this category are, e.g., MPEG-7, TV Anytime, and SMPTE [42]. The second category of metadata does not describe the semantics, but rather the syntax and structure of the multimedia content. This category, for instance, includes languages for describing the bitstream syntax, which in turn have given rise to a wide range of research activities enabling codec-agnostic adaptation engines for scalable contents. Examples of such languages are the Bitstream Syntax Description Language (BSDL), which includes both BSD and gBSD descriptors, introduced by MPEG-21 DIA, BFlavor [43], and XFlavor [44]. Note that MPEG-7 also provides means for describing syntactical aspects of multimedia bitstreams [45].
Both categories of metadata (semantic and syntactic descriptions) have in common that they tend to be designed in increasing detail, as this increases the accessibility of the media content. They often describe the content per segment or even per access unit (AU). AUs are the fundamental units for the transport of media streams and are defined as the smallest data entities which are atomic in time, i.e., to which decoding time stamps can be attached. Note that AUs correspond to audio frames and video pictures as introduced in Section 2.2.
An example of this tendency is that a single violence rating for an entire movie might exclude many potential consumers, even if the movie contains only one or two extremely violent scenes. If the violence rating were provided per scene, however, the problematic scenes could simply be skipped for viewers who are not supposed to see them. In a similar vein, providing highly descriptive metadata for scalable multimedia content (i.e., describing spatial, temporal, and fine-grained scalability) enables the accessibility of the content on as many devices as possible. An example of this are gBSDs, which are considerably larger for highly scalable MPEG-4 SVC content (where three scalability dimensions need to be described) than for content encoded as an MPEG-4 VES (where only a single scalability dimension is described). This increased detail is also very visible when comparing the corresponding RTP payload formats [46][47].
As a consequence, this metadata is often of considerable size, which, even when compression is applied, is problematic in streaming scenarios. That is, transferring entire metadata files before the actual transmission of the media data, if possible at all, could lead to a significant start-up delay. Additionally, there is no information on how this metadata is synchronized with the corresponding media data, which is necessary for streamed (i.e., piece-wise) processing thereof.
The concept of piece-wise (and timed) processing is natural for media data. As introduced in Section 2.2, a video consists of a series of independent pictures which are typically taken by a camera. These pictures are then encoded, typically exploiting the redundancies between them. The resulting AUs can depend on each other (e.g., in the case of bidirectionally encoded pictures) but are still separate samples of data.
Although the characteristics of this content-related metadata are very similar to those of timed multimedia content, no concept of “samples” exists for this metadata today. In the following, we introduce the concept of “samples” for metadata by employing Streaming Instructions for XML-based metadata. Furthermore, Streaming Instructions for the multimedia content are proposed as well, which allow synchronized processing of both media and metadata. The XML Streaming Instructions specify the fragmentation of the content-related metadata into meaningful fragments and their timing. These fragments are referred to as Process Units (PUs), which introduce to content-related metadata the concept of “samples” known from audio-visual content. The Media Streaming Instructions are used to locate AUs in the bitstream and to time them properly. They are specific to XML-based metadata which describes the media content in a linear way, e.g., the gBSD.
Both types of Streaming Instructions enable time-synchronized, piece-wise (i.e., streamed) processing and delivery of media data and its related metadata. Furthermore, the fragmentation mechanism helps to overcome the start-up delay introduced by the size of the metadata. Another, less obvious benefit is that the Streaming Instructions make it possible to extend the existing gBSD-based media adaptation approach to dynamic and distributed use cases such as the one described in Section 3.2. This extension will be addressed in Section 4.5.
It must be noted that MPEG-21 DIA uses the term BSDL to specify both Bitstream Syntax Descriptions (BSDs), which describe the bitstream syntax using codec-specific XML elements and attributes, and generic Bitstream Syntax Descriptions (gBSDs), which describe the bitstream syntax using generic XML elements and attributes. Generally, the mechanisms presented in this thesis are intended to apply to both BSDs and gBSDs; however, we focus on gBSDs to validate them.
The work on the Streaming Instructions was performed in cooperation with Sylvain Devillers in the scope of MPEG standardization and the DANAE EU IST project. In particular, Sylvain contributed to the novel mechanisms for the fragmentation of metadata which are presented in Section 4.3, and it must be noted that the Properties Style Sheet mechanism originated from his ideas.
4.2 Related work
In this section we review related work in the literature that deals with mechanisms enabling streamed processing and transport of multimedia content and related metadata. Multiple mechanisms for specifying the fragmentation and timing of media content are well known, e.g., the sample tables of the ISO Base Media File Format [48]. The difference is that in our approach this information, i.e., both the XML Streaming Instructions and the Media Streaming Instructions, is specified as part of the metadata. This provides a common way for a user to specify the fragmentation and timing of both media and metadata.
MPEG is currently standardizing so-called Multimedia Application Formats, which aim at combining technology from MPEG and other standardization bodies to specify a concrete application, e.g., a photo player or a music player [49]. All these applications employ XML metadata and currently either use it only on a track/movie level or use mechanisms from the ISO Base Media File Format to provide the timing of more fine-granular metadata. However, this requires that the metadata is already fragmented beforehand and that the
metadata is therefore no longer available in its original format for non-streamed processing.
Wong et al. [50] define a method for fragmenting an XML document for optimized transport and consumption, preserving the well-formedness of the fragments. However, what is consumed are not the fragments themselves but rather the document resulting from the aggregation of the fragments. Furthermore, the fragmentation is performed according to the size of the Maximum Transport Unit (MTU) and not based on the semantics of the fragment, i.e., no syntax is provided for a content author to specify which fragments should be consumed at a given time. Alternatively, MPEG-7 provides an encoding method (Binary Format for Metadata) to progressively deliver and consume XML documents in an efficient way [51]. To this end, so-called Fragment Update Units (FUUs) provide means for altering the current XML document by adding or removing elements or attributes. However, MPEG-7 only specifies the syntax of FUUs and their decoding, whereas our work concentrates on the composition of XML fragments. In both cases above, no timing information is provided which would enable the synchronized use of the metadata and the corresponding multimedia content.
The Continuous Media Markup Language (CMML) [52] is an XML-based markup language for time-continuous data similar to MPEG-7. Together with the Annodex exchange format [53] it allows interleaving time-continuous data with CMML markup in a streamable manner. This approach is specific to CMML, whereas in our work we aim to offer a generic solution for time-synchronized, streamed processing and transport of media and related metadata. The Synchronized Multimedia Integration Language (SMIL) [54] provides a timing and synchronization module which can be used to synchronize the play-out of different media streams. However, SMIL is only concerned with media as a whole, and therefore no AU location, fragmentation, or timing for metadata is provided.
The Simple API for XML (SAX) is an event-based API which allows streamed processing of XML [55]. It allows parsing an XML document without loading the complete document into memory, which does help to avoid the start-up delay for streamed processing. However, legacy applications which rely on the Document Object Model (DOM) [56], which requires the complete document to be loaded into memory, would need to be re-implemented. Moreover, no timing or fragmentation information is provided for piece-wise and synchronized processing of media and metadata. Additionally, no concept of samples is supported,
which prevents random access into such a SAX stream.
Our concept is close to a mechanism provided by Scalable Vector Graphics (SVG) [57] to indicate how a document should be progressively rendered: the externalResourcesRequired attribute added to an element specifies that the document should not be rendered until the sub-tree underneath is completely delivered. This mechanism is specific to SVG. In contrast, our method allows isolating a fragment that can be consumed at a given time, but this fragment does not need to contain the previous one. In particular, it is possible to progressively consume a document without ever needing to load the full document into memory, since only one fragment is consumed at a time.
In this chapter we focus on gBSDs. However, gBSDs are not the only means to describe the syntax of a media bitstream. Related mechanisms include XFlavor [44], the MPEG-21 Bitstream Syntax Descriptions (BSDs) [58], BFlavor [59], and gBFlavor [60]. XFlavor is an extension of Flavor [61]. Flavor is specified in MPEG-4 Systems [62] as the Syntactic Description Language (SDL) and makes it possible to automatically generate, from a Flavor document, C++ and Java code to parse the media bitstream on a bit-per-bit basis. XFlavor extends Flavor with the ability to create an XML description, based on the Flavor document, which corresponds to the described media bitstream. Unlike gBSDs or BSDs, this XML description is a bit-per-bit description of the media bitstream, i.e., the original media bitstream is no longer needed to generate a (possibly adapted) version of the media bitstream. This leads to comprehensive XML descriptions which are much larger than the original media bitstream. BSDL also enables the automatic creation of XML descriptions (BSDs) for media bitstreams, but unlike XFlavor it relies on an XML schema (instead of the Flavor document) which describes the structure of the media bitstream in order to create the BSD.
Furthermore, the BSDs are not meant to replace the media bitstream, i.e., they do not describe it in a bit-per-bit fashion, and thus both the original media bitstream and the BSD are needed for processing. A disadvantage of BSDL is that it suffers in execution time due to the need to keep the complete BSD in memory when creating it. BFlavor tries to harmonize XFlavor and BSDL by combining their strengths, i.e., BSDL's smaller description size due to its high-level descriptions and XFlavor's better execution performance. Specifically, it extends XFlavor with the ability to create high-level bitstream syntax descriptions (i.e., BSDs) in a more efficient way than the one taken by BSDL. Finally, gBFlavor represents an extension of BFlavor to support the automatic creation of gBSDs.
To the best of our knowledge, the concept of PUs, and in particular the method we
developed for specifying their composition, processing, and transport in conjunction with media fragments, are original.
4.3 Streaming Instructions
In the following, we first introduce the basic requirements which we identified for the streaming of metadata and related media data:
• The Streaming Instructions need to describe how metadata and/or associated media data should be fragmented into PUs (for metadata) and AUs (for media data), respectively, for processing and/or delivery.
• A PU has to be well-formed (w.r.t. an XML schema) and needs to be consumable and processable as such by a terminal (i.e., no other fragments are needed to consume and process it).
• The Streaming Instructions shall enable assigning a timestamp to a PU and/or an AU indicating the point in time when the fragment shall be available to a terminal for consumption.
• The Streaming Instructions need to provide mechanisms which allow a user to join a streaming session that is in progress. This means that one needs to be able to signal when a PU and/or AU shall be packaged in such a way that random access into the stream is enabled.
• It shall be possible to apply the Streaming Instructions without modifying the original XML document, as there may be use cases where it is not possible or feasible to modify the multimedia content and its metadata, e.g., due to digital rights management issues.
• A Streaming Instructions processor shall work in a memory- and runtime-efficient way.
Consequently, we introduce three different mechanisms to respond to these requirements:
1. The XML Streaming Instructions describe how XML documents shall be fragmented and timed.
2. The Media Streaming Instructions localize AUs in the bitstream and provide related time information.
3. Finally, the Properties Style Sheet provides means to describe all of the above properties in a separate document, rather than directly in the metadata.
The XML and Media Streaming Instructions are defined as properties. The properties are abstract in the sense that they do not appear in the XML document, but augment the element information item in the document infoset [63]. They can be assigned to the metadata by using XML attributes and/or by the Properties Style Sheet. Additionally, an inheritance mechanism is defined for some of these properties: the value of the property is then inherited by all descendant elements until the property is defined with a different value, which then supersedes the inherited value and is itself inherited by the descendants. Lastly, a default value is specified for some of the properties. In the following, we introduce the mechanisms listed above separately and then combine them as they are applied to the scenario in Section 4.5.
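The inheritance rule just described can be sketched as follows; the element names and the choice of `timeScale` as the inherited property are illustrative:

```python
# Sketch of the property inheritance rule described above: a property
# value set on an element is inherited by all descendant elements until
# the property is redefined, which supersedes the inherited value.
# Element and property names are illustrative.
import xml.etree.ElementTree as ET

DOC = """
<root timeScale="90000">
  <scene>
    <shot timeScale="1000">
      <frame/>
    </shot>
    <frame/>
  </scene>
</root>
"""

def resolve(element, prop, inherited=None, out=None):
    """Depth-first traversal computing the effective value of an
    inherited property for every element."""
    if out is None:
        out = []
    value = element.get(prop, inherited)   # local value supersedes inherited
    out.append((element.tag, value))
    for child in element:
        resolve(child, prop, value, out)
    return out

print(resolve(ET.fromstring(DOC), "timeScale"))
```

Note how the second `frame` element, a sibling of `shot`, inherits the root's value of 90000 rather than the 1000 redefined inside `shot`.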
4.3.1 XML Streaming Instructions
The XML Streaming Instructions provide the information required for streaming an XML document by specifying the composition and timing of PUs. They allow, firstly, identifying PUs in an XML document and, secondly, assigning time information to them. A PU is a set of connected XML elements. It is specified by one element, named the anchor element, and by a PU mode indicating how other connected elements are aggregated to this anchor to compose the PU. Depending on the mode, the anchor element is not necessarily the root of the PU. Anchor elements are ordered according to the navigation path of the XML document. PUs may overlap, i.e., some elements (including anchor elements) may belong to several PUs. Additionally, the content provider may require that a given PU be encoded as a random access point, i.e., that the encoded PU (the AU) does not require any other AUs to be decoded.
Figure 4.1 illustrates how an XML document is fragmented and timed using the XML Streaming Instructions. The fragmenter uses as input the XML document to be streamed and a set of XML Streaming Instructions properties, as shown in Table 4.1, provided either internally (as XML attributes within the XMLSI namespace) and/or externally (in a Properties Style Sheet as specified in Section 4.3.3). The output of the fragmenter is a set of timed PUs. The fragmenter parses the XML document in a depth-first order. The XML Streaming Instructions properties are computed as explained below. An element with the anchorElement
Figure 4.1: Processing related to XML Streaming Instructions
Table 4.1: XML Streaming Instructions properties

Name          | Possible Values                                              | Inherited | Default Value
anchorElement | undefined, false, true                                       | no        | undefined
puMode        | undefined, self, ancestors, descendants, ancestorsDescendants, preceding, precedingSiblings, sequential | yes | undefined
encodeAsRAP   | undefined, false, true                                       | yes       | undefined
timeScale     | undefined, an integer value                                  | yes       | undefined
ptsDelta      | undefined, an integer value                                  | yes       | undefined
absTimeScheme | undefined, a string value                                    | yes       | undefined
absTime       | undefined, a string value                                    | no        | undefined
pts           | undefined, an integer value                                  | no        | undefined
property set to true indicates an anchor element and a new PU. The PU then comprises connected elements according to the puMode property of the anchor element. In the following, the XML Streaming Instructions properties are specified for:
• Fragmenting an XML document into PUs.
• Indicating which PUs shall be encoded as random access points.
• Assigning time information (i.e., processing time stamp) to these PUs.
The puMode property specifies how elements are aggregated to the anchor element (identified by the anchorElement property) to compose a PU. Figure 4.2 gives an overview of the different puModes, which were derived by analyzing various types of metadata (as introduced above) and their applications (see Section 4.5 for a detailed description of an example application). The objective was to constrain ourselves to as few puModes as possible, while
Table 4.2: Semantics of different puModes

Name                 | Semantics
self                 | The PU contains only the anchor element.
ancestors            | The PU contains the anchor element and its ancestor stack, i.e., all its ancestor elements.
descendants          | The PU contains the anchor element and its descendant elements.
ancestorsDescendants | The PU contains the anchor element, its ancestor stack, and its descendant elements.
preceding            | The PU contains the anchor element, its descendant and parent elements, and all the preceding-sibling elements of its ancestor elements and their descendants.
precedingSiblings    | The PU contains the anchor element, its descendant and parent elements, and all the preceding-sibling elements (and their descendants) of its ancestor elements.
sequential           | The PU contains the anchor element, its ancestor stack, and all the subsequent elements (descendants, siblings and their ancestors) until a next element is flagged as an anchor element.
still supporting all sensible applications, in order to enable an efficient implementation. The semantics of the different puModes are defined in Table 4.2, assuming that the white node in Figure 4.2 carries an anchorElement property which is set to true.
The encodeAsRAP property is used to signal that the PU should be encoded as a random access point in order to enable random access into an XML stream. The timeScale property provides the number of ticks per second. The ptsDelta property specifies the interval in time ticks after the preceding anchor element. Alternatively, the pts property specifies the absolute time of the anchor element as the number of ticks since the origin.
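Putting anchor identification and tick-based timing together, a minimal fragmenter-timing sketch might look as follows. Attribute names are simplified (no XMLSI namespace), and it is assumed for illustration that the first anchor starts at time zero and each subsequent anchor follows ptsDelta ticks later:

```python
# Sketch of PU timing: anchor elements (marked here by an illustrative
# 'anchorElement' attribute) are taken in document order; each PU's
# presentation time advances by ptsDelta ticks on a timeScale clock.
# Assumes the first anchor is at time zero.
import xml.etree.ElementTree as ET

DOC = """
<metadata timeScale="1000" ptsDelta="40">
  <unit anchorElement="true" id="pu0"/>
  <unit anchorElement="true" id="pu1"/>
  <unit anchorElement="true" id="pu2"/>
</metadata>
"""

def time_pus(doc_xml: str):
    """Return (id, pts_in_seconds) for each anchor element."""
    root = ET.fromstring(doc_xml)
    scale = int(root.get("timeScale"))   # ticks per second (inherited)
    delta = int(root.get("ptsDelta"))    # ticks between anchors (inherited)
    result, pts = [], 0
    for el in root.iter():
        if el.get("anchorElement") == "true":
            result.append((el.get("id"), pts / scale))
            pts += delta
    return result

print(time_pus(DOC))
```

With a timeScale of 1000 and a ptsDelta of 40, the three PUs are spaced 40 ms apart, matching, e.g., a 25 fps gBSD.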
Figure 4.2: Examples of different puModes
Figure 4.3: Processing related to Media Streaming Instructions

The timing cannot only be specified in ticks: the absTime property specifies the absolute time of the anchor element. Its syntax and semantics are specified according to the time scheme used (absTimeScheme property), e.g., NPT, SMPTE, or UTC.
4.3.2 Media Streaming Instructions
The Media Streaming Instructions specify two sets of properties, as shown in Table 4.3, for annotating an XML document. The first set indicates the AUs and their location in the described media bitstream, the random access points, and the subdivision into AU parts. The second set provides the AU time stamps.
Figure 4.3 illustrates how AUs in a bitstream are located and timed using the Media Streaming Instructions. The fragmenter uses as input the bitstream to be streamed and a set of Media Streaming Instructions provided either internally (as attributes) and/or externally (in a Properties Style Sheet). The output of the fragmenter is a set of timed AUs. The fragmenter parses the XML document in a depth-first order. The Media Streaming Instructions properties are computed as specified below. Anchor elements (i.e., elements with the au property set to true) are ordered according to the parsing order, and so are the corresponding AUs. An anchor element indicates the start of an AU, the extent of which is specified by the auMode property. In the following, the Media Streaming Instructions properties are specified for:
• Locating AUs in the bitstream.
• Indicating which AUs shall be encoded as random access points.
• Assigning time information (i.e., processing time stamps) to these AUs.
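Locating AUs from such a linear description can be sketched as follows, under the assumptions of byte addressing, the sequential AU mode, and simplified attribute names:

```python
# Sketch of AU location via Media-Streaming-Instruction-like properties:
# elements with au="true" mark the start of an AU; in the sequential
# mode illustrated here, an AU extends over the subsequent described
# segments until the next anchor. Attribute names are simplified.
import xml.etree.ElementTree as ET

GBSD = """
<gBSD addressUnit="byte">
  <gBSDUnit au="true" start="0" length="6"/>
  <gBSDUnit           start="6" length="2"/>
  <gBSDUnit au="true" start="8" length="4"/>
</gBSD>
"""

def locate_aus(desc_xml: str):
    """Return (start, length) per AU, merging non-anchor units into the
    preceding AU (a simplified sequential auMode)."""
    aus = []
    for unit in ET.fromstring(desc_xml).iter("gBSDUnit"):
        start, length = int(unit.get("start")), int(unit.get("length"))
        if unit.get("au") == "true":
            aus.append([start, length])   # a new AU begins here
        elif aus:
            aus[-1][1] += length          # extend the current AU
    return [tuple(a) for a in aus]

print(locate_aus(GBSD))  # [(0, 8), (8, 4)]
```

The resulting (start, length) pairs are exactly what a packetizer needs to cut the bitstream into transportable AUs without understanding the coding format.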
Table 4.3: Media Streaming Instructions properties

Name         Possible Values           Inherited   Default Value
auMode       tree, sequential          yes         tree
au           undefined, false, true    no          undefined
auPart       undefined, false, true    no          undefined
rap          undefined, false, true    yes         undefined
timeScale    undefined, an integer     yes         undefined
dts          undefined, an integer     no          undefined
cts          undefined, an integer     no          undefined
dtsDelta     undefined, an integer     yes         undefined
ctsOffset    undefined, an integer     yes         undefined
addressUnit  bit, byte                 yes         undefined
start        undefined, an integer     no          undefined
length       undefined, an integer     no          undefined
The Media Streaming Instructions are tailored to metadata which can linearly describe a bitstream at an AU granularity, such as BSD, gBSD [37], BFlavor [43], XFlavor [44] or MPEG-7 MDS [64]. The start of an AU is indicated by an element with an au property set to true. This element is called the anchor element. The Media Streaming Instructions indicate the start and the length of an AU in bits or bytes (depending on the addressUnit property). The extent of the AU depends on the value of the auMode property of the anchor element, as depicted in Figure 4.4 (the white node indicates an element with the au property set to true). In the sequential mode, the AU extends until a new element is found with an au property set to false or true; in the latter case (i.e., a new anchor element), a new access unit immediately follows. If no element with an au property set to true or false is found, the AU extends until the end of the bitstream. In the tree mode, the AU is the bitstream segment described by the XML sub-tree below the element flagged with the au property set to true. AU parts are defined in a similar way. The start of a new AU part in an AU is indicated by an auPart property set to true and the extent is specified by the auMode property. In the sequential mode, the AU part extends until a new element has an auPart property set to false or true (in the latter case, a new AU part follows immediately) or until the end of the AU. In the tree mode, the AU part is the bitstream segment corresponding to the sub-tree below the element flagged by the auPart property. The auPart property provides a way of indicating AU parts within an AU in a coding-format-independent way. In this way, a streaming server that is not aware of the format of the streamed media content
Figure 4.4: Examples of different auModes
may nevertheless meet the requirements of a specific RTP payload format, e.g., special fragmentation rules. Other information about AUs is specified by the properties of the anchor element. In particular, the AU is a random access point if the rap property of the anchor element is set to true. The rap property is inheritable; it is therefore possible to apply this property to each AU (i.e., to make each AU a RAP) by setting the rap property of the XML root element to true. The time information of the AU (CTS and DTS) is also specified by the properties of the anchor element, as explained below. The Media Streaming Instructions use an absolute and a relative mode for specifying time information. In absolute mode, the CTS and DTS of an AU are specified independently of other AUs. In relative mode, the CTS and DTS are calculated relative to the CTS and DTS of the previous AU. Both modes can be used in the same document. For example, an absolute date can be applied to a given AU, and the CTS and DTS of the following AUs are calculated relative to this AU. In both modes, CTS and DTS conform to a time scale, i.e., they are specified as a number of ticks. The duration of a tick is given by the time scale, which indicates the number of ticks per second and thus allows for fine-granular timing of AUs. The time scale is specified by the timeScale property. The two properties cts and dts define the CTS and DTS of the AU, expressed as an integer number of ticks. They are not inheritable and may be applied to an anchor element for specifying the CTS and DTS of the corresponding AU. Alternatively, two properties named dtsDelta and ctsOffset allow calculating the DTS and CTS of the AU relative to the previous AU. The dtsDelta property indicates the time interval in ticks between the current AU and the previous one. The ctsOffset property indicates the time interval in ticks between the DTS and the CTS of the current AU.
Some media codecs do not require CTS information. In this case, the cts and ctsOffset properties are not used and may remain undefined.
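The AU-location rules above can be sketched as follows (a simplified illustration: the attribute names au, auMode, start and length follow Table 4.3, but the toy document structure and the assumption that every addressed element carries byte-based start/length attributes are ours):

```python
import xml.etree.ElementTree as ET

def locate_aus(root):
    """Locate access units in a bitstream described by an annotated BSD.

    Returns a list of (start, end) byte ranges, one per AU, following the
    tree and sequential auMode semantics described in the text.
    """
    elems = list(root.iter())  # document order == depth-first parsing order
    aus = []
    for i, el in enumerate(elems):
        if el.get("au") != "true":
            continue
        start = int(el.get("start"))
        mode = el.get("auMode", "tree")
        if mode == "tree":
            # AU covers the bitstream segment described by the anchor's sub-tree
            end = max(int(e.get("start")) + int(e.get("length"))
                      for e in el.iter() if e.get("start") is not None)
        else:  # sequential: until the next element carrying an au property
            end = None
            for nxt in elems[i + 1:]:
                if nxt.get("au") in ("true", "false"):
                    end = int(nxt.get("start"))
                    break
            if end is None:  # no further anchor: AU extends to end of bitstream
                end = max(int(e.get("start")) + int(e.get("length"))
                          for e in elems if e.get("start") is not None)
        aus.append((start, end))
    return aus
```

In tree mode the range is derived from the deepest descendants of the anchor, while in sequential mode it ends where the next flagged element begins.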
For each AU anchor element, the properties of the corresponding AU are then calculated as follows, where dts(n), cts(n), timeScale(n), dtsDelta(n), ctsOffset(n) and rap(n) represent the Media Streaming Instructions properties of the nth anchor element, and DTS(n), CTS(n), TIME_SCALE(n), DTS_DELTA(n) and RAP(n) represent the properties of the associated nth access unit:

if isPresent(dts(n)) {
    DTS(n) = dts(n);
} else {
    if n == 0 {  // i.e., first AU
        DTS(n) = 0;
    } else {
        DTS(n) = ((DTS(n-1) + DTS_DELTA(n-1)) / TIME_SCALE(n-1)) * TIME_SCALE(n);
    }
}
if isPresent(cts(n)) {
    CTS(n) = cts(n);
} else {
    CTS(n) = DTS(n) + ctsOffset(n);
}
TIME_SCALE(n) = timeScale(n);
DTS_DELTA(n) = dtsDelta(n);
RAP(n) = rap(n);
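The calculation above can be written out, for illustration, as follows (the dict representation of the anchor-element properties is our own; the property names follow Table 4.3):

```python
def resolve_timestamps(anchors):
    """Resolve DTS/CTS for a sequence of anchor elements, following the
    calculation rules above. Each anchor is a dict of Media Streaming
    Instructions properties (dts, cts, dtsDelta, ctsOffset, timeScale, rap).
    """
    aus = []
    for n, a in enumerate(anchors):
        time_scale = a["timeScale"]
        if "dts" in a:
            dts = a["dts"]
        elif n == 0:  # first AU
            dts = 0
        else:
            prev = aus[n - 1]
            # previous DTS plus its delta, rescaled to the current time scale
            dts = ((prev["DTS"] + prev["DTS_DELTA"]) / prev["TIME_SCALE"]) * time_scale
        cts = a["cts"] if "cts" in a else dts + a.get("ctsOffset", 0)
        aus.append({"DTS": dts, "CTS": cts, "TIME_SCALE": time_scale,
                    "DTS_DELTA": a.get("dtsDelta", 0), "RAP": a.get("rap", False)})
    return aus
```

Note how the relative mode chains each AU to its predecessor via dtsDelta, while an explicit dts re-anchors the timeline absolutely.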
An example of a gBSD including XML and Media Streaming Instructions can be found in Section 4.5.2.
4.3.3 Properties Style Sheet
It is also possible to specify the XML and Media Streaming Instructions properties without adding XML attributes to the original document. This is particularly useful when associated Digital Rights Management (DRM) information forbids editing the original document and/or when the properties are set according to a regular pattern, as this reduces the overhead introduced by the Streaming Instructions. It also eases the management of multiple media contents (and their related metadata) which are fragmented and timed in the same way. Then, instead of annotating each XML document, a single Properties Style Sheet can be used. This external document specifies a set of properties which should be set for all elements matching a given pattern. For expressing such patterns, we introduce a new expression language named Lightweight Expression language (LXPath), based on STXPath. STXPath is an expression language developed in the context of STX (Streaming Transformations for XML) [35], a transformation language enabling the streamed transformation
of an XML document, i.e., without building a tree in memory. The syntax of STXPath is similar to XPath [65], but its semantics differ. Whereas an XPath expression is resolved against the full document, an STXPath expression is resolved against a limited context consisting of the current element, the stack of its ancestors and its position among its siblings. For example, in XPath, the expression /node1/node2 returns a sequence containing all node2 elements whose parent element is the document element and is named node1. In LXPath, on the contrary, the same expression returns a sequence containing a single node from this node set: the one which is an ancestor of the current node. The use of STXPath expressions as matching patterns enables filtering an XML document without loading the full tree into memory, and is suitable for efficient SAX-based architectures. In our approach, we define a limited subset of STXPath required for locating elements in an efficient and simple way.

Listing 4.1: Properties Style Sheet XML schema
MatchPattern       ::= BoolExpr
BoolExpr           ::= Expression ("|" Expression)*
Expression         ::= ("/" | "//")? PathStep (("/" | "//") PathStep)*
PathStep           ::= (QName | WildCard) Predicate*
WildCard           ::= "*" | ("*" ":" NCName) | (NCName ":" "*")
Predicate          ::= "[" PredicateExpr "]"
PredicateExpr      ::= OrExpr
OrExpr             ::= AndExpr (("or" | "|") AndExpr)*
AndExpr            ::= ComparisonExpr ("and" ComparisonExpr)*
ComparisonExpr     ::= AdditiveExpr (GeneralComp AdditiveExpr)?
GeneralComp        ::= "=" | "!=" | "="
AdditiveExpr       ::= MultiplicativeExpr (("+" | "-") MultiplicativeExpr)*
MultiplicativeExpr ::= PrimaryExpr (("*" | "div" | "idiv" | "mod") PrimaryExpr)*
PrimaryExpr        ::= AttrExpr | Function | StringLiteral | NumericLiteral
AttrExpr           ::= "@" NCName
Function           ::= "position()"
StringLiteral      ::= "'" Char* "'"
NumericLiteral     ::= IntegerLiteral | DecimalLiteral
IntegerLiteral     ::= ("-" | "+")? Digits
DecimalLiteral     ::= ("-" | "+")? (("." Digits) | (Digits "." [0-9]*))
Digits             ::= [0-9]+
NCName             ::= [http://www.w3.org/TR/REC-xml-names/#NT-NCName]
QName              ::= [http://www.w3.org/TR/REC-xml-names/#NT-QName]
Char               ::= [http://www.w3.org/TR/REC-xml/#NT-Char]
Listing 4.2: Grammar for LXPath in EBNF notation
As shown in Listing 4.1, the Properties Style Sheet consists of a sequence of templates, each specified by a matching pattern expressed in LXPath and containing a list of properties defined by a qualified name and a value. The Properties Style Sheet and LXPath are designed in such a way that properties can be applied on-the-fly in a SAX-based architecture. While parsing the original document with a SAX parser, each new element is matched against each of the templates, and the corresponding properties are set accordingly. The complete grammar of LXPath is shown in Listing 4.2, specified in Extended Backus-Naur Form (EBNF) notation with MatchPattern as the entry point. An example of a Properties Style Sheet can be found in Section 4.5.2.
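For illustration, matching a (much simplified) subset of such patterns against the current SAX context might look as follows. Only plain paths without predicates or wildcards are handled, so this is a sketch of the matching idea, not an LXPath implementation:

```python
def matches(pattern, ancestor_stack):
    """Check a simplified LXPath match pattern against the current element
    context: the stack of element names from the document element down to
    the current element (as maintained by a SAX handler).

    Handles absolute ('/a/b'), descendant ('//b') and relative ('b')
    paths only -- a small illustrative subset of the grammar in Listing 4.2.
    """
    if pattern.startswith("//"):
        steps = pattern[2:].split("/")
        # descendant path: must match a suffix of the stack
        return ancestor_stack[-len(steps):] == steps
    if pattern.startswith("/"):
        steps = pattern[1:].split("/")
        # absolute path: must match the full stack from the document element
        return ancestor_stack == steps
    steps = pattern.split("/")
    return ancestor_stack[-len(steps):] == steps
```

Because only the ancestor stack is consulted, the check runs in constant memory per element, which is exactly what makes the style-sheet approach compatible with streamed, SAX-based processing.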
4.3.4 Summary
In this section we introduced Streaming Instructions for fragmenting content-related metadata into metadata samples, associating the media segments and metadata samples with each other, and streaming and processing them in a synchronized manner. The Streaming Instructions extend an XML metadata document with additional attributes which describe the fragmentation and timing of media data and XML metadata, so as to enable their synchronized delivery and processing. In addition, a style sheet approach provides the opportunity to dynamically set such streaming properties without actually modifying the metadata. It must be noted that while the Streaming Instructions are intended to apply to XML documents in general, their design focus was on gBSD-based adaptation. In particular, the introduced PU modes (but also the AU modes) might not provide the required fragmentation for other kinds of metadata and/or application scenarios; additional modes and thus fragmentation/composition rules may need to be introduced in such cases. Additionally, the Streaming Instructions add overhead on top of the existing metadata overhead, which might not always be desired. A custom implementation can avoid this overhead at the cost of additional implementation complexity. Moreover, the Media Streaming Instructions are restricted to XML-based metadata which describes the media content in a linear way and at a sufficiently fine level of granularity. For example, if the metadata describes a media content at scene level, no fragmentation at a finer level (e.g., GoPs or AUs) is possible.
4.4 Synchronized storage, processing and transport of Process Units and media fragments
4.4.1 Introduction
In Section 4.3 we described how the Streaming Instructions can be used to facilitate the fragmentation of content-related metadata and the association of media and metadata fragments with each other. However, the Streaming Instructions themselves do not define a protocol for transport or storage of the resulting Process Units, similarly to how a video codec does not define how the pictures are transported towards terminals. In this section we therefore analyze and extend existing transport and storage mechanisms for this purpose. In particular, we consider the Real-time Transport Protocol (RTP) [66] and the ISO Base Media File Format (IBMFF) [48].
4.4.2 Evaluation of media and metadata transport mechanisms

4.4.2.1 Introduction
In many use cases streaming of media resources such as audio/video content is required, which is facilitated by the Real-time Transport Protocol (RTP). Due to its size, it is infeasible to transport the content-related metadata in one big chunk. Thus, it should be transported using RTP as well, e.g., to exploit the synchronization mechanisms offered by RTP. There are three possibilities for transporting these different metadata assets together with the actual media content, which are evaluated in the following:
• One combined stream containing media and metadata.
• One metadata stream and one media data stream.
• Multiple metadata streams (one for each type of metadata) and one media data stream.

4.4.2.2 One combined stream containing media and metadata
Some RTP payload formats, such as the RTP payload format for transport of MPEG-4 Elementary Streams (RFC 3640), provide means for including arbitrary data, e.g., metadata, within the auxiliary header [46].
The advantages of this approach are that it is straightforward to implement (all the requisites are already specified) and that there is little processing and bandwidth overhead because there is only one stream to handle. Moreover, no synchronization between different streams is necessary, which reduces complexity. However, this approach also has several disadvantages. The first one is based on the assumption that metadata is more valuable than the media data. Metadata, for example the gBSD, may describe many media access units (AUs) in a single Process Unit (PU), and therefore a large segment of the stream would be affected if such a PU were lost. This issue raises the need for reliable transport mechanisms for PUs. While re-transmission can be used to fulfill this requirement, it results in the need for large buffers, which increases the start-up delay. A better solution would be to reserve enough bandwidth for the metadata stream in advance, in order to avoid packet loss caused by oversaturated network links. However, this is not possible with the approach of a combined stream. Another - maybe the biggest - disadvantage is that the combined stream approach depends on a specific payload format (e.g., RFC 3640) which provides the auxiliary header section where the metadata can be transported. Other payload formats might not provide such an auxiliary section. That is, by following the combined stream approach one would create a solution which is limited to a specific type of resource and thus in conflict with the codec-agnostic adaptation method. Moreover, while one saves processing overhead by having only one stream, there is some additional overhead due to the necessary (de)multiplexing of the media data and the metadata.
4.4.2.3 One metadata stream and one media data stream
In this scenario, the different types of metadata are multiplexed into one metadata stream. The advantages of this approach are that it allows the metadata stream to be treated differently from the media stream. This makes it possible to reserve sufficient bandwidth for the metadata stream in order to avoid packet loss due to oversaturated network links. Additionally, the transport of metadata is no longer bound to a specific media payload format. The disadvantages of this approach are the additional processing overhead caused by the (de)multiplexing of the metadata, the additional bandwidth overhead due to the second RTP stream and the necessary synchronization between the two streams, which increases the complexity of the implementation.
Table 4.4: Advantages and disadvantages of different metadata transport mechanisms

                           One Combined    One Metadata    Multiple Metadata
                           Stream          Stream          Streams
(De-)Multiplexing efforts  High            Medium          Low
(De-)Packetizing efforts   Low             Medium          High
Number of streams          1               2               3+
Transport overhead         Low             Medium          High
Processing overhead        High            Medium          Low
Synchronization efforts    Low             Medium          High
Interoperability issues    Yes             No              No
Protection flexibility     Low             Medium          High
Asynchronous transport     No              Yes (limited)   Yes
Scalability                Yes (limited)   Yes (limited)   Yes

4.4.2.4 Multiple metadata streams (one for each type of metadata) and one media data stream
In this scenario, separate streams for each type of metadata are used, for example, one for the gBSD and one for the AQoS. In addition to the advantages listed for the approach in Section 4.4.2.3, this mechanism offers the possibility of handling each kind of metadata by specialized adaptation nodes, e.g., an adaptation node with dedicated hardware for XML processing, thus making this a very scalable solution. This is also possible for the other approaches, by de-multiplexing the stream(s) and then sending each type of metadata to the specialized adaptation nodes for processing; this would, however, introduce additional delay into the streaming chain. Another advantage lies in the possibility to send the AQoS and the gBSD slightly in advance (asynchronous transport), so that they can be processed by the adaptation node before the media data arrives and is adapted. This results in lower start-up delay, as the adaptation node can use its resources more efficiently. The final advantage is that no metadata (de-)multiplexing is needed. These advantages come at the price of high bandwidth and packetizing overhead due to the multiple media and metadata streams. Moreover, synchronization of these three streams is needed, which leads to additional complexity. Table 4.4 summarizes the advantages and disadvantages of each of these possibilities. In conclusion, we will concentrate on the third option in the remainder of this thesis.
4.4.2.5 Summary
In this section we introduced and compared different means to transport content-related metadata. We concluded that for our codec-agnostic adaptation concept, separate metadata streams shall be used, transported using a generic RTP payload format. It must, however, be noted that we focused on gBSDs for this evaluation and that our results are therefore only valid for gBSDs.
4.4.3 Transport formats and strategies

4.4.3.1 Introduction
In this section, we will first concentrate on the transport format and subsequently provide an example of it. Following the discussions in Section 4.4.2, we consider separate RTP streams for the transport of the content-related metadata. We investigate the RTP payload format for content-related metadata: first the payload format itself, and then the header fields which are used to signal information about the payload.
4.4.3.2 Payload format options
The approach described in Section 4.3 requires the fragmentation of the content-related metadata into Process Units (PUs). We have identified three options for the payload format of such PUs. One can transport them (1) in plain text, (2) compressed using a generic or an XML-aware compression algorithm, or (3) compressed with the MPEG-7 Binary Format for Metadata (BiM), which allows streaming of XML-based data as described in [5]. While complete PUs would be transmitted in the first two cases, BiM makes it possible to signal a complete PU only once and subsequently just the nodes which changed, together with information on how and where to include them in the previously sent complete PU. This allows transporting only the information which changed compared to the last PU and thus helps to save bandwidth. The transport encoding of the PU is signaled using the payload type in the RTP header. Independent of the encoding, each RTP packet may contain a fragment of a PU, a complete PU or many PUs.
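The trade-off between options (1) and (2) can be illustrated with a generic compressor (zlib here, standing in for any generic algorithm; the sample PU content is made up for this sketch):

```python
import zlib

# A shortened, repetitive gBSD-like Process Unit in plain text
# (the element and attribute names are illustrative, not normative):
pu = b'<gBSDUnit start="0" length="1024" marker="SpatialLayer0"/>' * 20

plain_size = len(pu)                    # option (1): plain text
deflated_size = len(zlib.compress(pu))  # option (2): generic compression
print(plain_size, deflated_size)
```

XML descriptions are highly repetitive, so even a generic compressor reduces the payload substantially; an XML-aware scheme such as BiM additionally exploits the schema and supports the incremental updates described above.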
4.4.3.3 Using RTP for transporting Process Units
Generally, the RTP header fields defined in RFC 3550 [66] and shown in Listing 4.3 are used. The following fields have particular semantics:
• The marker is a one-bit flag which signals PU boundaries in the packet stream. That is, if it is set to 1, the packet includes either a complete PU or the final fragment of a PU.
• The payload type is a 7-bit field which signals the PU and its particular encoding.
• The timestamp is a 32-bit field which corresponds to the processing time stamp of the PU.

Listing 4.3: RTP header (adopted from [66])

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|V=2|P|X|  CC   |M|     PT      |        sequence number        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                           timestamp                           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           synchronization source (SSRC) identifier            |
+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
|            contributing source (CSRC) identifiers             |
|                             ....                              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
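For illustration, packing the fixed RTP header with these semantics might look as follows (a minimal sketch without CSRC entries or header extensions):

```python
import struct

def rtp_header(payload_type, seq, timestamp, ssrc, marker):
    """Pack a fixed 12-byte RTP header (RFC 3550) as used for PU transport:
    the marker bit flags a complete PU or its final fragment, the payload
    type identifies the PU encoding, and the timestamp carries the PU's
    processing time stamp."""
    byte0 = 2 << 6                                    # V=2, P=0, X=0, CC=0
    byte1 = (int(marker) << 7) | (payload_type & 0x7F)  # M bit + 7-bit PT
    return struct.pack("!BBHII", byte0, byte1, seq, timestamp, ssrc)
```

A receiver can then reassemble a PU by concatenating payloads until it sees a packet with the marker bit set.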
Figure 4.5 shows how the Streaming Instructions attributes are mapped to the RTP header fields for the packet(s) which contain the PU and the packet(s) that contain the media AU which is described by this PU. The figure shows the usage of RFC 3550 (i.e., RTP without a special payload format) for the PUs and RFC 3640 [46] (i.e., the RTP payload format for transport of MPEG-4 elementary streams) for the media content. Generally the XML Streaming Instructions attributes are mapped to the RTP header of the metadata packet (i.e., RFC 3550 in this case) and the Media Streaming Instructions attributes are mapped to the RTP header of the media packet (i.e., RFC 3640 in this case). The pts attribute serves as a timestamp for the metadata packet and the cts serves as a timestamp for the media packet. This enables synchronization of the media and metadata streams by comparing the timestamps of the packets which belong to the individual streams. Additionally, the cts and dts attributes are used to compute the CTS-delta and DTS-delta fields of the media packet. Finally, the rap attribute is used to set the RAP-flag of the
media packet accordingly.

Figure 4.5: Mapping of Streaming Instructions attributes to RTP header fields

Note that the encodeAsRAP attribute signals that the PU which describes this media AU should also be encoded as a random access point (RAP), which is an input to the encoding rather than the packetization process (cf. Section 4.5). As described above, advanced compression/encoding mechanisms (such as BiM) support incremental updates of sequential XML fragments (e.g., Process Units). In this case the encodeAsRAP attribute can be used to signal BiM to encode the PU as a RAP (i.e., without considering previous PUs) and thus synchronize media and metadata RAPs. This enables random access into the media/metadata streams. We provide more details on this mechanism in [5]. Note that the figure shows only a minimal PU which contains only the mandatory attributes and the Streaming Instructions attributes.

4.4.3.4 Summary
In this section we specified how gBSD PUs can be transported, in synchronization with the media which they describe, using the generic RTP payload format. It must be noted that this solution comes with certain limitations. The requirement to have a separate metadata stream for each kind of metadata increases (de-)packetization and synchronization efforts depending on the number of additional metadata streams. Moreover, the transport protocol overhead is increased, which can be rather expensive, in particular
if the metadata packets are very small. By relying on the RTP timestamps for the synchronization of metadata and media packets, this solution depends on RTP as a transport protocol. A more generic solution which does not depend on a specific transport protocol would be desirable.

Figure 4.6: Relationship between the different file formats (adopted from [2])
4.4.4 Storage of Process Units

4.4.4.1 Introduction
MPEG specifies several file formats for the storage of timed media. The MP4 file format [67] is specified for MPEG-4 audio, scenes and video. The AVC file format [68] is standardized for the particular requirements of MPEG-4 Advanced Video Coding (AVC). MPEG-21 also specifies a file format [69], specifically for the storage of Digital Items. Moreover, there are also file formats for (Motion) JPEG-2000 and 3GP [70]. All of these file formats have in common that they are derived from the ISO Base Media File Format (IBMFF) [48], as illustrated in Figure 4.6, i.e., the IBMFF includes all base features which may be needed by derived file formats. In this section we will first introduce the necessary core concepts of the IBMFF and subsequently detail our enhancements for the storage of PUs.
4.4.4.2 ISO Base Media File Format
The IBMFF does not encapsulate/frame the described file, i.e., the described media streams are not modified by the file format. That means that the media streams are external to the IBMFF and the byte addresses of the individual media segments which are described are referenced from the IBMFF. In contrast to this, other file formats change the media streams by inserting file format headers into the media stream. The IBMFF consists of boxes with a type and length which describe the file in order to meet the IBMFF’s target applications, which are (as described in [2]): • capture, • exchange and download, including incremental download and play, • local playback, • editing, composition, and • streaming from streaming servers. An ISO Base Media File (IBMF) describes the media streams on different levels of granularity, i.e., movie, track and sample level as illustrated for the video stream in Figure 4.7. On movie level, the overall presentation, i.e., all the media streams are described. The Movie Header Box therefore includes, among others, the creation time, modification time, timescale and duration of the presentation. The IBMFF also supports metadata descriptions on movie level, which contain descriptive or annotative metadata, e.g., an MPEG-21 Digital Item. On track level, usually a single media stream is described, e.g., its creation time, modification time, duration and so on. Each track also has a unique ID and a handler. The handler specifies the nature of the track, e.g., video or audio. Additionally one can specify references between tracks, e.g., when one track serves as a description of the referenced track. Finally, on sample level, one or more samples, i.e., video pictures or audio frames, are described. This includes the timing and location in the media stream. Moreover, for samples of a video track, e.g., the spatial resolution is provided and for audio samples the number of channels (mono, stereo), sample size, sample rate and so on are provided. 
Finally, the coding name, i.e., the decompressor, is uniquely identified through a registered identifier.
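The box structure underlying the IBMFF can be illustrated with a minimal parser sketch (basic 32-bit size/type box headers only; large-size and to-end-of-file boxes are ignored for brevity, and the toy box types below are merely examples):

```python
import io
import struct

def parse_boxes(stream):
    """Iterate over the (type, size) box structure of an ISO Base Media
    File. Each box starts with a 32-bit size and a 4-character type;
    the payload is skipped. A minimal sketch, not a complete parser."""
    boxes = []
    while True:
        header = stream.read(8)
        if len(header) < 8:
            break
        size, box_type = struct.unpack("!I4s", header)
        boxes.append((box_type.decode("ascii"), size))
        stream.seek(size - 8, 1)  # skip the box payload
    return boxes

# A toy file with an 8-byte 'ftyp' box and a 16-byte 'moov' box:
data = (struct.pack("!I4s", 8, b"ftyp")
        + struct.pack("!I4s", 16, b"moov") + b"\x00" * 8)
```

Because every box is self-describing in this way, a reader can skip boxes (and entire box hierarchies) it does not understand, which is what makes the format extensible by amendments such as the one described in the next section.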
4.4.4.3 Storage of Process Units in the ISO Base Media File Format
Process Units represent descriptive metadata and would therefore belong, at least at first sight, in the metadata box on movie level. However, PUs actually share distinctive properties with media streams, i.e.:
• They are timed.
• They represent a sequence of samples.
• They require fine-granular synchronization with the described media stream (similarly to how an audio stream is synchronized with a video stream).
For these reasons it is more practical to treat PUs in the same way as media streams. That is, they should be described as a timed metadata track which is flagged as describing a referenced media track, as shown in Figure 4.7. Our original approach was to extend the MPEG-21 file format to specifically allow the storage of gBSD PUs, as described in [71]. However, when proposing this idea to MPEG, it quickly became apparent that the proposed mechanism was of more general interest than just for storing timed gBSD metadata. Therefore, the original mechanism was generalized in order to be applicable to any type of XML-based and text-based timed metadata [72] and proposed as an extension of the IBMFF, which eventually resulted in an extension of this standard [7]. In order to accomplish this, we amended the IBMFF as follows. We added a dedicated handler type for timed metadata; the IBMFF now includes, among others, the following handlers:
• vide =⇒ Video track
• soun =⇒ Audio track
• hint =⇒ Hint track
• meta =⇒ Timed Metadata track
We specify that meta is a general handler for metadata streams of any type. The specific format is identified by the sample entry, as for video or audio, for example. If the samples are in text, then a MIME format is supplied to document their format. If the samples are
(MIME Media Types: http://www.iana.org/assignments/media-types/)
in XML, each sample is a complete XML document (as is the case for PUs), and the XML namespace is also supplied. Metadata tracks are linked to the track they describe using a track-reference of type cdsc.

Figure 4.7: Timed metadata in the IBMFF

Listing 4.4 shows the updated specification of the sample description box which corresponds to our amendment. As shown, if the handler of the track is of type meta, a MetadataSampleEntry() is expected.

Listing 4.4: Sample Description Box

aligned(8) class SampleDescriptionBox (unsigned int(32) handler_type)
    extends FullBox('stsd', 0, 0) {
    int i;
    unsigned int(32) entry_count;
    for (i = 1; i <= entry_count; i++) {
        switch (handler_type) {
            case 'soun': AudioSampleEntry();
            case 'vide': VisualSampleEntry();
            case 'hint': HintSampleEntry();
            case 'meta': MetadataSampleEntry();
        }
    }
}
Alternatively, the Properties Style Sheet shown in Listing 4.10 provides the Streaming Instructions externally, without changing the gBSD itself. As specified in Section 4.3, the Properties Style Sheet consists of a sequence of templates, each specified by a matching pattern expressed in LXPath and containing a list of properties defined by a qualified name and a value. This Properties Style Sheet sets the same attributes as shown in the example in Listing 4.9.

Listing 4.10: Example of Properties Style Sheet
Listing 4.11: First PU resulting from processing the gBSD in Listing 4.9
4.5.3 Adaptation proxy implementation
The novel mechanisms for metadata fragmentation and synchronized storage and transport enabled us to implement a codec-agnostic dynamic and distributed adaptation framework in the scope of the DANAE EU IST project. Please refer to [14] for an overview of this project on “Dynamic and Distributed Adaptation of scalable multimedia coNtent in a context-Aware Environment” (DANAE) and to [16] for details on the codec-agnostic dynamic and distributed adaptation framework, some of which we include here for completeness. The adaptation node in this framework corresponds to the application-aware adaptation nodes introduced in Section 2.4.4. Figure 4.9 shows the codec-agnostic adaptation framework which was implemented in the scope of DANAE. It has been tested for codec-agnostic adaptation of
• MPEG-4 SVC (see Section 2.2.5)
• MPEG-4 BSAC (see Section 2.2.5)
• Embedded Zero Block Coding (EZBC) (see Section 2.2.3)
The actual adaptation mechanism corresponds to the elaborations in Chapter 3 (i.e., normative gBSD-based adaptation which uses XSLT for the transformation process), including several of the enhancements described in this chapter, i.e.:
• Fragmentation and timing of metadata based on XML Streaming Instructions.
• Extraction and timing of media AUs based on Media Streaming Instructions.
• Synchronized processing of Process Units and media AUs based on Streaming Instructions.
• Synchronized transport based on the Real-time Transport Protocol (RTP) as described in Section 4.4. That is, RFC 3550 is used for BiM-compressed PUs and RFC 3640 is used for media AUs.
Additionally, the architecture relies on RTSP/SDP [78], [79] for session setup. It also supports the Real-time Control Protocol (RTCP), which serves as feedback on the current
Figure 4.9: DANAE adaptation architecture

usage environment for the adaptation proxy. A walkthrough for this framework therefore looks as follows, given that “adaptation proxy” refers to the adaptation node closer to the terminal and “adaptation server” refers to the adaptation node which has the actual content available.
1. The user browses a Website and chooses to consume a specific media content.
2. The terminal sends an RTSP DESCRIBE request for this content, including its static usage environment information, to the adaptation proxy.
3. The adaptation proxy is initialized and forwards this RTSP request to the content selector on the adaptation server.
4. The content selector on the adaptation server 1) selects a specific content variation based on the static usage environment information and 2) initializes its adaptation engine with the static usage environment information, since the chosen content variation might not be an exact match for the terminal’s usage environment.
5. The content selector on the adaptation server computes an SDP for the chosen content variation and sends it to the adaptation proxy. The SDP includes the description of the media stream(s) plus all the streams that will contain metadata needed for the adaptation process.
6. The adaptation proxy strips the description of the metadata streams from the SDP, as they are not needed on the terminal, and forwards it to the terminal.
7. The terminal sends an RTSP SETUP request for each stream described in the SDP, followed by an aggregated RTSP PLAY request, to the adaptation proxy (media streams only).
8. The adaptation proxy sends an RTSP SETUP request for each stream described in the SDP, followed by an aggregated RTSP PLAY request, to the adaptation server (media and metadata streams).
9. The adaptation server performs codec-agnostic dynamic adaptation as described in Section 4.5.1 and sends the adapted media AUs and their PUs towards the adaptation proxy.
10. The adaptation proxy performs another adaptation step as described in Section 4.5.1, but without the metadata fragmentation and media AU extraction, and sends the adapted media towards the terminal.
11. The terminal displays the adapted media while the usage environment is continuously updated (e.g., through RTCP) to the adaptation proxy.
Note that in this implementation the adaptation granularity is as follows. For SVC/EZBC the adaptation granularity is at GoP level and thus each gBSD PU describes a complete GoP. For BSAC the adaptation granularity is at frame level and thus each gBSD PU describes a single frame. The adaptation proxy also supports stream replication as described in Section 3.2, in which case the steps above are repeated, but without interaction with the adaptation server; rather, the existing stream is replicated. Both the adaptation proxy and the adaptation server are based on the Darwin Streaming Server5.
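Step 6 of the walkthrough above can be sketched in a few lines. The following Python fragment is purely illustrative (the DANAE nodes are built on the Darwin Streaming Server, not Python), and it assumes that the metadata streams are announced in the SDP with the media type `application`, which is our assumption rather than something the text specifies:

```python
def strip_metadata_streams(sdp: str, metadata_media_type: str = "application") -> str:
    """Remove SDP media sections ("m=" blocks) that describe metadata streams.

    Assumption: metadata streams are announced with the given media type
    (e.g. "m=application ..."); real deployments may mark them differently.
    """
    out, skipping = [], False
    for line in sdp.splitlines():
        if line.startswith("m="):
            # A new media section starts; decide whether to keep it.
            skipping = line[2:].split()[0] == metadata_media_type
        if not skipping:
            out.append(line)
    return "\n".join(out)
```

The session-level part of the SDP is always kept; only whole `m=` sections (the `m=` line plus its attribute lines) describing metadata are dropped before the SDP is forwarded to the terminal.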
4.5.4
Intercepting adaptation proxy implementation
The implementation described in Section 4.5.3 represents an adaptation server and an adaptation proxy with complete session handling (RTSP/SDP), RTCP support, usage environment feedback and terminal awareness. As such it corresponds to an application-aware
Darwin Streaming Server, http://developer.apple.com/opensource/server/streaming/
adaptation node as introduced in Section 2.4.4. This demonstrated the feasibility of implementing such an adaptation node and enabled us to measure its performance as described in Section 6.6. While these evaluations show that this adaptation node is able to process several media streams concurrently, it is also obvious that it cannot compete with existing, codec-specific adaptation nodes with regard to performance. This is due to the fact that the overall implementation was 1) complex, due to the requirement to act as a regular proxy, and 2) a prototype which was not optimized for performance. For this reason, we re-designed and re-implemented this adaptation node based on the concept of an intercepting adaptation proxy, which corresponds to the adaptation node introduced in Section 2.4.3. The motivation behind this was not only to show the feasibility of doing so, but also to measure the performance of such an adaptation proxy. Since the intercepting adaptation proxy is much less complex (e.g., it is transparent and therefore does not have to act as a client/server), this also enables us to measure the actual cost of the gBSD-based adaptation mechanism more precisely.
4.5.4.1
Linux mechanisms for intercepting (media) packets
The intercepting adaptation proxy leverages Linux firewall mechanisms to capture packets and forward adapted packets (or drop them). The system relies on packet filtering mechanisms which have been present in Linux kernels since version 1.1. As depicted in Figure 4.10, each packet passes through one or more chains as it traverses the Linux kernel, i.e., INPUT, FORWARD and OUTPUT. A chain is a filter which consists of different rules that specify how a packet shall be processed. If a packet is intended for a local application, it is routed through the INPUT chain and is processed by a local process. This local process can also create packets, which first traverse the OUTPUT chain before they are routed towards their destination. If a packet is not intended for a local process, it only traverses the FORWARD chain before it is routed towards its destination. As mentioned above, rules can be defined for each chain. These rules are based on source address, destination address, protocol and several other characteristics of the current packet. Based on these characteristics the rules define how a packet shall be processed, e.g., to drop a packet (specified by the DROP target) or to accept it (specified by the ACCEPT target). A target is the part of a rule which indicates how a packet is processed and is specified using the -j option as shown in Listing 4.12. This listing shows an example of an IP Tables rule that specifies to drop all UDP packets with destination port 222 which are intended
for a local application (INPUT chain).

Figure 4.10: IP Tables overview

Listing 4.12: Example of an IP Tables rule
iptables -A INPUT -p udp --dport 222 -j DROP
Listing 4.13: IP Tables rules for gBSD-based intercepting adaptation proxy
iptables -A FORWARD -p udp --dport 4950 --source 192.168.0.1 -j CONNMARK --set-mark 1
iptables -A FORWARD -p udp --dport 4951 --source 192.168.0.1 -j CONNMARK --set-mark 1
/sbin/iptables -A FORWARD -m connmark --mark 1 -j QUEUE
Since our proxy acts as an intercepting adaptation proxy which is placed between the server and client, the packets which traverse it are not intended for a local application and are therefore put into the FORWARD chain. Thus, the proxy should process all packets which are put into the FORWARD chain in order to determine whether an adaptation is needed. To this end we use another target offered by the IP Tables implementation, i.e., the QUEUE target. If a packet matches a rule with this target, it is queued for user space processing instead of being immediately accepted or dropped. This user space application is the implementation of our intercepting adaptation proxy, which processes the packet and decides whether to drop it or to accept it. It might also first modify the packet before accepting it, e.g., removing the part of the packet which belongs to a specific enhancement layer. Listing 4.13 shows our corresponding IP Tables rules. The first two rules mark every packet that has a specific source address and destination port. This marking is performed using the CONNMARK target. The third rule then queues all marked packets for user space processing. Note that there are rules for two different destination ports, since
the media and gBSD packets are sent as separate streams as described in Section 4.4. For simplicity we mark packets based on their destination port and source IP. Depending on the deployment of the adaptation proxy, other rules may be used. For a complete overview of IP Tables and possible rules, please consult [80]. By relying on such an optimized and proven mechanism, we can assume that the overhead it introduces is minimal, which is a good premise for the measurements conducted in Section 6.6.
4.5.4.2
Intercepting adaptation proxy architecture
The intercepting adaptation proxy has been tested for the adaptation of
• MPEG-4 VES (see Section 2.2.5)
• MPEG-4 SVC (see Section 2.2.2)
• MPEG-4 BSAC (see Section 2.2.4)
Figure 4.11 shows the architecture of this adaptation framework and a walkthrough for this framework looks as follows.
1. The server reads a sample of the bitstream from the hard disk (NALU, Frame, VOP).
2. The server generates a gBSD PU for the sample.
3. The server packetizes the sample and the gBSD PU into UDP packets.
4. The server sends the sample and gBSD PU as UDP packets towards the terminal in a time-aware fashion.
5. The adaptation proxy captures the packet based on Linux firewall mechanisms.
6. The adaptation proxy transforms the gBSD based on regular expressions as detailed in Section 4.5.4.3.
7. The adaptation proxy modifies the packet’s data (truncation of NALUs, VOPs, or BSAC enhancement layers) using the normative gBSDtoBin processor.
8. The adaptation proxy updates the UDP header according to the modified data (checksum, length).
Figure 4.11: Codec-agnostic gBSD-based adaptation in an intercepting adaptation proxy

9. The adaptation proxy forwards the modified media packet to the client (based on Linux firewall mechanisms) or drops the complete packet (based on the adaptation decision).
10. The client stores the adapted bitstream to the hard disk where it can be decoded to validate the adaptation.
Note that the adaptation granularity for this proxy is as follows. For SVC the adaptation granularity is at NALU level and thus each gBSD PU describes a NALU. For VES the adaptation granularity is at picture level (i.e., a frame) and thus each gBSD PU describes a picture. For BSAC the adaptation granularity is at frame level and thus each gBSD PU describes a single frame. In this implementation we operate on the UDP layer rather than on the RTP layer. This keeps the complexity of the implementation low in order to focus on measuring the performance of the actual adaptation process. We note that since this adaptation node is transparent to the client, unlike the adaptation proxy in Section 4.5.3 no direct usage environment feedback from the terminal is possible. The adaptation decision can therefore only be based on generalized factors (as also mentioned in Sections 2.4.3 and 2.4.2) such as:
• Preconfigured knowledge of the usage environment, e.g., maximum display size of terminals connected to the network which the proxy serves.
• Buffer size of the adaptation proxy.
• Currently available bandwidth as provided by a bandwidth measurement tool.
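The UDP header fix-up in step 8 of the walkthrough can be sketched as follows. This Python version is our illustration (the actual proxy is a native user-space program); it replaces the payload, rewrites the 8-byte UDP header defined in RFC 768, and zeroes the checksum, which for IPv4 signals “no checksum computed” — a production implementation would recompute the checksum over the IP pseudo-header instead:

```python
import struct

def update_udp_header(packet: bytes, new_payload: bytes) -> bytes:
    """Replace a UDP packet's payload and fix up the header.

    The UDP header is 8 bytes: source port, destination port,
    length (header + payload), checksum (RFC 768). The checksum is
    set to 0 here, which for IPv4 means "not computed".
    """
    src_port, dst_port = struct.unpack("!HH", packet[:4])
    length = 8 + len(new_payload)          # length covers header + payload
    header = struct.pack("!HHHH", src_port, dst_port, length, 0)
    return header + new_payload
```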
4.5.4.3
gBSD transformation using regular expressions
As shown in Section 6.6, the gBSD transformation and the gBSDtoBin process cause a considerable amount of processing load, which is multiplied by our dynamic adaptation approach based on Streaming Instructions, since the transformation does not happen only once but once for each media fragment (e.g., a GoP). As introduced in Section 3.3, the method for the transformation of the gBSD is not specified in the DIA standard [3]. Current implementations use either XSLT or STX for the transformation process. Both XSLT and STX support the codec-agnostic adaptation approach by providing a generic transformation process that is controlled by a codec-specific style sheet which can be provided together with the media content. Regular expressions [81] allow a pattern to be specified for matching/replacing a substring in a string. They correspond to a type 3 grammar according to the Chomsky hierarchy [82], which generates a regular language recognizable by a finite state automaton. Regular expressions are thus much less expressive than XSLT, which is Turing-complete [83] and therefore represents a type 0 grammar. Similar to style sheets, regular expressions can be provided together with the multimedia content, only requiring a generic regular expression processor at the adaptation node, thus also supporting the codec-agnostic adaptation method. Figure 4.12 shows the changes (compared to the traditional gBSD-based adaptation approach as shown in Figure 4.8) which are caused by adopting regular expressions. As can be seen, these changes affect the PU transformation, which is now based on regular expressions rather than XSLT. Furthermore, the AQoS description needs to include the regular expressions as output parameters. In order to test the applicability of this approach to our application scenario of transforming gBSD PUs, we implemented XSLT and STX style sheets for SVC and BSAC adaptation and then tried to realize the same functionality using regular expressions.
We show an example gBSD PU for SVC in Listing 4.14, which describes the start and length of every NALU together with a marker which indicates priority, temporal id, spatial id and quality id, thus identifying which enhancement layer the described NALU belongs to. For SVC, the transformation involves disregarding gBSDUnits from the gBSD (PUs) if the value of the marker indicates that the NALU belongs to a layer which shall be dropped according to the adaptation decision. For BSAC, the transformation additionally requires updating certain values in the gBSD PUs. Listings 4.15 and 4.16 show the XSLT and STX style sheets for SVC adaptation and
Figure 4.12: Dynamic gBSD-based adaptation approach using regular expressions
Listing 4.17 shows the transformed gBSD PU after applying either of the two style sheets.
Listing 4.14: gBSD PU describing an SVC AU
Listing 4.15: XSLT style sheet describing the transformation of SVC
<!-- Match all: default template -->
<!-- Test and remove layers -->
Listing 4.16: STX style sheet describing the transformation of SVC
<!-- Parameters declaration -->
<stx:if test="not(substring-before(substring-after(@marker, 'P'), ':T') &gt;= $PriorityLimit and substring-before(substring-after(@marker, ':T'), ':S') &gt;= $TemporalLimit and substring-before(substring-after(@marker, ':S'), ':Q') &gt;= $SpatialCGSLimit and substring-after(@marker, ':Q') &gt;= $QualityLimit)">
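The layer test encoded in these style sheets can also be stated in plain code, which may make the marker semantics easier to follow. The following Python sketch is our illustration (the thesis implementations use XSLT/STX; the helper names are ours): it parses a marker of the form `P0:T0:S0:Q0` and mirrors the style sheets' condition, i.e., a gBSDUnit is dropped only when all four of its ids reach the corresponding limits:

```python
import re

def parse_marker(marker: str) -> dict:
    """Split a marker like 'P0:T1:S0:Q2' into its four layer ids."""
    m = re.fullmatch(r"P(\d+):T(\d+):S(\d+):Q(\d+)", marker)
    if m is None:
        raise ValueError("malformed marker: " + marker)
    p, t, s, q = map(int, m.groups())
    return {"priority": p, "temporal": t, "spatial": s, "quality": q}

def keep_unit(marker, priority_limit, temporal_limit, spatial_limit, quality_limit):
    """Mirror the style sheets' test: keep the gBSDUnit unless ALL of
    its ids are >= the corresponding limits."""
    ids = parse_marker(marker)
    return not (ids["priority"] >= priority_limit and
                ids["temporal"] >= temporal_limit and
                ids["spatial"] >= spatial_limit and
                ids["quality"] >= quality_limit)
```

For example, dropping all quality enhancement layers with id ≥ 1 corresponds to the limits (0, 0, 0, 1), since priority, temporal and spatial ids are always ≥ 0.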
For both cases, we were able to implement the corresponding regular expressions. Listing 4.18 shows the regular expression for SVC. This regular expression matches gBSDUnits
with a quality id between 1 and 9, i.e., it removes quality enhancement layers 1 and 2 in our example. However, we encountered certain limitations, which we describe below together with our approaches to counter them.
Listing 4.17: Transformed gBSD PU describing an SVC AU - quality layers 1 and 2 are dropped
Listing 4.18: Regular Expression for SVC
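The effect of such a substitution can be sketched in a few lines of Python. This is our illustration, not the thesis code; the attribute layout of the gBSDUnit elements (`start`, `length`, `marker`, self-closing) is an assumption based on the listings in this chapter:

```python
import re

def drop_quality_layers(gbsd_pu: str, min_qid: int = 1, max_qid: int = 9) -> str:
    """Delete self-closing gBSDUnit elements whose marker carries a
    quality id in [min_qid, max_qid] (single-digit ids assumed),
    e.g. quality enhancement layers 1 and 2 when three quality
    layers (0..2) exist."""
    pattern = re.compile(
        r'<gBSDUnit[^>]*marker="P\d+:T\d+:S\d+:Q[%d-%d]"[^>]*/>\s*'
        % (min_qid, max_qid))
    return pattern.sub("", gbsd_pu)
```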
Unlike style sheets, regular expressions are by themselves not parameterizable, which is, however, needed to implement a certain adaptation decision provided by the ADTE. There are two solutions to this problem. The obvious solution would be to extend regular expressions to be parameterizable, i.e., to introduce placeholders into the regular expressions which are replaced by the output from the ADTE. This replacement, i.e., the customization of the regular expression, can again be performed by a regular expression. However, in order to enable this, additional control structures which steer this customization of the regular expression are necessary. These are traditionally provided by the programming language which uses the regular expressions and would need to be defined, since they are not available in the generic regular expression processor in our application scenario. A simpler solution is proposed which does not require any extensions to the normative regular expressions. The ADTE is a generic process which is steered by the AQoS. One possible layout for the AQoS is to contain tables which map a specific UED to an adaptation decision, e.g., the number of quality layers which shall be dropped from the media content in case the available bandwidth drops to a certain value. We propose to extend the
AQoS description to include a regular expression instead of the number of quality layers in the above example. This design change of the AQoS, which does not imply any changes to the DIA standard, allows the generic ADTE to provide regular expressions which steer the transformation of the gBSD (PUs) in order to react to the given UED. In order to illustrate this approach, a simple example of an AQoS description which includes regular expressions is shown in Listing 4.19. This AQoS description provides a mapping from different types of devices (which are identified through a classification scheme) to regular expressions which remove spatial enhancement layers according to the device class. That is, this AQoS description describes which spatial resolution shall be used for a specific device class. For this AQoS description, the mapping is very simple, i.e., the first device class maps to the first regular expression, the second device class maps to the second regular expression, and so on. For example, the UED in Listing 4.20 indicates that the terminal is a PDA. This information is used by the ADTE, which processes the AQoS description and computes that for the PDA the second regular expression from the AQoS description should be used. This regular expression is then used to steer the adaptation by removing all spatial enhancement layers with ID ≥ 2 and thus reducing the media quality to a spatial resolution which can be processed by the PDA.
Listing 4.19: AQoS description with regular expressions
<!-- PC -->
urn:mpeg:mpeg21:2003:01-DIA-DeviceClassCS-NS:1
<!-- PDA -->
urn:mpeg:mpeg21:2003:01-DIA-DeviceClassCS-NS:2
<!-- Mobile Phone -->
urn:mpeg:mpeg21:2003:01-DIA-DeviceClassCS-NS:5
s/&lt;(.*?)gBSDUnit(.*?)marker=\"P[0-9]:T[0-9]:S[3-9]:Q[0-9]\"/&gt;//
s/&lt;(.*?)gBSDUnit(.*?)marker=\"P[0-9]:T[0-9]:S[2-9]:Q[0-9]\"/&gt;//
s/&lt;(.*?)gBSDUnit(.*?)marker=\"P[0-9]:T[0-9]:S[1-9]:Q[0-9]\"/&gt;//
Listing 4.20: UED which indicates the device class
<!-- DeviceClass 1 is a PC, 2 is a PDA and 5 a mobile phone -->
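The ADTE lookup described above amounts to a simple table. The following Python fragment is our illustration of the idea (the names and the exact regular expressions are assumptions modeled on Listing 4.19, not the thesis implementation): the device class URN from the UED selects the regular expression that removes the corresponding spatial enhancement layers:

```python
import re

# Hypothetical AQoS content: device class URN -> layer-removal regex,
# modeled on Listing 4.19 (PC keeps all layers, PDA drops S >= 2,
# mobile phone drops S >= 1). Markers are assumed single-digit.
AQOS = {
    "urn:mpeg:mpeg21:2003:01-DIA-DeviceClassCS-NS:1":  # PC
        r'<gBSDUnit[^>]*marker="P\d:T\d:S[3-9]:Q\d"[^>]*/>\s*',
    "urn:mpeg:mpeg21:2003:01-DIA-DeviceClassCS-NS:2":  # PDA
        r'<gBSDUnit[^>]*marker="P\d:T\d:S[2-9]:Q\d"[^>]*/>\s*',
    "urn:mpeg:mpeg21:2003:01-DIA-DeviceClassCS-NS:5":  # mobile phone
        r'<gBSDUnit[^>]*marker="P\d:T\d:S[1-9]:Q\d"[^>]*/>\s*',
}

def adapt_for_device(gbsd_pu: str, device_class_urn: str) -> str:
    """Pick the regex for the UED's device class and apply it to the PU."""
    return re.sub(AQOS[device_class_urn], "", gbsd_pu)
```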
The BSAC requirement to update certain values in the gBSD (PUs) leads to an additional requirement. That is, the regular expressions need to indicate whether they replace the matching substring by an empty string (i.e., disregard elements) or by another string (i.e., update values). For this we propose to adopt the corresponding Perl syntax as shown
Perl, http://www.perl.org
in Listings 4.21 and 4.19.
Listing 4.21: Perl syntax for replacing a regular expression match by a string
s///
To conclude, the design decisions described above allow the regular expressions to fulfill the same tasks as the style sheets for our application, but with improved performance as shown in Section 6.4. It must be noted that while our approach of using regular expressions for XML transformation results in improved performance, as described in Section 6.4, it is also much less flexible than the existing XSLT- and STX-based approaches. Adaptation scenarios which require complex updates of the Process Units may lead to increasingly complex regular expressions. Additionally, regular expressions are much less readable than XSLT or STX style sheets.
4.5.4.4
Optimizing gBSDtoBin
As shown in Section 6.6, the gBSDtoBin process causes a considerable amount of processing load, which is multiplied by our dynamic adaptation approach based on Streaming Instructions. This is mainly due to the amount of memory copy operations which are needed to implement the process. One approach to counter this is to write an optimized implementation of the gBSDtoBin process [19]. The alternative, which we implement in this thesis, is to avoid running the gBSDtoBin process where possible. For example, if no adaptation takes place, i.e., if the original gBSD corresponds to the transformed gBSD, then no gBSDtoBin process is needed. Additionally, when adaptation is performed at a very fine granularity (frame for VES, NALU for SVC), then, at least for VES and SVC, the adaptation becomes trivial, i.e., the corresponding frame or NALU is either dropped or kept, and therefore gBSDtoBin can again be avoided. This unfortunately is not possible for BSAC, where only parts of the sample are truncated; thus, for BSAC, gBSDtoBin always needs to run. In this particular case, an optimized implementation of the gBSDtoBin processor could lead to an increase in performance. It must be noted that this optimization of the gBSDtoBin process can only be applied when the adaptation is performed at a very fine granularity and thus leads to increased metadata overhead. This limitation is further discussed in Section 6.6.3.
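The shortcut described above amounts to a simple check before each adaptation step. A possible sketch in Python follows (illustrative only; the real processor follows the normative DIA gBSDtoBin process, and `run_gbsdtobin` stands in for it here):

```python
def adapt_sample(sample, original_pu, transformed_pu, run_gbsdtobin):
    """Skip the costly gBSDtoBin step where the outcome is trivial.

    - PU unchanged -> no adaptation took place, return the sample as-is.
    - PU left with no gBSDUnit -> the whole sample was dropped
      (possible for frame-/NALU-granular VES and SVC adaptation).
    - otherwise (e.g. BSAC truncation) -> gBSDtoBin must run.
    """
    if transformed_pu == original_pu:
        return sample                      # nothing to do
    if "<gBSDUnit" not in transformed_pu:
        return None                        # drop the whole sample
    return run_gbsdtobin(sample, transformed_pu)
```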
4.5.5
Summary
In this section we described how the novel concepts regarding metadata fragmentation, synchronization and processing introduced in this chapter can be integrated with the gBSD-based adaptation approach. This results in an architecture for dynamic and distributed gBSD-based adaptation. We introduced a regular proxy and an intercepting proxy implementation which rely on this architecture, and described several optimizations which we applied to the gBSD-based adaptation approach.
4.6
Conclusions and original contributions
In this chapter we introduced the Streaming Instructions, which represent a novel mechanism for the fragmentation and transport of content-related XML metadata and its synchronization with the described media. One particular contribution is the introduction of the concept of “samples” for metadata by employing Streaming Instructions for XML-based metadata. We showed how the Real-time Transport Protocol can be used to transport the PUs in synchronization with the media samples. We also introduced our enhancements of the ISO Base Media File Format which enable the synchronized storage of PUs and media samples. In order to validate these novel concepts, we implemented them in two different adaptation nodes, i.e., a regular adaptation proxy and an intercepting adaptation proxy. The second implementation uses regular expressions rather than XSLT or STX for the transformation of the gBSD, which represents a novel approach to transforming gBSDs. We also pointed out the limitations of the tools introduced in this chapter, which may lead to future work items.
CHAPTER 5
Dynamic and distributed adaptation of scalable media based on the Generic Scalability Header
5.1
Motivation and scope
The DIA approach to enabling codec-agnostic adaptation is conceptually very pleasing, since it follows a modular approach to fulfilling all the requirements by specific tools. In particular, by transferring the adaptation into the XML domain, one can benefit from a large number of existing tools, and XML’s readability also enables quick development for prototyping. However, these benefits come at the cost of performance, i.e., this modular approach may not result in the most performant solution. We therefore aim to enhance the efficiency of a codec-agnostic adaptation framework by implementing the concepts of DIA in a performance-focused fashion. In this chapter we therefore introduce novel tools and mechanisms which aim to enable a high-performance codec-agnostic adaptation framework. In particular, we introduce a mechanism based on a novel binary header which prefixes each media packet payload and enables codec-agnostic adaptation of media content. While this Generic Scalability Header (GSH) mechanism is based on the ideas behind MPEG-21 DIA, it aims to enable codec-agnostic adaptation at a considerably lower performance cost. The basic approach is to prefix each media packet payload with a GSH which conveys adaptation-focused information on the contents of the packet, such as priority, scalability dimension or FGS layer truncation points. Each media packet which is described by a GSH is referred to as a “scalability unit”.
CHAPTER 5. DYNAMIC AND DISTRIBUTED ADAPTATION OF SCALABLE MEDIA BASED ON THE GENERIC SCALABILITY HEADER Page 101
5.2
Related work
Several approaches for adapting streamed scalable media in the network exist, some of which are shortly reviewed here. gBSD-based adaptation as introduced in Chapter 3 and extended for dynamic and distributed adaptation in Chapter 4 provides the adaptation metadata in the XML domain. Moreover, there are codec-specific mechanisms, such as a) the MPEG-4 SVC RTP payload header [47] or b) a stream extractor, which solely rely on metadata available in the media bitstream. There are also completely media-agnostic mechanisms such as the Type of Service (ToS) field [84] of the Internet Protocol, with DiffServ [85] using it to enable a complete QoS framework. The ToS field was originally specified as 8 bits in the IP header, divided into subfields: 3 bits for Precedence (i.e., the importance of the datagram) and 1 bit each for requesting low delay, high throughput, and/or high reliability. The last 2 bits are nowadays used for Explicit Congestion Notification (ECN), which enables end-to-end notification of congestion without dropping packets. DiffServ redefines the first 6 bits of the ToS field, which now correspond to a six-bit Differentiated Services Code Point (DSCP) that is used to classify the packet. This is the basis of the DiffServ framework, which relies on traffic classification, where each data packet is placed into a traffic class. Each router on such a DiffServ-enabled network is configured to treat packets of each class differently based on their classification. gBSD-based adaptation also represents related work to the GSH, as extensively described in Chapter 4 of this thesis. The MPEG-4 SVC RTP payload header includes the NALU header, which co-serves as the RTP payload header. In addition to the NALU header, the SVC RTP payload header introduces a new NALU, i.e., the Payload Content Scalability Information (PACSI) NALU.
Rather than describing the contents of a single NALU, this NALU describes the largest common denominator of all NALUs which are included in an RTP packet, i.e., the PACSI header includes scalability characteristics that are the same for all the remaining NAL units in the payload of the packet. As such it is only used in aggregation packets which include more than one NALU. Finally, a stream extractor directly parses and interprets the NALU header, Parameter Sets, etc. in order to process the bitstream. It is therefore very media-specific, but can exploit the full scalability features of the media content.
The GSH mechanism attempts to combine the strengths of the different approaches described above. It relies on the codec-agnostic adaptation approach, which ensures its applicability to past and future scalable codecs. At the same time, the scalability features of the media content are comprehensively described in order to fully exploit them. However, it also attempts to accomplish this with minimal metadata and processing overhead. Thus, by attempting to combine the strengths and to eliminate the weaknesses of the above mechanisms, the GSH represents a novel approach to codec-agnostic adaptation.
5.3
Syntax and semantics of the Generic Scalability Header
In this section we introduce the syntax and semantics of the different fields and flags of the GSH. The GSH consists of an optional identifier, a base header and a number of optional extension headers.
5.3.1
Identifier
An Identifier field is needed if both regular and GSH-prefixed packets traverse a specific channel (identified, e.g., by UDP port). In this case the Identifier makes it possible to recognize GSH-prefixed packets. Note that this results in an overhead of 4 bytes, which can be considerable depending on the packet size. As such, it should be avoided if possible. One way to avoid this header is to send only GSH-prefixed packets over a specific channel and to, for example, announce this in the SDP. The GSH-based adaptation node would then either be pre-configured not to look for the Identifier, or a dedicated configuration step needs to be performed prior to a media session.
Listing 5.1: Identifier
     Byte 0          Byte 1          Byte 2          Byte 3
+---------------+---------------+---------------+---------------+
|0|1|2|3|4|5|6|7|0|1|2|3|4|5|6|7|0|1|2|3|4|5|6|7|0|1|2|3|4|5|6|7|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                      Identifier ("!GSH")                      |
+---------------+---------------+---------------+---------------+
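Detecting the optional Identifier is a matter of checking the first 4 bytes of the payload. The following Python sketch is illustrative only; it reads "!GSH" from Listing 5.1 as a 4-byte ASCII magic value, which is our interpretation of the listing:

```python
GSH_MAGIC = b"!GSH"  # 4-byte Identifier from Listing 5.1

def split_identifier(packet: bytes):
    """Return (is_gsh, rest): if the payload starts with the magic,
    strip it; otherwise treat the packet as a regular one."""
    if packet[:4] == GSH_MAGIC:
        return True, packet[4:]
    return False, packet
```

On a channel that carries only GSH-prefixed packets, announced via the SDP as described above, this check (and the 4-byte overhead) can be skipped entirely.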
5.3.2
Base Header
The pr id field specifies the priority of the packet. A lower value indicates a higher priority, e.g., a packet with priority 3 should rather be dropped than a packet with priority 2. The
pr id field provides means to describe adaptation paths for a given media stream, e.g., it can describe that all packets which belong to a specific scalability dimension shall be dropped first. The pr id field consists of 5 bits, which allow for 32 different priority levels in a stream.
Listing 5.2: Base Header
     Byte 0          Byte 1          Byte 2
+---------------+---------------+---------------+
|0|1|2|3|4|5|6|7|0|1|2|3|4|5|6|7|0|1|2|3|4|5|6|7|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| pr id   |s|b|l| t id  | s id  | q id  |u|t|f|D|N|
+---------------+---------------+---------------+
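Unpacking the 3-byte base header can be sketched as follows. Note that only the width of pr id (5 bits) is stated above; the remaining bit widths in this Python illustration (1-bit s/b/l flags, 4-bit t id and s id, 3-bit q id, 1-bit u/t/f/D/N flags) are our assumptions, chosen so the fields fill exactly 24 bits in the order shown in Listing 5.2:

```python
def parse_base_header(data: bytes) -> dict:
    """Unpack the 3-byte GSH base header.

    Assumed bit layout (only pr_id's 5 bits are stated in the text;
    the remaining widths are illustrative assumptions):
      pr_id:5 s:1 b:1 l:1 | t_id:4 s_id:4 | q_id:3 u:1 t:1 f:1 D:1 N:1
    """
    if len(data) < 3:
        raise ValueError("base header needs 3 bytes")
    bits = int.from_bytes(data[:3], "big")
    return {
        "pr_id": (bits >> 19) & 0x1F,
        "s": (bits >> 18) & 1,
        "b": (bits >> 17) & 1,
        "l": (bits >> 16) & 1,
        "t_id": (bits >> 12) & 0xF,
        "s_id": (bits >> 8) & 0xF,
        "q_id": (bits >> 5) & 0x7,
        "u": (bits >> 4) & 1,
        "t": (bits >> 3) & 1,
        "f": (bits >> 2) & 1,
        "D": (bits >> 1) & 1,
        "N": bits & 1,
    }
```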
Note that the pr id, together with t id, s id and q id, may describe several adaptation paths, depending on at which quality the adaptation is started, as described in Section 2.2.2.5. Section 5.4.1 provides an example of this case. The t id field indicates the temporal priority of a packet. A lower value indicates a higher priority. Additionally, packets with a lower temporal priority may depend on packets with a higher temporal priority. That means, if packets belonging to the temporal dimension should be dropped, then those with the lowest priority should be dropped first in order to avoid decoding errors due to non-existing reference samples. Additionally, each t id field indicates the temporal resolution which a particular packet belongs to, as shown in Table 5.1. Note that if the actual temporal resolution does not exactly match a value in the table, then the nearest neighbor should be chosen. For example, if a video is streamed at 18 FPS at a particular temporal scalability layer, then the temporal priority 6 should be chosen to identify this temporal layer. This mapping from temporal priority to (approximate) temporal resolution provides a straightforward way to take adaptation decisions based on the supported temporal resolution of the end device. As such, this mapping fulfills a similar task as the DIA AQoS description, which also provides such mappings. Note that we used common temporal resolutions for this mapping, based on SVC encoding experiments. However, we also considered the common TV frame rates for PAL (25 FPS) and NTSC (30 FPS). The s id field indicates the spatial priority of a packet. A lower value indicates a higher priority. Additionally, packets with a lower spatial priority may depend on packets with a higher spatial priority. That means, if packets belonging to the spatial dimension should be dropped, then those with the lowest priority should be dropped first in order to avoid
CHAPTER 5. DYNAMIC AND DISTRIBUTED ADAPTATION OF SCALABLE MEDIA BASED ON THE GENERIC SCALABILITY HEADER Page 104
Table 5.1: Mapping of temporal priority to temporal resolution

Temporal priority | Temporal resolution (FPS)
0                 | ≤ 3
1                 | 3
2                 | 5
3                 | 7.5
4                 | 12
5                 | 15
6                 | 20
7                 | 24
8                 | 30
9                 | 40
10                | 50
11                | 60
12 - 15           | ≥ 60
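The nearest-neighbor rule can be sketched as follows; this is an illustrative C fragment (the function name and the table encoding are ours, not part of the GSH specification), using the anchor rates of Table 5.1 for priorities 0 to 11:

```c
#include <math.h>

/* Anchor frame rates of Table 5.1 for temporal priorities 0..11; priority 0
 * stands for "<= 3 FPS". Illustrative sketch, not part of the GSH spec. */
static const double anchor_fps[12] =
    { 3.0, 3.0, 5.0, 7.5, 12.0, 15.0, 20.0, 24.0, 30.0, 40.0, 50.0, 60.0 };

/* Pick the temporal priority whose anchor rate is nearest to `fps`.
 * E.g., 18 FPS lies between 15 (priority 5) and 20 (priority 6);
 * 20 is nearer, so priority 6 is chosen, as in the example above.
 * (Rates above 60 FPS would belong to the 12-15 range of the table;
 * for simplicity this sketch folds them onto priority 11.) */
int temporal_priority_for_fps(double fps)
{
    int best = 0;
    double best_diff = fabs(fps - anchor_fps[0]);
    for (int i = 1; i < 12; i++) {
        double d = fabs(fps - anchor_fps[i]);
        if (d < best_diff) {
            best_diff = d;
            best = i;
        }
    }
    return best;
}
```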
decoding errors due to non-existing reference samples. Additionally, each s id field indicates the spatial resolution which a particular packet belongs to, as shown in Table 5.2. Note that if the actual spatial resolution does not exactly match a value in the table, then the nearest neighbor should be chosen. For example, if a video is streamed with a resolution of 640x450 pixels at a particular spatial scalability layer, then the spatial priority 4 should be chosen to identify this spatial layer. Similar to above, this mapping from spatial priority to (approximate) spatial resolution provides a straightforward way to take adaptation decisions based on the supported spatial resolution of the end device. As such, this mapping fulfills a similar task to the DIA AQoS description, which also provides such mappings. Note that we used common spatial resolutions for this mapping.

The q id field indicates the quality priority of a packet. A lower value indicates a higher priority. Additionally, packets with a lower quality priority may depend on packets with a higher quality priority. That means that if packets belonging to the quality dimension should be dropped, then those with the lowest priority should be dropped first in order to avoid decoding errors due to non-existing reference samples.

In addition to these base header fields, there are a number of flags which indicate the presence of additional headers, and one flag indicating the discardability of the current scalability unit, as shown in Table 5.3. Examples of this header can be found in Sections 5.4.1, 5.4.2 and 5.4.3.
Table 5.2: Mapping of spatial priority to spatial resolution

Spatial priority | Spatial resolution
0                | SQCIF
1                | QCIF
2                | CIF
3                | 4CIF
4                | VGA
5                | NTSC
6                | PAL
7                | SVGA
8                | XGA
9                | HD720
10               | SXGA
11               | 16CIF
12               | UXGA
13               | HD1080
14 - 15          | ≥ HD1080
Table 5.3: Flags in the GSH base header

Flag | Semantics
s    | If set to 1, indicates the presence of the scalability unit size field
b    | If set to 1, indicates the presence of the bitrate info field
l    | If set to 1, indicates the presence of the layer boundaries field
u    | If set to 1, indicates the presence of the update data length field
t    | If set to 1, indicates the presence of the update truncation points field
f    | If set to 1, indicates the presence of the truncation points location field
D    | If set to 1, indicates that no other scalability unit depends on the current scalability unit, i.e., the current scalability unit can be dropped without causing any decoding problems
N    | Reserved for future use
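To illustrate how compactly these fields can be accessed, the following C sketch extracts the byte 0 and byte 1 fields of the base header of Listing 5.2 (pr id in the upper five bits of byte 0, followed by the s, b and l flags; t id and s id sharing byte 1). The struct and function names are ours, and the layout of byte 2 is omitted here:

```c
#include <stdint.h>

/* Fields of the first two GSH base header bytes (Listing 5.2).
 * Illustrative sketch; names are not normative. */
struct gsh_base {
    uint8_t pr_id;       /* 5-bit priority id, 32 levels   */
    uint8_t s, b, l;     /* presence flags (see Table 5.3) */
    uint8_t t_id, s_id;  /* temporal and spatial priority  */
};

void gsh_parse_base(const uint8_t hdr[2], struct gsh_base *out)
{
    out->pr_id = hdr[0] >> 3;        /* upper 5 bits of byte 0        */
    out->s     = (hdr[0] >> 2) & 1;  /* scalability unit size present */
    out->b     = (hdr[0] >> 1) & 1;  /* bitrate info present          */
    out->l     =  hdr[0]       & 1;  /* layer boundaries present      */
    out->t_id  = hdr[1] >> 4;        /* upper 4 bits of byte 1        */
    out->s_id  = hdr[1] & 0x0F;      /* lower 4 bits of byte 1        */
}
```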
5.3.3 Scalability Unit Size
The scalability unit size field provides the size in bytes of the current scalability unit. This is useful if multiple scalability units are aggregated into a single packet, e.g., for throughput reasons. In this case, knowing the scalability unit size enables the adaptation node to quickly jump to the next GSH in the packet. It also makes it possible to determine whether the current scalability unit is the last one in the packet, by summing up the sizes of all scalability units in the packet and comparing the sum with the packet size.

Listing 5.3: Scalability Unit Size

if (s == 1) {
     Byte 0          Byte 1
    +---------------+---------------+
    |0|1|2|3|4|5|6|7|0|1|2|3|4|5|6|7|
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |     scalability unit size     |
    +---------------+---------------+
}
Note that this is different from the case of aggregating multiple semantically identical scalability units (e.g., all scalability units of a specific temporal layer) into a single scalability unit, in which case there would only be a single GSH for this aggregated scalability unit.
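The jump-to-next-GSH logic can be sketched as follows. For brevity, this illustrative fragment assumes that each aggregated unit is reduced to its 16-bit scalability unit size field followed by the payload that the size covers; a real parser would of course read the full base header first.

```c
#include <stddef.h>
#include <stdint.h>

/* Walk the scalability units aggregated in one packet and count them.
 * Simplifying assumption (for illustration only): each unit consists of a
 * big-endian 16-bit scalability unit size field followed by that many
 * payload bytes. The sizes summing up to the packet size identifies the
 * last unit in the packet. */
size_t count_units(const uint8_t *pkt, size_t pkt_len)
{
    size_t off = 0, n = 0;
    while (off + 2 <= pkt_len) {
        uint16_t unit_size = (uint16_t)((pkt[off] << 8) | pkt[off + 1]);
        off += 2 + unit_size;     /* jump to the next GSH */
        if (off > pkt_len)
            break;                /* malformed: size exceeds packet */
        n++;
        if (off == pkt_len)
            break;                /* this was the last unit */
    }
    return n;
}
```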
5.3.4 Bitrate Info
The layer avg bitrate and the decoded layer avg bitrate fields provide information on the bitrate of the media stream. The layer avg bitrate field indicates the average bitrate needed by all scalability units which belong to the current scalability layer, i.e., which have identical values for the t id, s id and q id fields. The decoded layer avg bitrate field, in turn, provides the average bitrate needed by all scalability units which belong to the current scalability layer (i.e., the same as layer avg bitrate) plus the bitrate of all scalability units which the current scalability layer depends on. For example, consider a scalable video stream with 4 spatial layers (SQCIF, QCIF, CIF and 4CIF) which needs a bitrate of 4250 kbps. Of these 4250 kbps, 50 kbps are needed for the SQCIF layer, 200 kbps for the QCIF layer, 800 kbps for the CIF layer and 3200 kbps for the 4CIF layer. Table 5.4 shows the values of layer avg bitrate and decoded layer avg bitrate for all spatial layers, which enables an adaptation node to take decisions, e.g., with the available bandwidth as a constraint.
Table 5.4: Example of layer avg bitrate and decoded layer avg bitrate values

Spatial layer | layer avg bitrate (kbps) | decoded layer avg bitrate (kbps)
SQCIF         | 50                       | 50
QCIF          | 200                      | 250
CIF           | 800                      | 1050
4CIF          | 3200                     | 4250
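The relation between the two fields is a running sum along the dependency chain, as the following sketch (with illustrative names, for a single-dimension chain such as the spatial layers of Table 5.4) shows:

```c
#include <stddef.h>

/* decoded_avg[i] = layer_avg[i] plus the bitrates of all layers that
 * layer i depends on; here layer 0 is the base layer and each layer
 * depends on all lower ones. */
void decoded_bitrates(const double *layer_avg, double *decoded_avg, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += layer_avg[i];   /* accumulate along the dependency chain */
        decoded_avg[i] = sum;
    }
}
```

For the values of Table 5.4 this reproduces 50, 250, 1050 and 4250 kbps.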
Note that for media streams which support more than one scalability dimension, the layer avg bitrate field depends on the adaptation path which is described by the priority id. Please refer to Section 5.4.1 for an example.

Listing 5.4: Bitrate Info

if (b == 1) {
     Byte 0          Byte 1          Byte 2          Byte 3
    +---------------+---------------+---------------+---------------+
    |0|1|2|3|4|5|6|7|0|1|2|3|4|5|6|7|0|1|2|3|4|5|6|7|0|1|2|3|4|5|6|7|
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |       layer avg bitrate       |   decoded layer avg bitrate   |
    +---------------+---------------+---------------+---------------+
}
Examples of this header can be found in Sections 5.4.1 and 5.4.3.
5.3.5 Layer Boundaries
The max t, max s, max q and max p fields provide the maximum values for t id, s id, q id and pr id, respectively. This information is valuable for an adaptation node for initializing the filtering rules which dictate which scalability units to keep or drop. Without knowing these maximum values, the adaptation node would first have to analyze the GSH stream to determine them, which can result in a start-up delay before the filtering rules become effective.

Listing 5.5: Layer Boundaries

if (l == 1) {
     Byte 0          Byte 1
    +---------------+---------------+
    |0|1|2|3|4|5|6|7|0|1|2|3|4|5|6|7|
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    | max t | max s | max q | max p |
    +---------------+---------------+
}
Thus, if a pr id based adaptation node already knows the maximum pr id used in the current media stream in advance, then it can set its pr id filter to this maximum value and reduce this limit in case there are problems (e.g., congestion), therefore effectively countering the problems by dropping those packets with the lowest priority. However, if it does not know this maximum value, then it can only set its pr id filter to the maximum possible value of 31 and reduce it if problems occur. This results in a potential start-up delay for the adaptation, because the adaptation node first has to analyze the stream in order to find out the maximum priority id value. Examples of this header can be found in Sections 5.4.1 and 5.4.3.
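This filtering strategy can be sketched as a tiny state machine (illustrative names; the drop-if-above semantics follows the description above):

```c
/* pr_id-based filter: packets with pr_id above the current limit are
 * dropped. The limit starts at max_p (known from the layer boundaries
 * header), is lowered on congestion and raised again on recovery. */
struct prid_filter {
    int limit;    /* current pass-through limit  */
    int max_p;    /* maximum pr_id in the stream */
};

void prid_init(struct prid_filter *f, int max_p)
{
    f->max_p = max_p;
    f->limit = max_p;             /* initially everything passes */
}

void prid_on_congestion(struct prid_filter *f)
{
    if (f->limit > 0)
        f->limit--;               /* drop one more priority level */
}

void prid_on_recovery(struct prid_filter *f)
{
    if (f->limit < f->max_p)
        f->limit++;               /* let one priority level pass again */
}

int prid_keep_packet(const struct prid_filter *f, int pr_id)
{
    return pr_id <= f->limit;
}
```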
5.3.6 Update Data Length
The update data length field consists of the byteAddress, off and len fields. Content encoded by some scalable codecs, such as the BSAC audio codec, requires the size of the sample to be included in its header. If this sample is truncated due to adaptation, then this header needs to be updated. Since the GSH-based adaptation node is not codec-aware, the address and length of the field which needs to be updated with the length of the adapted scalability unit need to be specified. This is done via the update data length field, by indicating the byte address, the bit offset (from the byte address) and the length in bits of this field. It is then the task of the adaptation node to fill this field with the new size of the scalability unit after adaptation.

Listing 5.6: Update Data Length

if (u == 1) {
     Byte 0          Byte 1          Byte 2
    +---------------+---------------+---------------+
    |0|1|2|3|4|5|6|7|0|1|2|3|4|5|6|7|0|1|2|3|4|5|6|7|
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |  byteAddress  |      off      |      len      |
    +---------------+---------------+---------------+
}
An example of this header can be found in Section 5.4.2.
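The update step itself, i.e., writing a value into the field located by byteAddress, off and len, can be sketched as follows (illustrative function name; MSB-first bit order within each byte is assumed):

```c
#include <stdint.h>

/* Write `value` into the `len`-bit field that starts `off` bits into byte
 * `byte_addr` of the scalability unit, most significant bit first. */
void gsh_write_field(uint8_t *unit, unsigned byte_addr, unsigned off,
                     unsigned len, uint32_t value)
{
    for (unsigned i = 0; i < len; i++) {
        unsigned bitpos = byte_addr * 8 + off + i;       /* absolute bit */
        uint8_t  mask   = (uint8_t)(0x80u >> (bitpos % 8));
        unsigned bit    = (value >> (len - 1 - i)) & 1u; /* MSB first    */
        if (bit)
            unit[bitpos / 8] |= mask;
        else
            unit[bitpos / 8] &= (uint8_t)~mask;
    }
}
```

For the BSAC example of Section 5.4.2, the adaptation node would, under this sketch, call gsh_write_field(sample, 0, 0, 11, new_size) to refresh the 11-bit frame length field after truncation.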
5.3.7 Update Truncation Points
The update truncation points field consists of the byteAddress, off and len fields. Contents encoded with some scalable codecs, such as the BSAC audio codec, require the number of truncation points to be included in their header. If some FGS layers of a sample are
truncated due to adaptation, then this header needs to be updated. Since the GSH-based adaptation node is not codec-aware, the address and length of the field which needs to be updated with the remaining number of truncation points of the adapted scalability unit need to be specified. This is done via the update truncation points field, by indicating the byte address, the bit offset and the length in bits of this field. It is then the task of the adaptation node to fill this field with the remaining number of truncation points of the scalability unit after adaptation.

Listing 5.7: Update Truncation Points

if (t == 1) {
     Byte 0          Byte 1          Byte 2
    +---------------+---------------+---------------+
    |0|1|2|3|4|5|6|7|0|1|2|3|4|5|6|7|0|1|2|3|4|5|6|7|
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |  byteAddress  |      off      |      len      |
    +---------------+---------------+---------------+
}
An example of this header can be found in Section 5.4.2.
5.3.8 Truncation Points Location
The truncation points location field makes it possible to describe an additional scalability dimension where the current scalability unit can be truncated at specific byte-aligned addresses for fine granular adaptation (i.e., dropping FGS layers). Each ByteAddress value indicates a truncation point. The GSH allows any number of such truncation points to be described; the end of the list is indicated by two 0 bytes which follow the last truncation point. If one or more FGS layers of the scalability unit are truncated and if the update truncation points field is present, then the value at the address indicated by this field needs to be updated accordingly (i.e., by subtracting the number of truncated FGS layers from this value).

Listing 5.8: FGS Info

if (f == 1) {
    while (byte[cur] != 0 || byte[cur+1] != 0) {
         Byte 0          Byte 1
        +---------------+---------------+
        |0|1|2|3|4|5|6|7|0|1|2|3|4|5|6|7|
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |        ByteAddress (16)       |
        +---------------+---------------+
    }
}
An example of this header can be found in Section 5.4.2.
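Parsing this list is straightforward; the following illustrative sketch collects the 16-bit ByteAddress values until the two terminating zero bytes:

```c
#include <stddef.h>
#include <stdint.h>

/* Read the truncation point list of Listing 5.8 into `out` (at most `max`
 * entries) and return the number of truncation points found. Two zero
 * bytes, i.e., a ByteAddress of 0, terminate the list. */
size_t read_truncation_points(const uint8_t *p, uint16_t *out, size_t max)
{
    size_t n = 0;
    while (n < max) {
        uint16_t addr = (uint16_t)((p[0] << 8) | p[1]);
        p += 2;
        if (addr == 0)
            break;                /* terminating zero bytes reached */
        out[n++] = addr;
    }
    return n;
}
```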
5.3.9 Summary
In this section we introduced the Generic Scalability Header. This in-band binary header prefixes each scalability unit. It aims to combine the information found in the AQoS and the gBSD from the gBSD-based adaptation approach, i.e., it contains information both on the high-level bitstream structure and on the resulting bitrates when a bitstream segment is removed.

It must be noted that what we describe above is the result of a lengthy design process. One of the driving factors in this design process was the desire to process the GSH efficiently and to keep the overhead resulting from this header as small as possible. In order to achieve this, all fields were designed to be byte-aligned. Real-world use cases were taken into account to minimize the size of the header fields and thus the size of the GSH. For example, scalable video streams were analyzed to determine the maximum size of the scalability units appearing in these streams and thus derive the size of the scalability unit size field. Similarly, the sizes of the pr id, t id, s id and q id fields, and thus of the max t, max s, max q and max p fields, were derived by similar analysis of real-world use cases.

Generally, the design of the GSH was driven by existing, codec-specific header fields. The existing SVC NALU header, BSAC header, VES header and EZBC header, among others, were analyzed together with the corresponding payload format headers. For SVC, the SEIs, and in particular the Scalability SEI, were analyzed, and the layer bitrate information fields from it were adopted into this header. This eases the adaptation process, since no SEIs need to be buffered and cross-referenced anymore to adapt based on the bitrate of the scalability layer. For the same reason, i.e., to ease the adaptation and to make it as efficient as possible, the mapping from t id to frame rate and from s id to spatial resolution was introduced.
This enables a very self-contained adaptation process which can support stateless adaptation nodes. Additionally, depending on the type of scalable media, not every field might be used. For example, the truncation points location field is only used in case the described media supports FGS. Therefore, the mandatory base header was designed to include flags which signal the existence of these optional headers. In comparison to the gBSD, the GSH is much less flexible. While almost anything can be described using a gBSD and in particular its marker attribute, the GSH does not offer
Table 5.5: SVC test content for GSH annotation

Spatial resolution (pixel) | Temporal resolution (FPS) | Quality (QP) | Bitrate (kbps)
176x144 | 7.5 | 40 | 19.728
176x144 | 15  | 40 | 29.844
176x144 | 30  | 40 | 43.620
176x144 | 7.5 | 33 | 97.032
176x144 | 15  | 33 | 120.324
176x144 | 30  | 33 | 151.212
352x288 | 7.5 | 40 | 291.804
352x288 | 15  | 40 | 369.924
352x288 | 30  | 40 | 469.692
352x288 | 7.5 | 33 | 474.852
352x288 | 15  | 33 | 580.788
352x288 | 30  | 33 | 713.556
704x576 | 7.5 | 40 | 1083.54
704x576 | 15  | 40 | 1348.992
704x576 | 30  | 40 | 1671.756
704x576 | 7.5 | 33 | 1571.208
704x576 | 15  | 33 | 1913.064
704x576 | 30  | 33 | 2324.112
such flexibility. While there is room for one additional header field, which may be signaled through the currently unused flag in the base header, this represents a clear limitation of the GSH.
5.4 Examples of describing the scalability features of a codec with the GSH
Below we provide concrete examples of how various scalable media streams can be described using the GSH.
5.4.1 MPEG-4 Scalable Video Codec
For this example we assume an SVC stream with three temporal (7.5, 15 and 30 FPS), three spatial (QCIF, CIF and 4CIF) and two quality (quantization parameters of 33 and 40) layers. The characteristics of this test stream are given in Table 5.5. Since we assume that this content is an action movie with fast scenes, the temporal resolution is most important. Therefore, for a given resolution, an adaptation node shall
Figure 5.1: SVC fully scalable bitstream representation with adaptation paths

first adapt in the quality dimension and only at the end in the temporal dimension. This results in three different adaptation paths, depending on the initial temporal resolution, e.g., depending on the terminal capabilities, as shown in Figure 5.1. The corresponding values for the GSH are shown in Table 5.6, where the pr id values 0-5, 6-11 and 12-17 indicate the three adaptation paths. As described in Section 2.2.2.5, the values of s id, t id and q id need to be analyzed in order to discover the end of an adaptation path, as predetermined by the intra-layer and inter-layer dependencies. The values of layer avg bitrate are set in accordance with these adaptation paths, i.e., according to the order in which enhancement layers shall be dropped. Additionally, the values of s id, t id and q id are set in accordance with the specification (and the mappings for s id and t id) in Section 5.3.2. Accordingly, the values for max t, max s, max q and max p are set to 8, 3, 1 and 17, respectively. The header fields described in Sections 5.3.6, 5.3.7 and 5.3.8 are not needed to describe this SVC content and are therefore not used.
5.4.2 MPEG-4 Advanced Audio Coding: Bit Slice Arithmetic Coding
For this example we assume a BSAC stream with a base layer and one enhancement layer which can be truncated in a fine-granular way, i.e., at each layer element. The characteristics of this test stream are given in Table 5.7. Since this particular codec only supports FGS scalability in the quality domain, the
Table 5.6: GSH values for SVC test content

s id | t id | q id | pr id | D | decoded layer avg bitrate | layer avg bitrate
1    | 3    | 0    | 0     | 0 | 19.728   | 19.728
1    | 5    | 0    | 6     | 0 | 29.844   | 29.844
1    | 8    | 0    | 12    | 0 | 43.62    | 43.62
1    | 3    | 1    | 1     | 0 | 97.032   | 77.304
1    | 5    | 1    | 7     | 0 | 120.324  | 90.48
1    | 8    | 1    | 13    | 0 | 151.212  | 107.592
2    | 3    | 0    | 2     | 0 | 291.804  | 194.772
2    | 5    | 0    | 8     | 0 | 369.924  | 249.6
2    | 8    | 0    | 14    | 0 | 469.692  | 318.48
2    | 3    | 1    | 3     | 0 | 474.852  | 183.048
2    | 5    | 1    | 9     | 0 | 580.788  | 210.864
2    | 8    | 1    | 15    | 0 | 713.556  | 243.864
3    | 3    | 0    | 4     | 0 | 1083.54  | 608.688
3    | 5    | 0    | 10    | 0 | 1348.992 | 768.204
3    | 8    | 0    | 16    | 0 | 1671.756 | 958.2
3    | 3    | 1    | 5     | 0 | 1571.208 | 487.668
3    | 5    | 1    | 11    | 0 | 1913.064 | 564.072
3    | 8    | 1    | 17    | 1 | 2324.112 | 652.356
Table 5.7: BSAC test content for GSH mapping

Layer type        | Temporal resolution (FPS) | Bitrate (kbps)
Base layer        | 30                        | 11.484
Enhancement layer | 30                        | 77.813
pr id, t id, s id and q id are set to 0. For this codec the headers specified in Sections 5.3.6, 5.3.7 and 5.3.8 are used and the flags in the base header are set accordingly.

Listing 5.9: Adaptation-related headers of a BSAC sample

 Byte 0          Byte 1          Byte 2          Byte 3
+---------------+---------------+---------------+---------------+
|0|1|2|3|4|5|6|7|0|1|2|3|4|5|6|7|0|1|2|3|4|5|6|7|0|1|2|3|4|5|6|7|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|     frame length    | h leng|s|  top layer|        ...        |
+---------------+---------------+---------------+---------------+
The adaptation-related headers of a BSAC sample are shown in Listing 5.9; the values for the first two GSH headers are shown in Listings 5.10 and 5.11. As introduced in Section 2.2.4, the frame length header is located right at the start of the BSAC sample, thus both the byteAddress and off fields of the update data length field are set to 0. Furthermore, the frame length header field has a length of 11 bits, therefore the len field is set to 11. The top layer header field starts at byte 2 of the BSAC sample and has a length of 6 bits. The fields of the update truncation points field shown in Listing 5.11 are set accordingly, i.e., byteAddress is 2, off is 0 and len is 6. An adaptation node which is aware of the syntax and semantics of the GSH would therefore process the GSH update data length and update truncation points fields, thus updating the BSAC header fields (frame length and top layer) accordingly in order to produce a valid bitstream.

Listing 5.10: BSAC Update Data Length example

 Byte 0          Byte 1          Byte 2
+---------------+---------------+---------------+
|0|1|2|3|4|5|6|7|0|1|2|3|4|5|6|7|0|1|2|3|4|5|6|7|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|       0       |       0       |       11      |
+---------------+---------------+---------------+
Listing 5.11: BSAC Update Truncation Points example

 Byte 0          Byte 1          Byte 2
+---------------+---------------+---------------+
|0|1|2|3|4|5|6|7|0|1|2|3|4|5|6|7|0|1|2|3|4|5|6|7|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|       2       |       0       |       6       |
+---------------+---------------+---------------+
In contrast to the two headers above, the values of the FGS header are not constant, since the size of the enhancement layer varies depending on the audio to be encoded. An
example Truncation Points Location header for a specific BSAC sample is shown in Listing 5.12. As specified in Section 5.3.8, the Truncation Points Location header is terminated by two zero bytes.

Listing 5.12: BSAC Truncation Points Location example

 Byte 0          Byte 1
+---------------+---------------+
|0|1|2|3|4|5|6|7|0|1|2|3|4|5|6|7|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|               64              |
+---------------+---------------+
|               92              |
+---------------+---------------+
|             .....             |
+---------------+---------------+
|              328              |
+---------------+---------------+
|               0               |
+---------------+---------------+
5.4.3 MPEG-4 Visual Elementary Streams
For this example we assume a VES stream with the GoP structure IBBBBBBBBB. The characteristics of this test stream are given in Table 5.8. Note that, unlike SVC, VES does not natively support such a temporal hierarchy. However, since no other picture depends on B pictures, they can be discarded in any order. Thus, by discarding every second B picture, the hierarchy shown in Table 5.8 is created. Since there is only one scalability dimension available in the content, there is also only
Table 5.8: VES test content for GSH mapping

Layer nr. | Spatial resolution (pixel) | Temporal resolution (FPS) | Bitrate (kbps)
0         | 352x288                    | 3.75                      | 251.689
1         | 352x288                    | 7.5                       | 319.713
2         | 352x288                    | 15                        | 478.436
3         | 352x288                    | 30                        | 795.881
Table 5.9: GSH values for VES test content

Layer nr. | s id | t id | q id | pr id | D | decoded layer avg bitrate | layer avg bitrate
0         | 2    | 1    | 0    | 1     | 1 | 251.689                   | 251.689
1         | 2    | 3    | 0    | 2     | 1 | 319.713                   | 68.024
2         | 2    | 5    | 0    | 3     | 1 | 478.436                   | 158.723
3         | 2    | 8    | 0    | 4     | 1 | 795.881                   | 317.445
one adaptation path, which is described by the pr id. This adaptation path is shown in Table 5.9. The values of layer avg bitrate are set in accordance with this adaptation path. Additionally, the values of t id are set in accordance with the specification in Section 5.3.2. Accordingly, the values for max t, max s, max q and max p are set to 8, 2, 0 and 4, respectively. The header fields described in Sections 5.3.6, 5.3.7 and 5.3.8 are not needed to describe this VES content and are therefore not used.
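The layer assignment behind Table 5.9 can be sketched as follows. Note that this illustrative fragment assumes an eight-picture GoP (I plus seven B pictures), which is what the 3.75/7.5/15/30 FPS layering of Table 5.8 implies; the function name is ours.

```c
/* Temporal layer of a picture from its position within an 8-picture GoP
 * (I picture at position 0). Dropping all pictures of layer 3, then
 * layer 2, etc. implements the "discard every second B picture" hierarchy. */
int ves_temporal_layer(int pos)       /* pos in 0..7 */
{
    if (pos == 0)
        return 0;                     /* I picture: base layer */
    int layer = 3;                    /* odd positions are dropped first */
    while ((pos & 1) == 0) {
        layer--;                      /* each halving keeps the     */
        pos >>= 1;                    /* even-positioned B pictures */
    }
    return layer;
}

/* t_id value per layer, following Table 5.9 */
static const int ves_t_id[4] = { 1, 3, 5, 8 };
```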
5.4.4 Summary
In this section we showed, by example, how the scalability features of different types of media codecs (which vary in their scalability capabilities) can be described using the GSH.
5.5 Enabling different types of adaptation nodes with the GSH
In this section we illustrate how the GSH supports three different types of adaptation nodes, i.e., a stateless adaptation node, a stateful adaptation node and an application-aware adaptation node. These types of adaptation nodes were already introduced in Section 2.4 and are concretized here for the GSH.
5.5.1 Stateless adaptation node
Stateless network devices are not very intelligent or flexible; however, they provide good performance. Their main task is to receive a packet from the input network, process the packet in some way and then forward the packet to the output network. In case there is too much traffic from the input network, the buffer of the device will fill up and eventually overflow. At this stage random packets get lost, which can have serious impact on the quality experienced by the end user, and not only in the case of media packets. Traditionally, the tail drop algorithm is used, which simply drops, as its name indicates, any new packets which cannot be buffered anymore. However, this is considered unfair and a more balanced solution was desired. This led to the Random Early Detection (RED) algorithm [86]. RED drops packets based on probabilities, i.e., as the number of packets in the buffer increases, the probability of dropping an arriving packet increases as well. While this is fairer among different traffic flows than the simple tail drop algorithm, it still drops random packets without considering their importance.

A GSH-aware network device, i.e., an adaptation node, on the other hand, could use information from the GSH to counter the buffer overflow problem in an intelligent way: instead of dropping random packets, it could extend RED by investigating the GSH and only dropping those incoming packets which RED selects for dropping and which have the D flag set. In this way, only media packets which are not necessary for the decoding process are dropped, and the quality experienced by the end user degrades to a lesser extent. Alternatively (or additionally) the scalability unit may be truncated by investigating the Truncation Points Location header. If any of the update fields is present, then they need to be taken into consideration.
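The proposed extension can be sketched as follows; the thresholds, the maximum drop probability and the function names are illustrative, not taken from [86]:

```c
/* Classic RED drop probability: 0 below min_th, 1 at and above max_th,
 * rising linearly towards max_p in between (gentle RED variants omitted
 * for brevity). */
double red_drop_prob(double avg_q, double min_th, double max_th, double max_p)
{
    if (avg_q < min_th)
        return 0.0;
    if (avg_q >= max_th)
        return 1.0;
    return max_p * (avg_q - min_th) / (max_th - min_th);
}

/* GSH-aware gate: a packet is only discarded if RED selects it for
 * dropping AND its D flag is set, i.e., no other scalability unit
 * depends on it. `rnd` is a uniform random number in [0,1). */
int gsh_red_drop(double avg_q, int d_flag, double rnd)
{
    double p = red_drop_prob(avg_q, 20.0, 60.0, 0.1); /* example thresholds */
    return d_flag && rnd < p;
}
```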
5.5.2 Stateful adaptation node
While a stateless adaptation node has no state, i.e., no memory, a stateful adaptation node does. This means that it can, for example, remember which adaptation decisions it applied to packets of the current stream in the past. This memory can be used not only to store previous adaptation decisions, but also to make the adaptation node session-aware by remembering the source IP address, destination IP address and port(s) of the current session. This ability makes such an adaptation node much more flexible with regard to adaptation options. A stateful, GSH-aware adaptation node can of course still use the D flag to perform
simple adaptation of a media stream. However, it can also follow (one of) the adaptation path(s) provided by the pr id field. In this case it would set its pr id limit (which is stored in the memory specific to the media stream) to max p + 1. In case of problems (e.g., a buffer limit or an alarm from a dedicated bandwidth measurement component) it can reduce the pr id limit by 1 and therefore disregard all packets with the highest pr id. If the problems persist, it can further reduce the pr id limit until the end of the adaptation path is reached, at which point no further adaptation is possible. Conversely, if the problems disappear, it can start raising the pr id limit again. Note that in order to detect the end of the adaptation path, the adaptation node also has to remember the t id, s id and q id of the current adaptation path (as described in Sections 5.3.2 and 5.4.1). Alternatively (or additionally) the scalability unit may be truncated by investigating the Truncation Points Location header. If any of the update fields are present, then they need to be taken into consideration.
5.5.3 Application-aware adaptation node
An application-aware adaptation node includes the features of the stateful and stateless adaptation nodes. In addition, it has dedicated knowledge of the intended receiver(s) of a media stream and of the currently available bandwidth towards the receiver(s), e.g., through RTCP feedback. Furthermore, it might include application-layer functionality, such as the application-layer multicast which is presented in the use case in Section 3.2. This requires such an adaptation node to act as a regular proxy, i.e., as a server to the receiver and as a client to the upstream server. Such an adaptation node is able to benefit from the full description of the GSH. In particular, it can analyze the Bitrate Info header. This gives it a very exact idea of the current bitrate of the complete stream and of the bitrate of the current layer, so that it can take sophisticated adaptation decisions, e.g., based on the available bandwidth. It can also take adaptation decisions based on other usage environment characteristics, such as the display resolution of the terminal or its processing power, in order to adjust the stream based on the s id, t id and q id.
5.5.4 Summary
In this section we described how the different types of adaptation nodes which we introduced in Section 2.4 are supported by the GSH.
Figure 5.2: Codec-agnostic GSH-based adaptation in an intercepting adaptation proxy
5.6 Using the Generic Scalability Header to enable dynamic and distributed adaptation in an intercepting adaptation proxy
We implemented an adaptation node based on the concepts of an intercepting adaptation proxy, which corresponds to the adaptation nodes introduced in Sections 2.4.3 and 5.5.2. The motivation behind this was not only to show the feasibility of doing so, but also to measure the performance of such an adaptation proxy, as described in Section 6.6. This adaptation framework has been tested for the adaptation of

• MPEG-4 VES (see Sections 2.2.5 and 5.4.3),
• MPEG-4 SVC (see Sections 2.2.2 and 5.4.1),
• MPEG-4 BSAC (see Sections 2.2.4 and 5.4.2).

Figure 5.2 shows the architecture of this adaptation framework; a walkthrough looks as follows.

1. The server reads a sample of the bitstream from the hard disk (NALU, frame, VOP).
2. The server generates a GSH for the sample.
3. The server packetizes the sample including the GSH into a UDP packet.
4. The server sends the sample as a UDP packet towards the terminal in a time-aware fashion.
5. The adaptation proxy captures the packet based on Linux firewall mechanisms.
6. The adaptation proxy parses and interprets the GSH.
7. The adaptation proxy modifies the packet's data (truncation of NALUs, VOPs, or BSAC enhancement layers) based on the GSH.
8. The adaptation proxy updates the UDP header according to the modified data (checksum, length).
9. The adaptation proxy forwards the modified media packet to the client (based on Linux firewall mechanisms) or drops the complete packet (based on the adaptation decision).
10. The client stores the adapted bitstream to the hard disk, where it can be decoded to validate the adaptation.

As mentioned in the walkthrough, we use Linux firewall mechanisms to capture packets and to forward adapted packets (or drop them) as described in Section 4.5.4. By relying on such an optimized and proven mechanism we can assume that the overhead introduced by it is minimal. Finally, we note that since this adaptation node is transparent, no direct usage environment feedback from the terminal is possible. The adaptation decision can therefore only be based on aggregated characteristics (as also described in Sections 2.4.3 and 5.5.2) such as:

• preconfigured knowledge of the usage environment, e.g., the maximum display size of terminals connected to the network which the proxy serves,
• the buffer size of the adaptation proxy,
• the currently available bandwidth as provided by a bandwidth measurement tool.
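Step 8 of the walkthrough above can be sketched as follows. For simplicity, this illustrative fragment zeroes the UDP checksum, which is permitted for UDP over IPv4 and signals that no checksum was computed; a production proxy would rather recompute the checksum incrementally.

```c
#include <stddef.h>
#include <stdint.h>

/* Rewrite the UDP length field after the payload was truncated and zero
 * the checksum ("not computed", valid for IPv4 only). udp_hdr points at
 * the 8-byte UDP header. */
void udp_update_after_truncation(uint8_t *udp_hdr, size_t new_payload_len)
{
    uint16_t len = (uint16_t)(8 + new_payload_len);  /* header + payload  */
    udp_hdr[4] = (uint8_t)(len >> 8);                /* length, big endian */
    udp_hdr[5] = (uint8_t)(len & 0xFF);
    udp_hdr[6] = 0;                                  /* checksum = 0 means */
    udp_hdr[7] = 0;                                  /* "no checksum"      */
}
```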
5.7 Conclusions and original contributions
In this chapter we introduced the GSH, which represents a novel, adaptation-oriented mechanism for the codec-agnostic adaptation of scalable media. In order to validate this novel mechanism, we showed how it can be used to describe the scalability features of three different types of scalable media. Additionally, we described three different types of adaptation
nodes (as originally introduced in Section 2.4) based on the GSH. Finally, we implemented an adaptation node based on the GSH and tested it for various types of scalable media.
Part III
Evaluation and discussion
CHAPTER 6
Evaluation and comparison
6.1 Introduction
The first three sections of this chapter quantitatively evaluate our enhancements to the gBSD-based adaptation approach. First, we evaluate the Streaming Instructions by measuring the performance of the media and XML fragmenters which process the Streaming Instructions. Second, we evaluate different means for encoding/compression of the gBSD PUs, which is important to keep the metadata overhead low for in-network adaptation scenarios. Third, we compare the three different approaches to gBSD transformation, i.e., XSLT-based transformation, STX-based transformation and our new approach based on regular expressions.

Section 6.5 evaluates the performance of the adaptation proxy which we introduced in Section 4.5.3. As described there, in server mode this adaptation proxy uses the XML and media fragmenters to process the Streaming Instructions and BiM to encode the gBSD PUs. The encoded gBSD PUs are then streamed in synchronization with the media as described in Sections 4.4.2 and 4.4.3. As such, the adaptation server which is evaluated in Section 6.5 integrates the mechanisms evaluated in the first two sections of this chapter into an adaptation server in order to enable in-network gBSD-based adaptation.

Finally, Section 6.6 compares the performance of the gBSD-based and GSH-based intercepting adaptation proxies which are introduced in Section 4.5.4 and Section 5.6, respectively. Additionally, a codec-specific intercepting adaptation proxy is introduced for this comparison in order to measure the cost of codec-agnostic adaptation compared to codec-specific adaptation. Subsequently, we also evaluate how much CPU load the different adaptation-related tasks of the gBSD-based proxy (i.e., gBSD PU decoding/decompression,
CHAPTER 6. EVALUATION AND COMPARISON
gBSD PU transformation, gBSDtoBin and gBSD PU encoding/compression) require, given the optimizations which we describe in Section 4.5.4. In Section 6.7 we conclude this chapter by discussing the evaluation results.
6.2 Fragmentation of media and metadata
In this section we evaluate the Streaming Instructions processors which we introduced in Section 4.3 with regard to their performance, i.e., their maximum throughput.
6.2.1 Test setup
The Streaming Instructions processors, i.e., the media and XML fragmenters, were implemented in C++ as stand-alone applications. The libxml XmlTextReader interface [1] (an XML pull parser) was chosen for accessing the XML information, in contrast to the more traditional DOM and SAX interfaces. DOM was not selected because it requires loading the complete XML document into memory for processing, which is inefficient given the size of our gBSDs. SAX, on the other hand, is very capable of processing large XML documents because of its event-based nature; however, its push interface makes it complex to implement against. The XmlTextReader combines the benefits of DOM (i.e., a simple, pull-based API) and SAX (i.e., efficient parsing of large XML documents) and was therefore chosen for this implementation. These tests were performed on a Dell Optiplex GX620 desktop PC with an Intel Pentium D 2.8 GHz processor and 1024 MB of RAM using Windows XP SP2 as the operating system. Time measurements were performed using the ANSI-C clock method. Table 6.1 provides an overview of the test data. Media and the corresponding gBSDs for three of the previously introduced media codecs (see Section 2.2), i.e., BSAC, EZBC and SVC, were selected. The considerable size differences between the SVC and the EZBC content (both for media and metadata) are due to the fact that the EZBC content was encoded with 6 spatial layers while the SVC content was encoded with only a single spatial layer. Both for SVC and EZBC, each PU describes a complete GoP which includes 16 pictures (i.e., AUs). For BSAC, a PU describes a single AU. This results in significantly different sizes for the gBSD PUs and the corresponding media fragments (i.e., AUs for BSAC and GoPs for EZBC/SVC), which allows us to measure how the performance of the media
[1] Libxml2 XmlTextReader Interface, http://xmlsoft.org/xmlreader.html
Table 6.1: Characteristics of test data for Streaming Instructions evaluation

                          MPEG-4 BSAC       EZBC   MPEG-4 SVC
Media size [kB]                 284.3   13652.92      1282.90
Average AU size [kB]             0.22        N/A          N/A
Average GoP size [kB]             N/A     197.87        18.59
gBSD size [kB]                1533.98    4428.07          310
Average PU size [kB]             1.80      65.17         5.52
Number of AUs                    1275        N/A          N/A
Number of GoPs                    N/A         69           69
Number of PUs                    1275         69           69
Resolution                        N/A       QCIF         QCIF
Frame rate [fps]                   30       12.5         12.5
and XML fragmenters is influenced by the size of the PUs and media fragments. For our tests, the gBSD is provided in the uncompressed domain, as this is where the fragmenters operate. We used Streaming Instructions embedded as attributes in the gBSD. These gBSDs are processed as described in Section 4.3 in order to generate the PUs and extract the corresponding media fragments.
For the XML fragmenter, the measurements cover:
1. Parsing the gBSD from a file using the libxml XmlTextReader interface.
2. Composing PUs based on Streaming Instructions attributes in the gBSD as described in Section 4.3.1.
For the media fragmenter, the measurements cover:
1. Parsing the gBSD PU from an in-memory buffer using the libxml XmlTextReader.
2. Extracting AUs from the media content, which is held in an in-memory buffer, as described in Section 4.3.2.
Note that the scope of these measurements was chosen to closely resemble the intended tasks of the media and XML fragmenters as described in Section 4.5, where the media fragmenter is part of the gBSDtoBin process, which receives the transformed PU and extracts the media fragment from the bitstream according to the Media Streaming Instructions. For the experiment, the content (both media and metadata) was concatenated 10 times in order to obtain test sequences of a reasonable length. Additionally, each test run
was repeated 20 times, and the results from the first 200 PUs/AUs/GoPs were ignored in order to avoid any deviations caused by program initialization. The results show the mean values for all test runs.
Figure 6.1: Maximum throughput of the media and XML fragmenters
6.2.2 Results
For the XML fragmenter, the performance results depicted in Figure 6.1 show 639 PUs/s for BSAC, 314 PUs/s for SVC and 40 PUs/s for EZBC. The standard deviations of the results for the XML fragmenter are below 5.2 percent of the mean value. For the media fragmenter, the performance results show 2215 AUs/s for BSAC, 286 GoPs/s for SVC and 47 GoPs/s for EZBC. The standard deviations of the results for the media fragmenter are below 2.7 percent of the mean value. Since each SVC/EZBC GoP includes 16 pictures (i.e., a frame rate of 30 fps roughly corresponds to two GoPs per second), these results show good real-time performance, enabling the processing of at least 20 concurrent streams of our test content at 30 fps.
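The headroom behind this claim can be checked with a quick calculation; EZBC is the limiting case, with 40 PUs/s in the XML fragmenter (the helper function name is ours):

```python
def max_concurrent_streams(units_per_second: float,
                           pictures_per_unit: int,
                           fps: float) -> int:
    # A stream at `fps` consumes fps / pictures_per_unit units (GoPs or PUs)
    # per second; integer division gives the sustainable stream count.
    return int(units_per_second / (fps / pictures_per_unit))

# EZBC, the slowest case: the XML fragmenter sustains 40 PUs/s,
# and one PU describes a 16-picture GoP.
streams = max_concurrent_streams(40, 16, 30.0)
```

This yields 21 streams, consistent with the "at least 20 concurrent streams" figure above.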
6.3 Compression of metadata for transport
In this section we evaluate different techniques for the compression of metadata in order to enable its efficient transport.
6.3.1 Test setup
The test setup with regard to hardware, operating system and test content for this evaluation corresponds to what has been described in Section 6.2.1. Three different compression methods are used, i.e., the generic WinZip compressor [2], which uses a hybrid of the LZ77 and Huffman coding algorithms, XML-aware XMLPPM compression [3], and the XML-specific BiM reference software (with the zLib-optimized codec for strings and binary context path encoding enabled) [51]. This selection of compression mechanisms allows us to assess how they compare to each other for the application scenario of compressing/encoding gBSD PUs. Note that we only consider the compression factor, i.e., the uncompressed file size divided by the compressed file size, and not the runtime performance. We do evaluate the runtime performance of compression mechanisms in Section 6.6. Each test run was performed for all PUs of the respective test content and was not repeated, since the results are deterministic. The results show the mean values for all PUs of the test content.
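The compression-factor metric itself is straightforward; the following sketch computes it with Python's zlib as a stand-in for the evaluated tools (WinZip, XMLPPM and BiM are external applications), and the repeated gBSDUnit snippet is purely illustrative:

```python
import zlib

def compression_factor(data: bytes, level: int = 9) -> float:
    # Factor = uncompressed size / compressed size (larger is better).
    return len(data) / len(zlib.compress(data, level))

# Illustrative stand-in for a gBSD PU: short, highly redundant XML.
pu = b"<gBSDUnit start='0' length='1024' marker='T1:L0:Q0'/>" * 100
factor = compression_factor(pu)
```

Note that BiM is an encoding rather than a plain compressor, so its factor is not directly comparable to a zlib-style result; the sketch only illustrates the metric.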
6.3.2 Results
The results in Figure 6.2 show the compression factors achieved by each compression mechanism for each test content. As can be seen, BiM performs best for BSAC and SVC with compression factors of 10.36 and 10.28, respectively, i.e., the encoded PU is about 10 times smaller than the unencoded PU. For larger XML files, BiM falls behind, as can be seen from the EZBC test case, where XMLPPM achieves a better performance than BiM with a compression factor of 6.57. The lower compression factor of BiM for the EZBC content is due to the fact that the BiM encoding mechanism cannot remove redundancy as efficiently as the other mechanisms. This is because it does not simply compress the input file but encodes it, which enables its processing in the binary domain [87][88]. The standard deviations of the results for the compression experiment are below 2.5 percent of the mean value. We can conclude from these measurements that BiM provides the highest compression factor for smaller PUs, where the redundancy which the other mechanisms remove more efficiently is low. For larger PUs, BiM therefore provides a smaller compression factor than
[2] WinZip v11, http://www.winzip.com
[3] XMLPPM v0.98.2, http://xmlppm.sourceforge.net
the other mechanisms. For further work on the evaluation of mechanisms for XML compression, we refer to [74][75][73].
Figure 6.2: Compression factors for gBSD PUs
6.4 Transformation of metadata
In this section we evaluate the transformation step of the gBSD-based adaptation approach. As introduced in Section 3.3, originally the complete gBSD is transformed in accordance with the adaptation decision. Our extensions to the gBSD-based adaptation approach mandate that this transformation is performed for each gBSD PU, rather than only once for the complete gBSD, which results in a high number of transformations that depends on the adaptation granularity. In order to minimize the overhead caused by this high number of transformations, we introduced a novel transformation approach based on regular expressions (as described in Section 4.5.4.3). In this section we compare this approach with traditional XML transformation approaches, i.e., STX and XSLT, as introduced in Section 3.3.
6.4.1 Test setup
All tests were performed on a Dell Optiplex GX620 desktop with an Intel Pentium D 2.8 GHz processor and 1024 MB of RAM using Fedora Core 6 Linux with Kernel version 2.6.20
Table 6.2: Characteristics of test data for gBSD transformation evaluation

                               NALU      AU     GoP   Complete metadata
gBSD (PU) size (BSAC) [kB]      N/A    1.22     N/A             2528.73
gBSD (PU) size (SVC) [kB]      0.44    0.97    8.38             1505.12
as the operating system. Memory consumption was measured using the process status (ps) tool and time measurements were performed based on the gettimeofday method. We use gBSD PUs for SVC and BSAC in this comparison. In order to cover the different possible adaptation granularities, we consider NALU, AU and GoP (with 16 pictures) granularity for SVC, i.e., a PU describes a single NALU, an AU or a complete GoP. For BSAC we can only consider AU granularity, since there is no concept of NALUs or GoPs in BSAC. Additionally, we consider a complete gBSD for BSAC and SVC in order to cover the original gBSD-based adaptation approach. In this case the complete gBSD describes 3000 AUs/GoPs. Table 6.2 provides an overview of the test content, showing the sizes of the different gBSDs (PUs); in the case of PUs it shows the average size. In this test we compare the time and memory needed for transforming the gBSDs (PUs). For STX we use Joost [4], for XSLT we use libxslt [5] and for regular expressions we rely on the Boost regular expressions library [6] to apply the transformation. In contrast to STX, multiple implementations of XSLT and regular expressions are available; we have chosen what we believe are the most mature and most commonly used ones. We only measure the time needed for the actual transformation and ignore any start-up overhead (including, e.g., parsing the style sheet or the regular expression), since its contribution to the overall CPU load is minor in dynamic adaptation scenarios, where the style sheet / regular expression is parsed only once and then applied many times to the individual PUs. We repeated all tests 500 times and only used the last 100 test runs for our measurements in order to avoid any deviations caused by program startup.
The particularly high number of ignored test runs was necessary because the Java-based Joost processor showed considerable startup variations for a significant time. Additionally, for each test case, we measured the performance for disregarding all gBSDUnits (dropall), disregarding
[4] Joost version 2007-07-18, http://joost.sourceforge.net
[5] libxslt version 1.1.21, http://xmlsoft.org/XSLT
[6] Boost.Regex version 1.33.1, http://www.boost.org/libs/regex/doc/
no gBSDUnit (dropnothing) and disregarding half of the gBSDUnits (drophalf) in order to cover different adaptation cases. The results show the mean values for all test runs.
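Our regular-expression-based transformation (Section 4.5.4.3) essentially deletes the gBSDUnit elements that the adaptation decision discards. The following sketch illustrates the idea with Python's re standing in for the Boost.Regex library used in the implementation; the element and attribute names are illustrative, not the exact gBSD syntax:

```python
import re

# Illustrative PU with one gBSDUnit per temporal layer.
pu = ("<dia:gBSDUnit marker='T0' start='0' length='120'/>"
      "<dia:gBSDUnit marker='T1' start='120' length='80'/>"
      "<dia:gBSDUnit marker='T2' start='200' length='60'/>")

# One precompiled expression per adaptation decision: compiled once,
# then applied to every PU of the stream.
drop_above_t0 = re.compile(r"<dia:gBSDUnit marker='T[1-9]'[^>]*/>")

def transform(pu_text: str, pattern: re.Pattern) -> str:
    # dropall / drophalf / dropnothing reduce to how many units the
    # pattern matches in the PU.
    return pattern.sub("", pu_text)
```

Because the pattern is compiled once and each PU is a small flat string, this avoids building any XML tree, which is where the speedup over XSLT and STX comes from.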
6.4.2 Results
Figures 6.3, 6.4, 6.5 and 6.6 show the results for NALU, AU, GoP and 3000 AU granularity, respectively. As can be seen, regular expressions increase transformation performance by at least a factor of 4 compared to the other approaches. Since the transformation task takes a considerable amount of time compared to the other tasks of gBSD-based adaptation, the usage of regular expressions therefore significantly increases the throughput of the adaptation node. Additionally, the measurements show that for small gBSD PUs (i.e., NALU and AU granularity) the XSLT mechanism performs significantly better than the STX mechanism. However, for larger gBSDs, STX performs better than XSLT, which is particularly apparent for the gBSD describing 3000 AUs. This is due to the event-based approach of STX, which does not require keeping the complete XML document in memory. The break-even point between STX and XSLT performance is at 60 kB / 74 AUs for BSAC and 4.8 kB / 8 AUs for SVC. Apparently the additional update operations for BSAC (as discussed in Sections 2.2.4 and 5.4.2), which also need to be performed on the gBSD PU, are the cause of this difference. The standard deviation of the results for the transformation experiment is below 3.2 percent of the mean value.
Memory consumption for small gBSD PUs is insignificant; however, for larger gBSDs, such as our gBSD example with 3000 AUs, memory consumption becomes significant for XSLT (i.e., 59.5 MB for SVC and 111.7 MB for BSAC) and slows down its processing considerably, as can be seen in Figure 6.6. Generally, regular expressions again perform best with regard to memory consumption. Based on this quantitative evaluation, we can conclude that our approach of transforming gBSDs based on regular expressions results in better performance (both in terms of throughput and memory consumption) than the alternatives.
6.5 gBSD-based adaptation proxy
6.5.1 Test setup
The test setup with regard to hardware, operating system and the test content for this evaluation corresponds to what has been described in Section 6.2.1.
Figure 6.3: Time needed to transform a PU which describes an SVC NALU
Figure 6.4: Time needed to transform a PU which describes an SVC/BSAC AU
Figure 6.5: Time needed to transform a PU which describes an SVC GoP
Figure 6.6: Time needed to transform a gBSD which describes 3000 SVC/BSAC AUs
We evaluate the performance of the adaptation proxy which operates in server mode, as depicted in Figure 4.8 and on the left side of Figure 4.9. These measurements cover a complete adaptation server as introduced in Section 4.5.1 and detailed in Section 4.5.3. The memory utilization and CPU load of our adaptation server are measured in order to find out how many concurrent streams it is able to process. To this end, we access a single content (consisting of a media stream and a gBSD), fragment it according to the Streaming Instructions and adapt, packetize and stream it to the player on the end device. We then access another content, and so on, until five contents (five media streams and five gBSD streams) are being processed and delivered concurrently. There is a single content (i.e., a media and a metadata stream) being processed for the first 40 seconds; then there are two contents until second 80; and so on, with a new content being added and processed in parallel every 40 seconds. Note that our approach of using regular expressions for the transformation step was developed after the implementation of this prototype and is therefore not considered here. However, we consider it as an optimization in our evaluations in Section 6.6. The transformation in this implementation is performed using XSLT, as described in Section 3.3. Each test run was repeated 20 times. The server was restarted after each test run. The first request for a stream was sent 30 seconds after restarting the server in order to avoid any deviations caused by program initialization. We performed these measurements by probing the memory consumption and CPU load once per second using a dedicated measurement tool [7]. The figures show the moving average of these measuring points. The moving average
[7] Freemeter Professional v2.8.2, http://www.tiler.com/freemeter/
was used to smooth out deviations. As can be seen from the figures, deviations are still very prominent. This is due to the complex implementation of this adaptation server, which, as described in Section 4.5.3, is based on the feature-rich Darwin streaming server that provides full session management and a multi-threaded implementation. Thus, these measurements can only be used to analyse the overall performance of the adaptation server. For a more comprehensive analysis we refer to Section 6.6, in which we compare the different adaptation mechanisms presented in this thesis based on a much simpler architecture, which enables more detailed measurements.
Figure 6.7: gBSD-driven adaptation of 1 to 5 BSAC streams: memory utilization and CPU load
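The smoothing applied to the once-per-second probes can be sketched as a windowed moving average; the window length below is an assumption for illustration, as the thesis does not state the value used:

```python
from collections import deque

def moving_average(samples, window=5):
    # Smooths a series of per-second CPU/memory probes. The deque keeps
    # at most `window` recent samples; early outputs average a shorter
    # prefix so the result has the same length as the input.
    buf, out = deque(maxlen=window), []
    for s in samples:
        buf.append(s)
        out.append(sum(buf) / len(buf))
    return out
```

Smoothing trades responsiveness for readability: the 40-second load steps remain visible while per-second jitter is damped.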
6.5.2 Results
Figures 6.7, 6.8 and 6.9 show the results of these tests for the BSAC, SVC and EZBC contents. Generally, one can see that even with 5 concurrent streams, our adaptation server’s CPU load is below 50 percent for BSAC and below 20 percent for the video contents. It could in fact process several more streams without reaching its performance limits. One can also see that many smaller packets (such as the samples for BSAC) put much more load on the adaptation server than fewer, larger packets (such as the EZBC/SVC GoPs). Each additional stream which is requested every 40 seconds puts additional load on the adaptation server, both in terms of CPU and memory. However, these jumps in CPU load and memory consumption are not consistent. For example, the CPU load for BSAC increases much less
Figure 6.8: gBSD-driven adaptation of 1 to 5 QCIF SVC streams: memory utilization and CPU load
Figure 6.9: gBSD-driven adaptation of 1 to 5 QCIF EZBC streams: memory utilization and CPU load
for the fourth stream (i.e., at second 120 in Figure 6.7) than for the other additional streams at seconds 40, 80 and 160. The reason for this is the complex implementation, as described above. We conclude from these measurements that the adaptation server is able to process multiple concurrent streams of our test content simultaneously.
6.6 gBSD-based and GSH-based intercepting adaptation proxies
This section compares the performance of the gBSD-based and GSH-based intercepting adaptation proxies which were introduced in Section 4.5.4 and Section 5.6, respectively. Additionally, a codec-specific intercepting adaptation proxy is introduced for this comparison in order to measure the cost of codec-agnostic adaptation compared to codec-specific adaptation. SVC, BSAC and VES contents are used for this evaluation. The results compare the performance of GSH-based adaptation, gBSD-based adaptation and codec-specific adaptation with regard to throughput and metadata overhead. Subsequently, we also evaluate how much CPU load the different adaptation-related tasks of the gBSD-based proxy (i.e., gBSD PU decoding/decompression, gBSD PU transformation, gBSDtoBin and gBSD PU encoding/compression) require, given the optimizations which we describe in Section 4.5.4.
6.6.1 Test setup
Our test setup for the test runs consists of three different types of modules, i.e.,
• a UDP server,
• adaptation proxies for gBSD-based, GSH-based and codec-specific adaptation, and
• a UDP client,
as described in Sections 4.5.4 and 5.6.
6.6.1.1 UDP server
The UDP server reads a media fragment from the hard disk, packetizes it into UDP packets and streams it at a specified frame rate towards the UDP client. For SVC, a media fragment refers to a single NALU, while for VES and BSAC a media fragment refers to an AU. While
this fine granularity for SVC results in additional processing and metadata overhead, it also enables the optimization described in Section 4.5.4.4, which in turn considerably increases processing performance, as evaluated in this section. As described in Section 4.5.4.4, when performing the adaptation at a fine granularity, the complete media fragment (e.g., a NALU) is either kept or dropped, and therefore the gBSDtoBin process can be avoided. To counter the additional metadata overhead caused by this fine granularity, we introduce dedicated aggregation mechanisms, which we detail below.
The server supports multiple concurrent media streams in order to test the throughput of the adaptation proxy. In the case of multiple media streams, we explicitly note that they are sent over the same socket. This corresponds to how most services on the Internet operate today, i.e., there is a dedicated port or port range which identifies the service (e.g., port 80 for HTTP). Note that RTP streams are an exception to this, as they do not have a default port but rather negotiate their ports through RTSP. The gBSD stream in the case of the gBSD-based adaptation proxy is always sent as a separate stream, as described in Section 4.4.3. The different streams can be separated based on their target IP addresses.
Two different approaches to packetization (codec-specific packetization and codec-agnostic packetization), which result in three different packetization modes, are supported in order to evaluate how the packet size influences the performance of each adaptation proxy. For codec-specific packetization, a packet contains 1) a single media fragment (i.e., an AU for VES/BSAC, a NALU for SVC) per UDP packet or 2) multiple media fragments which belong to the same scalability layer (i.e., NALUs of the same layer for SVC, B-VOPs for VES) up to a (configurable) maximum packet size.
The first mode aggregates (at most) a single media fragment into a packet, which, in the case of SVC, corresponds to the Single NAL Unit mode of the RTP SVC payload format [47]. The second mode aggregates all consecutive media fragments which belong to the same scalability layer into a single packet up to a maximum size (e.g., a maximum transmission unit of 1400 bytes). For example, all consecutive NALUs which belong to temporal layer four are aggregated up to a maximum size of 1400 bytes. These aggregated NALUs are from then on treated as a single media fragment which is either dropped or kept. As such, this adaptation-oriented aggregation mode increases adaptation throughput by reducing the number of media fragments, but at the same time reduces adaptation flexibility, since individual media fragments within such an aggregated media fragment can no longer be dropped. This aggregation mode also reduces metadata overhead, since only a single description (gBSD or GSH) is needed for all media
CHAPTER 6. EVALUATION AND COMPARISON
Page 137
fragments which are aggregated that way, as they belong to the same scalability layer. This aggregation mode depends on the order of NALUs/AUs in the byte stream. To illustrate this, we show the encoder output for four AUs of the content SVC 5 (see Table 6.3) in Listing 6.1. This content has several temporal, spatial and MGS layers. As can be seen from the T (temporal), L (spatial) and Q (quality) fields, each temporal layer has several (two or three) spatial enhancement layers and each spatial layer has three MGS enhancement layers. For our aggregation this means that aggregating in temporal and spatial modes is possible, since there are consecutive NALUs of the same temporal and spatial layer in the bitstream. Aggregation of MGS NALUs is never possible for our adaptation-oriented aggregation mode.

Listing 6.1: NALU order for an SVC content with temporal, spatial and MGS layers
AU 2888 : B T1 L0 Q0 QP 36 Y 31.0947 U 35.8245 V 34.1739  5904 bit
   2888 : B T1 L0 Q1 QP 31 Y 34.7735 U 38.3650 V 36.7852  8320 bit
   2888 : B T1 L0 Q2 QP 23 Y 40.4087 U 42.5715 V 41.5386 25192 bit
   2888 : B T1 L1 Q0 QP 38 Y 34.3478 U 39.9394 V 38.4406  1520 bit
   2888 : B T1 L1 Q1 QP 31 Y 36.5089 U 40.5172 V 39.1925 11080 bit
   2888 : B T1 L1 Q2 QP 23 Y 41.3303 U 43.4340 V 43.3538 62760 bit
   2888 : B T1 L2 Q0 QP 38 Y 35.6110 U 40.6215 V 41.0092  5064 bit
   2888 : B T1 L2 Q1 QP 31 Y 37.4260 U 40.9564 V 41.6267 29472 bit
AU 2884 : B T2 L0 Q0 QP 37 Y 30.9536 U 36.6715 V 34.3300  2792 bit
   2884 : B T2 L0 Q1 QP 32 Y 34.5372 U 38.7541 V 36.8106  4448 bit
   2884 : B T2 L0 Q2 QP 24 Y 39.9140 U 42.7840 V 41.3182 14328 bit
   2884 : B T2 L1 Q0 QP 39 Y 34.9059 U 40.4921 V 38.7581   896 bit
   2884 : B T2 L1 Q1 QP 32 Y 36.4778 U 40.9378 V 39.3625  5176 bit
   2884 : B T2 L1 Q2 QP 24 Y 40.7522 U 43.3363 V 43.0326 38464 bit
   2884 : B T2 L2 Q0 QP 39 Y 36.2227 U 40.8607 V 41.2787  3120 bit
   2884 : B T2 L2 Q1 QP 32 Y 37.5044 U 41.1738 V 41.8640 13592 bit
AU 2892 : B T2 L0 Q0 QP 37 Y 30.6174 U 35.5510 V 33.7407  3656 bit
   2892 : B T2 L0 Q1 QP 32 Y 33.8842 U 38.1117 V 36.2960  4824 bit
   2892 : B T2 L0 Q2 QP 24 Y 39.3896 U 41.6758 V 40.5997 18144 bit
   2892 : B T2 L1 Q0 QP 39 Y 33.8696 U 39.3946 V 37.9257   968 bit
   2892 : B T2 L1 Q1 QP 32 Y 35.6278 U 39.8354 V 38.6254  6024 bit
   2892 : B T2 L1 Q2 QP 24 Y 40.3429 U 42.6400 V 42.5590 46960 bit
   2892 : B T2 L2 Q0 QP 39 Y 35.3276 U 40.2062 V 40.5216  3408 bit
   2892 : B T2 L2 Q1 QP 32 Y 36.8851 U 40.6520 V 41.2838 17648 bit
AU 2882 : B T3 L0 Q0 QP 38 Y 31.4220 U 36.8357 V 34.8862  1688 bit
   2882 : B T3 L0 Q1 QP 33 Y 34.4439 U 39.1467 V 37.2437  2040 bit
   2882 : B T3 L0 Q2 QP 25 Y 39.4760 U 42.5234 V 41.0416  7880 bit
   2882 : B T3 L1 Q0 QP 40 Y 34.9551 U 40.3164 V 38.8157   952 bit
   2882 : B T3 L1 Q1 QP 33 Y 36.3176 U 40.8984 V 39.7228  2840 bit
   2882 : B T3 L1 Q2 QP 25 Y 40.2226 U 43.2036 V 43.1572 21008 bit
   2882 : B T3 L2 Q0 QP 40 Y 36.0387 U 40.6206 V 41.3910  3128 bit
   2882 : B T3 L2 Q1 QP 33 Y 37.1046 U 41.0870 V 42.2059  7344 bit
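The adaptation-oriented aggregation illustrated by Listing 6.1 can be sketched as a single greedy pass over the NALU sequence. The names and the (layer, size) representation are illustrative; the layer key is whatever identifies "same scalability layer" for the chosen aggregation mode (e.g., the temporal and spatial IDs, but never the MGS Q field):

```python
MTU = 1400  # maximum aggregated packet size in bytes

def aggregate(nalus, max_size=MTU):
    """Greedily merge consecutive NALUs of the same scalability layer.

    `nalus` is a list of (layer_key, size_in_bytes) pairs, e.g.
    (('T1', 'L0'), 738). Only consecutive NALUs with an identical key
    are merged, and an aggregate never exceeds `max_size`.
    """
    packets = []
    for layer, size in nalus:
        if packets and packets[-1][0] == layer and packets[-1][1] + size <= max_size:
            packets[-1][1] += size          # extend the current aggregate
        else:
            packets.append([layer, size])   # start a new packet
    return packets
```

As in the text, an aggregate is subsequently treated as one media fragment: the proxy either keeps or drops it as a whole, and it carries a single gBSD or GSH description.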
For codec-agnostic packetization, the packetizer assumes a fixed size for the packet and simply copies that many bytes from the media stream into the packet payload. This requires an adaptation proxy to inspect each byte of the payload in order to detect the end of the media fragment, i.e., the start code of the next media fragment. Note that this byte-per-byte inspection is not needed for BSAC, since the length of a BSAC AU is provided in its header, as described in Section 2.2.4, which allows skipping all bytes until the next BSAC AU starts. Additionally, the adaptation proxy cannot be stateless given this kind of packetization, since a media fragment may span multiple packets. Codec-agnostic packetization is only implemented for the codec-specific adaptation mechanisms. Note that while such a mechanism may not be implemented in a production system, it enabled us to see the performance difference compared to such a basic approach. We further note that this mechanism always assumes a packet size of 5600 bytes. We have chosen this multiple of a regular MTU (1400 bytes) in order to clearly see the performance difference of such a fixed-size approach to packetization. This was the first type of packetization which we implemented for this framework and it should only be treated as a proof of concept.
Whenever a prefix NALU appears in the SVC bitstream, the UDP server follows the recommendation of the SVC RTP payload format [47] of packetizing this prefix NALU together with the following AVC NALU. The server supports the generation of gBSD PUs and the generation of the GSH. A gBSD PU is generated for the contents of each media fragment and is sent on a separate socket. Two different compression mechanisms, i.e., the generic zLib [8] mechanism and the XML-specific BiM approach [51], are supported in addition to uncompressed plain text. The synchronization of gBSD and media packets relies on the packet order, with the assumption of not having any packet loss.
A different synchronization mechanism, which relied on RTP timestamps as described in Sections 4.4 and 4.5, was implemented before. However, it was not used in this implementation, since the aim of this framework is to minimize complexity in order to compare the throughput and metadata overhead of in-network adaptation for our three adaptation approaches.
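The byte-per-byte inspection required by codec-agnostic packetization can be sketched as a stateful scanner. The three-byte start code and the class name are illustrative (real start codes are codec-specific); the essential point is the buffering of a partial fragment between fixed-size payloads:

```python
START_CODE = b"\x00\x00\x01"  # illustrative MPEG-style start code

class FragmentScanner:
    """Stateful scanner for fixed-size packet payloads.

    A media fragment may span several packets, so the bytes after the
    last complete fragment are buffered until the next payload arrives.
    """
    def __init__(self):
        self.buf = b""

    def feed(self, payload: bytes):
        self.buf += payload
        fragments = []
        # A fragment runs from one start code to the byte before the next.
        pos = self.buf.find(START_CODE)
        while pos != -1:
            nxt = self.buf.find(START_CODE, pos + len(START_CODE))
            if nxt == -1:
                break  # current fragment is still incomplete
            fragments.append(self.buf[pos:nxt])
            pos = nxt
        self.buf = self.buf[pos:] if pos != -1 else self.buf
        return fragments
```

Because the trailing bytes are buffered, the scanner also handles a start code that is itself split across two payloads, which is exactly why this packetization mode forces the proxy to keep per-stream state.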
[8] zLib v1.23, http://www.zlib.net/
6.6.1.2 Adaptation node(s)
Several adaptation proxies were implemented which support the different mechanisms presented in this thesis, i.e., a codec-specific adaptation proxy which interprets the codec-specific header of each media packet for adaptation, a codec-agnostic GSH-based adaptation proxy which interprets the GSH for adaptation, and a gBSD-based adaptation proxy which applies gBSD-based adaptation. We measure the performance of each adaptation proxy in our test runs. All tests were performed on an IBM Thinkpad T60P with an Intel T2600 CPU and 2048 MB of RAM using Fedora Core 6 Linux with Kernel version 2.6.20 as the operating system. CPU measurements were performed using the ps tool and time measurements were performed based on the gettimeofday method.
The codec-specific adaptation node:
1. captures the packet based on Linux firewall mechanisms as described in Section 4.5.4.1;
2. interprets the NALU, VOP or BSAC header(s); in the case of codec-agnostic packetization the whole payload of the packet needs to be searched (except for BSAC, as described above);
3. modifies the packet’s data (dropping of NALUs, VOPs, or BSAC enhancement layers);
4. updates the UDP header according to the modified data (checksum, length);
5. forwards the modified packet to the client or drops the complete packet.
The GSH-based adaptation node works as specified in Section 5.6 and the gBSD-based adaptation node works as specified in Section 4.5.4.
6.6.1.3 UDP client
The UDP client receives the media, removes the GSH if necessary and stores the media stream to the hard disk, where it can be decoded in order to validate it.
6.6.1.4 Test content
Table 6.3 provides an overview of the test content. Our test set includes seven different types of content. There is a single test content each for BSAC and VES since their scalability options are quite limited. For SVC (encoded with JSVM version 8.12) we have five contents
since SVC, in contrast, provides scalability in several dimensions. The result is a comprehensive test set which covers all of these dimensions. All test contents have the same frame rate (30 fps) and AU count (3000).

In addition to the media characteristics, Table 6.3 already includes a static evaluation of how beneficial our aggregation mechanism is for the different test contents, i.e., to which extent the number of packets can be reduced at different aggregation granularities. The efficiency of the aggregation depends on three factors:

1. The size of the aggregated packet.
2. The size of the media fragments in the bitstream, which is influenced by the media quality; e.g., for the VES content an aggregation up to 1400 bytes only results in 75 fewer packets, since most B VOPs are larger than 1400 bytes.
3. The scalability features. For example, SVC 5 has many scalability layers, which makes our adaptation-oriented aggregation highly applicable. SVC 1, on the other hand, only offers three temporal layers and therefore fewer opportunities for our adaptation-oriented aggregation approach to be applied.
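The adaptation-oriented aggregation can be sketched as a greedy packer that respects both a byte budget and scalability-layer boundaries. The following is an illustrative sketch under these assumptions, not the server implementation; the fragment sizes and layer ids are invented:

```python
# Illustrative sketch (not the thesis implementation): greedy,
# adaptation-oriented aggregation of media fragments into packets.
# Fragments belonging to different scalability layers are kept in
# separate packets, so that a packet can still be dropped as a whole.

def aggregate(fragments, max_packet_size):
    """fragments: list of (size_in_bytes, layer_id) in bitstream order.
    Returns a list of packets, each a list of fragment indices."""
    packets = []
    current, current_size, current_layer = [], 0, None
    for idx, (size, layer) in enumerate(fragments):
        # Start a new packet if the byte budget would be exceeded or
        # the fragment belongs to a different scalability layer.
        if current and (current_size + size > max_packet_size
                        or layer != current_layer):
            packets.append(current)
            current, current_size = [], 0
        current.append(idx)
        current_size += size
        current_layer = layer
    if current:
        packets.append(current)
    return packets

# Invented example: ~1 kB base-layer fragments, one large fragment,
# one enhancement-layer fragment.
frags = [(1000, 0), (1000, 0), (1000, 0), (2000, 0), (1000, 1)]
print(len(aggregate(frags, 1400)))  # 5: no two fragments fit the budget
print(len(aggregate(frags, 5600)))  # 2: base layers packed together
```

This mirrors the behavior observed for the VES content: when most fragments exceed the byte budget, a small budget yields almost no reduction in packet count, while a larger budget lets fragments of the same layer share packets.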
6.6.1.5 Test runs
Generally, we are interested in the characteristics of the three different adaptation mechanisms and their performance under the various aggregation / packetization modes which are supported by the server. Specifically, the following characteristics are evaluated:

• Throughput.
• Metadata overhead.
• Load distribution of gBSD-based adaptation nodes.

Each of these characteristics is evaluated for:

• Each test content (BSAC, SVC, VES).
• An increasing number of concurrent streams.
• Each adaptation mechanism (codec-specific, GSH-based, gBSD-based).
Table 6.3: Characteristics of test data for performance comparison of different adaptation mechanisms

                           BSAC   VES    SVC 1  SVC 2  SVC 3  SVC 4  SVC 5
Media size [kB]            1809   9820   8219   7907   27682  18790  29195
Nr. of AUs                 3000   3000   3000   3000   3000   3000   3000
Average AU size [kB]       0.6    3.27   2.74   2.64   9.23   6.26   9.73
Nr. of NALUs               N/A    N/A    6005   6005   12007  12007  27017
Nr. of VCL NALUs           N/A    N/A    3000   3000   9000   9000   24000
Nr. of prefix NALUs        N/A    N/A    3000   3000   3000   3000   3000
Nr. of PS/SEI NALUs        N/A    N/A    5      5      7      7      17
Nr. of packets             N/A    3000   3005   3005   9007   9007   24017
Frame rate [fps]           30     30     30     30     30     30     30
GoP size                   N/A    10     4      32     16     16     16
Spatial layers             N/A    1      1      1      3      1      3
Temporal layers            N/A    N/A    3      6      5      5      5
Quality layers             FGS    1      1      1      1      3      3
Packets (no aggregation)   N/A    3000   3005   3005   9007   9007   24017
Packets (agg. 1400 byte)   N/A    2925   2343   2180   5536   4725   8717
Packets (agg. 5600 byte)   N/A    1586   2273   1121   2751   2352   4026
Packets (agg. 10000 byte)  N/A    1111   2252   829    2068   1678   2775
Packets (agg. 20000 byte)  N/A    814    2252   603    1445   1121   1708
• Each metadata (gBSD) compression mechanism (none, zLib, BiM).
• Three different packet sizes for aggregation.
• Codec-specific and codec-agnostic (i.e., fixed-size) packetization.

Note that we also refer to codec-agnostic packetization as fixed-size packetization since, other than for codec-specific adaptation, the size of the packet payload is fixed.

Each test run was repeated 20 times and the maximum deviation of the results was 13.8 percent of the mean value. The first 20 seconds of each test run were ignored in order to avoid any deviations caused by program initialization.

We always consider the case where no adaptation takes place (i.e., no packets are dropped or truncated) in our test runs. We still inspect each packet and also, e.g., run gBSDtoBin when applicable. We consider this to be the worst-case scenario, in which the most system resources are needed.

6.6.1.6 Expected findings
We attempt to answer the following questions in this experiment:

• What is the throughput of each adaptation mechanism?
• How does the packet size influence the throughput of each adaptation mechanism?
• Which other costs (i.e., metadata overhead) are connected with each adaptation mechanism?
• How does the packet size influence the metadata overhead for GSH-based and gBSD-based mechanisms?
• To which extent can the gBSD metadata overhead be reduced with different compression mechanisms?
• What is the load distribution of the different steps for in-network gBSD-based adaptation (decompression, transformation, adaptation, compression)?
6.6.2 Results
Figures 6.10, 6.11, 6.12, 6.13, 6.14 and 6.15 show the results of our measurements. For each measurement we identify the type of codec, type of metadata (gBSD or GSH), compression method used for metadata (if applicable) and aggregation value in bytes. For example,
“SVC (gBSD/plain) s1 t6 q1 g32 1400” indicates the SVC 2 test content (SVC with one spatial layer, six temporal layers, one quality layer and a GoP size of 32) with uncompressed gBSD metadata (gBSD/plain) and aggregation up to 1400 bytes ( 1400). For codec-agnostic packetization with packets of a fixed size of 5600 bytes we use “ fixed” in the identifier.

For the throughput measurements we use the metric “kbps/CPU percent”, which requires further elaboration. As described in Section 6.6.1.5, we perform each test run for an increasing number of concurrent streams. Ideally, the results would be plotted as curves which are linear up to a high CPU load and then flatten out as the 100 percent mark is approached. However, given all possible test runs, the number of resulting curves (and thus figures) would be prohibitive. We therefore stayed below 60 percent CPU load in our test runs, where the curve is almost linear. The throughput for each number of concurrent streams was divided by the corresponding CPU load, which yields the “kbps/CPU percent” metric.

Since we perform each test run for an increasing number of streams, we could validate that the behavior is linear by comparing the resulting “kbps/CPU percent” values. For example, assume a single stream has a bitrate of 1000 kbps and causes a CPU load of 5 percent at the adaptation proxy. According to our metric this results in 200 kbps/CPU percent. With two concurrent streams (i.e., 2000 kbps) the CPU load would rise to 10 percent at the adaptation proxy, which again results in 200 kbps/CPU percent. This linearity is enabled by the simple architecture of our adaptation proxy and by the fact that we send multiple media streams over the same socket, as described above. The overhead of managing multiple streams is therefore very low, and in fact the deviation of the kbps/CPU percent values from their mean value is always below 10 percent.
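The metric and the linearity check can be expressed compactly; the sample values below are invented to mirror the worked example, not measured data:

```python
# Illustrative computation of the "kbps/CPU percent" metric and the
# linearity check described above (invented values, not measured data).

def kbps_per_cpu(samples):
    """samples: (total_bitrate_kbps, cpu_load_percent) pairs, one per
    tested number of concurrent streams (CPU load kept below 60 percent)."""
    return [kbps / cpu for kbps, cpu in samples]

def max_deviation_from_mean(values):
    """Relative deviation of the largest outlier from the mean."""
    mean = sum(values) / len(values)
    return max(abs(v - mean) / mean for v in values)

# One stream: 1000 kbps at 5 percent CPU -> 200; two streams: 2000 kbps
# at 10 percent CPU -> again 200; a third run deviates slightly.
metrics = kbps_per_cpu([(1000, 5.0), (2000, 10.0), (3000, 15.2)])
print(metrics[0])                               # 200.0
print(max_deviation_from_mean(metrics) < 0.10)  # True: linear within 10 percent
```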
That is, for any test run with an increasing number of streams up to 60 percent CPU load, the deviation of the kbps/CPU percent values from the mean value of that test run was always below 10 percent. This deviation figure is specific to the throughput measurements. Note that all figures show the bitrate of the incoming streams at the adaptation proxy.

Figure 6.10 shows the CPU load distribution for gBSD-based adaptation for each PU / media fragment. The measurements show how long each step of the adaptation process, i.e., transformation, adaptation and optional compression / decompression of the PU, takes on average. The measurements already include the optimizations to the gBSD-based adaptation approach, i.e., the usage of regular expressions for the transformation of the gBSD, which were introduced in Section 4.5.4.3 and evaluated in Section 6.4. The optimization described in Section 4.5.4.4, i.e., when performing the adaptation at a fine granularity, the
complete media fragment (e.g., a NALU) is either kept or dropped so that the gBSDtoBin process can be avoided, is not yet applied here. The results of this optimization are, however, shown in Figure 6.11.

Subsequently, Figures 6.12, 6.13 and 6.14 show the throughput of codec-specific, gBSD-based and GSH-based adaptation nodes, including metadata overhead, in kbps/CPU percent. Again we show both the unoptimized and optimized measurements for the gBSD-based adaptation approach in order to illustrate the performance gain which is achieved. Finally, Figure 6.15 shows the metadata overhead in percent (calculated over the complete bitstream) for both GSH-based and gBSD-based adaptation.
Figure 6.10: CPU load distribution for gBSD-based adaptation
Figure 6.11: CPU load distribution for gBSD-based adaptation (optimized by avoiding gBSDtoBin)
Figure 6.12: Throughput of gBSD-based adaptation
Figure 6.13: Throughput of gBSD-based adaptation (optimized by avoiding gBSDtoBin)
Figure 6.14: Throughput of codec-specific and GSH-based adaptation
Figure 6.15: Metadata overhead of GSH-based and gBSD-based adaptation
6.6.3 Discussion of results
Generally, all of the different adaptation nodes benefit from the aggregation of packets (if the encoding allows it), i.e., fewer larger packets are better than many smaller packets. This holds both in terms of metadata overhead (which is reduced by aggregation) and throughput (which is increased). However, it comes at the cost of adaptation flexibility, since aggregation makes the adaptation more coarse-grained. The BSAC codec, which has small packets, would benefit most from aggregation; however, this is not easily possible because of the fine-grained scalability structure of the BSAC packets as described in Section 2.2.4. To enable aggregation for BSAC, the BSAC base layer and the BSAC enhancement layer(s) would have to be packetized into different packets and the packets / AUs reordered, i.e., additional processing would be needed on the server and the client.

The metadata overhead for uncompressed gBSD PUs is considerable, in particular for BSAC. For SVC and VES it can be reduced by aggregation (if the encoding allows it) to a minimum of around 10 percent, as Figure 6.15 shows. Further reduction might be achieved by describing the media at a more coarse-grained level; however, this would exclude the optimization which we describe in Section 4.5.4.4, i.e., gBSDtoBin could no longer be avoided. A mixed approach is possible, where the PUs are transmitted at a coarse-grained level (e.g., GoP) but processed at a fine-grained level (e.g., NALU). This results in additional processing overhead at the adaptation node for fragmenting the PUs, but it minimizes transport overhead and also reduces processing overhead in other steps as described in Section 4.5.4.4. Such a mixed approach is proposed in [19].

Compression of the metadata, in particular together with aggregation, reduces the metadata overhead to well below 10 percent, and in particular the BiM approach results in an insignificant metadata overhead.
However, this comes at the price of processing overhead, as shown in Figures 6.12 and 6.13. BSAC is an exception here: while the BSAC AUs are quite small, the metadata overhead is considerable due to the scalability properties of BSAC, which need to be described in order to exploit them. Even with BiM compression (the processing overhead of which considerably limits the throughput), the metadata overhead remains considerable.

The processing overhead caused by the two compression mechanisms results in a smaller throughput (including the metadata overhead) than the approach without compression, as shown in Figures 6.12 and 6.13. While uncompressed gBSD PUs result in the highest
throughput of the adaptation node, one might still consider using one of the compression mechanisms, in particular when bandwidth represents the bottleneck. Figures 6.10 and 6.11 show that compression and decompression take a considerable amount of CPU time, particularly for the optimized approach where gBSDtoBin is not needed (for SVC and VES). This clearly motivates considering additional compression mechanisms, with a special focus on those which have a low CPU footprint.

For gBSD-based adaptation, the actual media adaptation is the most expensive part of the adaptation process. The larger the media packet, the longer the media adaptation process takes. For BSAC the media adaptation process takes a considerable amount of time even though the AUs are very small. This is due to the 10 FGS layers which are described in a BSAC gBSD, i.e., 10 memory cut/copy operations are performed on each media packet instead of just one. Similarly, the larger the (uncompressed) gBSD PU, the longer the transformation takes. ZLib decompression is very efficient and takes only a very small amount of time. ZLib compression takes more time, but still not as much as BiM compression.

The usage of regular expressions reduces the transformation time for SVC/VES from an average of 180 microseconds to an average of 20 microseconds (and by a similar factor for BSAC), which roughly doubles the throughput of our adaptation node. Our optimization, where the gBSDtoBin process can be avoided, saves between 200 and 1000 microseconds for each AU/PU, as Figures 6.10 and 6.11 show. This increases the throughput by another factor of 3 to 10, as can be seen in Figures 6.12 and 6.13. The results of Figures 6.10 and 6.11 also show that the media adaptation cost strongly depends on the size of the media fragments, while the compression and transformation tasks show no strong relationship to the size of the PUs.
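The per-layer cost mentioned above can be made concrete with a small sketch: a description-driven truncation keeps one byte range per retained layer, so each kept layer costs one copy operation. The offsets below are invented for illustration; a BSAC gBSD describes 10 FGS layers, i.e., up to 10 such copies per AU:

```python
# Sketch of why more described layers mean more work: a gBSD-driven
# adaptation keeps one byte range per retained layer, so truncating an
# AU costs one copy per kept layer. Offsets are invented for illustration.

def truncate_au(au: bytes, layer_ranges, keep_layers):
    """layer_ranges: (start, end) byte offsets per layer, in layer order.
    keep_layers: how many leading layers to retain."""
    out = bytearray()
    for start, end in layer_ranges[:keep_layers]:
        out += au[start:end]        # one copy operation per kept layer
    return bytes(out)

au = bytes(range(100))
ranges = [(0, 40), (40, 60), (60, 80), (80, 100)]  # base + 3 enhancement layers
adapted = truncate_au(au, ranges, keep_layers=2)
print(len(adapted))  # 60: base layer plus first enhancement layer
```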
A very interesting observation, apparent when comparing Figures 6.12 and 6.13, is that for the higher throughputs of the optimized architecture the gap between BiM-based and zLib-based adaptation throughput becomes narrower. In fact, in some cases the throughput of BiM-based adaptation equals or slightly exceeds that of zLib-based adaptation.

For SVC/VES the throughput of the codec-specific adaptation proxies is still around ten times higher than that of the codec-agnostic gBSD-based adaptation proxy (the gap is considerably larger for BSAC, for the above-mentioned reasons). However, given the achieved throughputs of 200 to 400 Mbps at 50 percent CPU load on a regular desktop computer and the qualitative advantages, such as the lower implementation complexity of the codec-agnostic
adaptation mechanism, we still conclude that gBSD-based adaptation is a very viable alternative to conventional, codec-specific adaptation. As expected, the fixed-size packetization mode with packets of a fixed size of 5600 bytes generally performs worst, except for BSAC, because of a) the generally very small BSAC AUs and b) the fact that for BSAC the payload does not need to be inspected byte by byte.

Like the gBSD-based adaptation approach, the GSH-based adaptation approach also enables codec-agnostic adaptation in the network. However, due to its more efficient representation and processing, it incurs a lower performance cost and metadata overhead. In fact it is only around 1.25 times slower than the codec-specific approach, making it even more attractive than the gBSD-based adaptation approach. The metadata overhead of the GSH-based adaptation approach is always below two percent. In particular, Figure 6.14 illustrates that the metadata overhead due to the GSH is generally insignificant; in fact its contribution to the overall bitrate is so small that it is sometimes barely visible in the figure. Again, BSAC is the exception here, for the same reasons as those detailed for the gBSD above. The GSH-based adaptation approach is very attractive, and future work will further evaluate its applicability to additional application scenarios.
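The stateless filtering style that makes the GSH-based node so cheap can be sketched as follows. The two-byte header layout below is entirely hypothetical; the actual GSH syntax is defined in Section 5.3 and differs from this simplification:

```python
# Hypothetical, simplified stand-in for a GSH-style header: one byte each
# for the temporal and quality layer id in front of the payload. The real
# GSH syntax (Section 5.3) differs; this only illustrates how a stateless
# node can filter packets without any codec knowledge.

def filter_packet(packet: bytes, max_temporal: int, max_quality: int):
    """Return the payload if the packet survives adaptation, else None."""
    temporal, quality = packet[0], packet[1]
    if temporal > max_temporal or quality > max_quality:
        return None                 # drop the whole packet
    return packet[2:]               # strip the header, forward the payload

base = bytes([0, 0]) + b"base-layer data"
enh  = bytes([2, 1]) + b"enhancement data"
print(filter_packet(base, max_temporal=1, max_quality=0) is not None)  # True
print(filter_packet(enh,  max_temporal=1, max_quality=0))              # None
```

The decision requires only a fixed-offset read per packet, which is why this style of adaptation comes so much closer to codec-specific performance than XML-based processing.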
6.7 Conclusions and original contributions
In this chapter we evaluated the novel concepts introduced in this thesis. We started with the individual processes, evaluating fragmentation, compression and transformation performance. We then moved on to evaluating the regular proxy approach. Finally, we compared the intercepting proxy approach, for which we implemented both the GSH-based and gBSD-based adaptation mechanisms, to traditional, codec-specific adaptation approaches.

The results show that the codec-agnostic, GSH-based adaptation approach suffers only minimal performance penalties due to metadata overhead and is, given its lower complexity, a very viable alternative to codec-specific approaches. For the gBSD-based adaptation approach, the performance penalties are considerably higher, but they can be reduced by several optimizations, such as our novel approach to the gBSD transformation based on regular expressions. However, the major reduction in performance penalties is achieved by reducing the adaptation granularity to the case where media fragments are no longer truncated but only dropped or kept. This approach, which is possible for SVC and VES, allows the gBSDtoBin process to be skipped, resulting in a major performance gain. This comes
at the cost of additional, but tolerable, metadata overhead. Overall, the results show that, given the qualitative advantages of the codec-agnostic adaptation approaches, they are a viable alternative to existing, codec-specific adaptation mechanisms.
CHAPTER 7
Summary and conclusion
This thesis focuses on codec-agnostic dynamic and distributed adaptation of scalable multimedia content. To this end, Part I introduces this research area. First, different types of scalable media, i.e., three scalable video codecs and one scalable audio codec, are introduced. For SVC, we describe adaptation paths which steer the order in which enhancement layers are dropped. We subsequently describe the difference between traditional codec-specific adaptation and codec-agnostic adaptation. Finally, we introduce different types of adaptation nodes which differ with regard to their efficiency and adaptation flexibility. This is relevant for subsequent chapters, where we show how our novel mechanisms support these different types of adaptation nodes.

This part concludes by introducing an existing codec-agnostic adaptation mechanism which is based on the idea of transferring the adaptation into the XML domain. This adaptation mechanism attempts to provide all of the codec-specific information as XML descriptions together with the media content, thus allowing for codec-agnostic adaptation. This approach is referred to as gBSD-based adaptation, since it relies on a generic Bitstream Syntax Description (gBSD) for describing the media content in the XML domain. As described in this part, this is conceptually very pleasing, since it follows a modular approach, fulfilling the different requirements with specific tools. While this modular approach does not result in the most efficient solution, as shown in Part III, it provides the conceptual foundation for the novel mechanisms in this thesis, which are described in Part II.

In Part II we therefore start by extending this adaptation mechanism towards our goal, i.e., towards dynamic and distributed adaptation. In order to realize this, special focus is put on the treatment of the sometimes large gBSD metadata which is introduced by this
adaptation mechanism. To this end, Chapter 4 introduces novel mechanisms for the fragmentation, storage and transport of content-related XML metadata. In particular, Section 4.3 introduces the concept of samples for metadata by employing Streaming Instructions, which steer the fragmentation of, and provide timing for, XML-based metadata. The Streaming Instructions extend an XML metadata document with additional attributes which describe the fragmentation and timing of media data and XML metadata, so as to enable their synchronized delivery and processing. In addition, a style sheet approach provides the opportunity to dynamically set such streaming properties without actually modifying the metadata. This enables the synchronized processing of such a metadata stream with the described media samples.

We further explore this topic in Section 4.4 by investigating how such metadata streams can be stored in the ISO Base Media File Format for later processing. Additionally, the applicability of the Real-Time Transport Protocol (RTP) to the transport of such metadata streams is analyzed. Section 4.5 concludes the chapter with a description of a codec-agnostic adaptation framework, including several optimizations to this adaptation mechanism. This chapter shows that it is possible to enhance the static gBSD-based adaptation approach towards dynamic and distributed application scenarios. We further show that specific types of metadata, such as the gBSD, have requirements very similar to those of media data when processed in streaming scenarios.

Subsequently, Part II introduces an alternative mechanism for codec-agnostic adaptation. This mechanism is based on the findings and experience gained in our work on the gBSD-based codec-agnostic adaptation approach. It relies on a novel binary header in order to enable codec-agnostic adaptation of media content.
This Generic Scalability Header (GSH) prefixes each media packet payload. It provides information on both the bitstream syntax and the adaptation options. Its aim is to enable codec-agnostic adaptation at a considerably lower performance cost than gBSD-based adaptation. In Section 5.3 we introduce the syntax and semantics of the GSH, and subsequently we show examples of how various scalable media codecs can be described by the GSH in Section 5.4. Section 5.5 describes how various types of adaptation nodes are supported by the GSH, ranging from stateless and simple to stateful and feature-rich. The chapter again concludes with a description of a codec-agnostic adaptation framework which is realized based on the concepts presented in Chapter 5. We show in this chapter how the metadata needed to accomplish codec-agnostic adaptation can be reduced compared to
the gBSD-based adaptation approach, and that the GSH approach is applicable to various types of adaptation nodes, i.e., stateless, stateful and application-aware adaptation nodes.

Finally, Part III concludes this thesis by evaluating the novel concepts introduced in the previous part. It starts by evaluating the Streaming Instructions processors as stand-alone implementations, where the maximum throughput of these processors is measured. The results of this section show that the processors are able to process at least 20 concurrent streams of the test content at 30 FPS on a regular desktop computer. Section 6.3 continues by evaluating three compression mechanisms for the gBSD metadata of the gBSD-based adaptation mechanism. The results show that the XML-specific BiM compression mechanism provides the highest compression factors for smaller metadata samples, while it is outperformed by the other compression mechanisms for larger metadata samples. In Section 6.4 we evaluate different approaches to the transformation of metadata samples, which is part of the gBSD-based adaptation mechanism. We compare our novel approach for this transformation, which relies on regular expressions, to traditional approaches. The results from this evaluation show that the approach based on regular expressions performs better (both in terms of throughput and memory consumption) than the alternatives.

After evaluating these specific contributions, Part III moves on to evaluating complete adaptation nodes which are implemented based on the adaptation mechanisms introduced in this thesis. Most notably, the measurements show that for MPEG-4 Scalable Video Coding and MPEG-4 Visual Elementary Streams the GSH-based mechanism's throughput is only about 1.25 times lower than that of the codec-specific mechanism, and the metadata overhead is less than 1 percent.
The gBSD-based mechanism comes at a higher cost (about 10 times lower throughput and a maximum of 10 percent metadata overhead). Additionally, we conclude that, depending on the application scenario, and given the qualitative advantages of codec-agnostic adaptation described in Section 2.3, both the gBSD-based and the GSH-based mechanisms are viable alternatives to existing codec-specific adaptation approaches. In particular, in scenarios where contents encoded with diverse (and potentially changing) scalable media codecs need to be adapted, the flexibility of codec-agnostic approaches can outweigh their reduced performance.

It must again be noted that the Properties Style Sheet mechanism presented in this thesis originated from ideas by Sylvain Devillers. The other innovative mechanisms described in this thesis originate from the author's ideas. However, several of
them were submitted to MPEG and went through a standardization process in which several MPEG experts contributed to refining the mechanisms. This is the case for the Streaming Instructions and also for the extensions to the ISO Base Media File Format. The Streaming Instructions were also refined during the DANAE EU IST project, where partners contributed to their development. The other mechanisms, in particular the so far unpublished Generic Scalability Header mechanism, represent the author's original ideas which have not yet been reviewed or refined by the scientific community.

As described in the corresponding sections and summarized in the following, our proposals do not come without limitations. We pointed out that while the Streaming Instructions are meant to apply to a wide variety of metadata, their design was very much directed by a specific use case (dynamic and distributed adaptation) and a specific set of metadata (gBSDs). Similarly, we pointed out the deficiencies of our extensions to the ISO Base Media File Format (e.g., the overhead inflicted by the requirement to store PUs) and of our approach to transporting PUs in synchronization with the media fragments, which is currently restricted to RTP. We also emphasized that while the GSH enables a much more performant adaptation node, it is also much less flexible than the gBSD, which may necessitate extensions that are incompatible with the current version when new application scenarios and / or scalable codecs are introduced.

Since some of the mechanisms introduced in this thesis were adopted by MPEG and therefore now represent an industry standard, the obvious question of industry adoption arises. While the author is not aware of industry adoption, the mechanisms were implemented, demonstrated and evaluated within the DANAE EU IST project. Additionally, they are implemented in MPEG Reference Software which was tested for conformance.
Given that it takes considerable time before a new standard is adopted by industry, as the example of MPEG-4 shows, the author remains optimistic that industry adoption of the presented mechanisms will take place.

We foresee several smaller topics which should be covered in future work. Evaluations of the Properties Style Sheet mechanism should take place to see whether it can provide an additional performance gain. We expect such a gain, since the Properties Style Sheet, unlike Streaming Instructions included in the gBSD, only needs to be parsed once in order to fragment the complete gBSD. However, this is only feasible if the composition of all PUs for a gBSD is the same and if the timing of all PUs is linear, i.e., if the time stamp difference between any two PUs is always the same.
The fact that for high throughput the performance gap between zLib and BiM compression narrows is quite interesting and should be analyzed further. Generally, our extensions of the gBSD-based adaptation mechanism should be further evaluated. Most notably, we mainly used generic gBSDs to evaluate our mechanisms in this thesis, and not codec-specific BSDs. However, we expect similar results for BSDs; in fact, during the MPEG standardization process of the Streaming Instructions, evaluations also included BSDs, which showed the Streaming Instructions' applicability to BSDs. Further quantitative evaluations should also consider the related work presented in Section 4.2. It would be particularly interesting to apply the Streaming Instructions to alternatives to the gBSD, such as XFlavor, BFlavor or gBFlavor. Similar evaluations may be conducted for the remaining related work introduced in Section 4.2. Finally, we aim to test the applicability of the GSH-based adaptation mechanism to additional scalable media.
Bibliography

[1] D. Singer, M. Zubair Visharam, Y. Wang, and T. Rathgen, editors. ISO/IEC 14496-15:2004/Amd 2: SVC File Format. International Standardization Organization, 2007.

[2] D. Singer and M. Visharam. MPEG-4 File Formats white paper. Technical report, International Standardization Organization, October 2005. http://www.chiariglione.org/mpeg/technologies/mp04-ff/.

[3] C. Timmerer, S. Devillers, and M. Ransburg, editors. ISO/IEC 21000-7:2007 Part 7: Digital Item Adaptation 2nd Edition. International Standardization Organization, 2007.

[4] M. Ransburg and H. Hellwagner. Generic Streaming of Multimedia Content. In Proc. IASTED International Conference on Internet and Multimedia Systems and Applications, Grindelwald, Switzerland, February 2005.

[5] M. Ransburg, C. Timmerer, and H. Hellwagner. Transport Mechanisms for Metadata-driven Distributed Multimedia Adaptation. In Proc. First International Conference on Multimedia Access Networks, pages 25–29, July 2005.

[6] C. Timmerer, S. Devillers, and M. Ransburg, editors. ISO/IEC 21000-7:2004/Amd 2: Dynamic and Distributed Adaptation. International Standardization Organization, 2006.

[7] M. Ransburg and D. Singer, editors. ISO/IEC 14496-12:2005/Amd 1: Description of Timed Metadata. International Standardization Organization, 2005.

[8] M. Ransburg, editor. ISO/IEC 14496-4:2002/Amd 24: File Format Conformance. International Standardization Organization, 2002.
[9] M. Ransburg and S. Devillers. Workplan for CE on AdaptationQoS terms for stream timing. International Standardization Organization, January 2006. MPEG Output Document N7860.

[10] M. Ransburg, T. Thang, and S. Devillers. Workplan for CE on DIA extensions for BSD-based adaptation. International Standardization Organization, January 2006. MPEG Output Document N7859.

[11] DANAE Consortium. Whitepaper on MPEG-21 Digital Item Adaptation. International Standardization Organization, April 2006. MPEG Output Document N8083.

[12] A. Hutter, P. Amon, G. Panis, E. Delfosse, M. Ransburg, and H. Hellwagner. Automatic Adaptation of Streaming Multimedia Content in a Dynamic and Distributed Environment. In Proc. International Conference on Image Processing, Genova, Italy, September 2005.

[13] C. Poppe, M. Ransburg, S. De Zutter, and R. Van de Walle. Interoperable Affective Context Collection using MPEG-21. In Proc. International Conference on Wireless, Mobile & Multimedia Networks, Hangzhou, China, November 2006.

[14] M. Ransburg, R. Cazoulat, B. Pellan, C. Concolato, S. De Zutter, and R. Van de Walle. Dynamic and Distributed Adaptation of Scalable Multimedia Content in a Context-Aware Environment. In Proc. European Symposium on Mobile Media Delivery, September 2006.

[15] M. Ransburg, S. Devillers, C. Timmerer, and H. Hellwagner. Processing and Delivery of Multimedia Metadata for Multimedia Content Streaming. In Proc. 6th Workshop on Multimedia Semantics - The Role of Metadata, Aachen, Germany, March 2007.

[16] M. Ransburg, C. Timmerer, H. Hellwagner, and S. Devillers. Design and Evaluation of a Metadata-Driven Adaptation Node. In Proc. International Workshop on Image Analysis for Multimedia Interactive Services, Santorin, Greece, June 2007.

[17] M. Ransburg, H. Gressl, and H. Hellwagner. Efficient Transformation of MPEG-21 Metadata for Codec-agnostic Adaptation in Real-time Streaming Scenarios. In Proc. International Workshop on Image Analysis for Multimedia Interactive Services, Klagenfurt, Austria, May 2008.
[18] M. Mackay, M. Ransburg, D. Hutchison, and H. Hellwagner. Combined Adaptation and Caching of MPEG-4 SVC in Streaming Scenarios. In Proc. International Workshop on Image Analysis for Multimedia Interactive Services, Klagenfurt, Austria, May 2008.
[19] R. Kuschnig, I. Kofler, M. Ransburg, and H. Hellwagner. Design Options and Comparison of In-network H.264/SVC Adaptation. Submitted to Journal of Visual Communication and Image Representation.
[20] M. Granitzer, M. Lux, and M. Spaniol, editors. Multimedia Semantics: The Role of Metadata. Springer, 2008.
[21] M. Ransburg, C. Timmerer, and H. Hellwagner. Dynamic and Distributed Multimedia Content Adaptation based on the MPEG-21 Multimedia Framework. In Multimedia Semantics: The Role of Metadata. Springer, 2008.
[22] T. Wiegand, J. Ohm, G. Sullivan, and A. Luthra. Special Issue on Scalable Video Coding - Standardization and Beyond. IEEE Transactions on Circuits and Systems for Video Technology, 17(9), September 2007.
[23] T. Wiegand, G. Sullivan, H. Schwarz, and M. Wien, editors. ISO/IEC 14496-10:2005/Amd 3: Scalable Video Coding. International Standardization Organization, 2007.
[24] M. Karczewicz and R. Kurceren. The SP- and SI-frames design for H.264/AVC. IEEE Transactions on Circuits and Systems for Video Technology, 13(7):637–644, July 2003.
[25] H. Schwarz, D. Marpe, and T. Wiegand. Analysis of Hierarchical B Pictures and MCTF. In Proc. IEEE International Conference on Multimedia and Expo, Ontario, Canada, July 2006.
[26] G. Sullivan and T. Wiegand. Video Compression - From Concepts to the H.264/AVC Standard. Proceedings of the IEEE - Special Issue on Advances in Video Coding and Delivery, 93(1):18–31, January 2005.
[27] S.-T. Hsiang and J. W. Woods. Embedded image coding using zeroblocks of subband/wavelet coefficients and context modeling. In Proc. Data Compression Conference, May 2000.
[28] ISO/IEC 14496-3:1999 Part 3: MPEG-4 Audio Version 2. International Standardization Organization, 1999.
[29] H. Purnhagen. An Overview of MPEG-4 Audio Version 2. In Proc. 17th International Conference on High-Quality Audio Coding, Florence, Italy, September 1999.
[30] ISO/IEC 14496-2:2003 Part 2: MPEG-4 Visual 3rd Edition. International Standardization Organization, 2003.
[31] T. Ebrahimi and C. Horne. MPEG-4 natural video coding - An overview. Signal Processing: Image Communication, 15(4–5):365–385, January 2000.
[32] M. Kampmann, M. Vorwerk, M. Kleis, S. Schmid, S. Herborn, R. Aguero, and J. Choque. A Multimedia Delivery Framework for Ambient Networks. In Proc. Wireless World Research Forum, Helsinki, Finland, June 2007.
[33] F. Hartung, N. Niebert, A. Schieder, R. Rembarz, S. Schmid, and L. Eggert. Advances in network-supported media delivery in next-generation mobile systems. IEEE Communications Magazine, 44(8):82–89, August 2006.
[34] J. Clark. XSL Transformations (XSLT) Version 1.0. Technical report, W3C, November 1999. W3C Recommendation.
[35] P. Cimprich, O. Becker, C. Nentwich, H. Jirousek, M. Batsis, P. Brown, and M. Kay. Streaming Transformations for XML (STX) Version 1.0. Technical report, April 2007. http://stx.sourceforge.net/documents/spec-stx-20070427.html.
[36] I. Kofler, C. Timmerer, H. Hellwagner, A. Hutter, and F. Sanahuja. Efficient MPEG-21-based Adaptation Decision-Taking for Scalable Multimedia Content. In Proc. 14th Multimedia Computing and Networking Conference, San Jose, USA, January 2007.
[37] A. Vetro and C. Timmerer. Digital Item Adaptation: Overview of Standardization and Research Activities. IEEE Transactions on Multimedia, 7(3):418–426, June 2005.
[38] S. Devillers, C. Timmerer, J. Heuer, and H. Hellwagner. Bitstream Syntax Description-Based Adaptation in Streaming and Constrained Environments. IEEE Transactions on Multimedia, 7(3):463–470, June 2005.
[39] C. Timmerer and H. Hellwagner. Interoperable adaptive multimedia communication. IEEE Multimedia Magazine, 12(1):74–79, January 2005.
[40] D. De Schrijver, W. De Neve, K. De Wolf, R. De Sutter, and R. Van de Walle. An optimized MPEG-21 BSDL framework for the adaptation of scalable bitstreams. Journal of Visual Communication and Image Representation, 18(3):217–239, June 2007.
[41] P. Fox, D. McGuinness, R. Raskin, and K. Sinha. Semantically-Enabled Scientific Data Integration. In Proc. Geoinformatics 2006, May 2006.
[42] R.S. Atarashi, J. Kishigami, and S. Sugimoto. Metadata and new challenges. In Proc. Symposium on Applications and the Internet Workshop, January 2003.
[43] D. Van Deursen, W. De Neve, D. De Schrijver, and R. Van de Walle. BFlavor: an optimized XML-based framework for multimedia content customization. In Proc. 25th Picture Coding Symposium, Beijing, China, April 2006.
[44] D. Hong and A. Eleftheriadis. XFlavor: bridging bits and objects in media representation. In Proc. IEEE International Conference on Multimedia and Expo, Lausanne, Switzerland, August 2002.
[45] W. Bailer and P. Schallauer. Detailed audiovisual profile: enabling interoperability between MPEG-7 based systems. In Proc. 12th International Multi-Media Modeling Conference, Beijing, China, January 2006.
[46] J. van der Meer, D. Mackie, V. Swaminathan, D. Singer, and P. Gentric. RTP Payload Format for Transport of MPEG-4 Elementary Streams. Technical report, Internet Engineering Task Force, November 2003. Proposed Standard, RFC 3640.
[47] S. Wenger, Y. Wang, and T. Schierl. RTP Payload Format for SVC Video. Technical report, Internet Engineering Task Force, January 2008. Internet Draft.
[48] D. Singer, editor. ISO/IEC 14496-12:2005 Part 12: ISO Base Media File Format. International Organization for Standardization, 2005.
[49] K.P. Diepold and F.W. Chang. MPEG-A: Multimedia Application Formats. IEEE Multimedia, 12(4):34–41, October 2005.
[50] E. Y. C. Wong, T. S. Chan, and H. Leong. Semantic-based Approach to Streaming XML Contents using Xstream. In Proc. 27th Annual International Computer Software and Applications Conference, Dallas, TX, USA, November 2003.
[51] U. Niedermeier, J. Heuer, A. Hutter, W. Stechele, and A. Kaup. An MPEG-7 tool for compression and streaming of XML data. In Proc. IEEE International Conference on Multimedia and Expo, Lausanne, Switzerland, August 2002.
[52] S. Pfeiffer, C. Parker, and A. Pang. The Continuous Media Markup Language. Technical report, Internet Engineering Task Force, March 2004. Internet Draft.
[53] S. Pfeiffer, C. Parker, and A. Pang. The Annodex exchange format for time-continuous bitstreams. Technical report, Internet Engineering Task Force, March 2005. Internet Draft.
[54] L. Rutledge. SMIL 2.0: XML for Web multimedia. IEEE Internet Computing, 5(5):78–84, September 2001.
[55] F. Simeoni, D. Lievens, R. Connor, and P. Manghi. Language bindings to XML. IEEE Internet Computing, 7(1):19–27, January 2003.
[56] A. Le Hors, P. Le Hegaret, L. Wood, G. Nicol, J. Robie, M. Champion, and S. Byrne. Document Object Model (DOM) Level 3 Core Specification. Technical report, World Wide Web Consortium, April 2004. W3C Recommendation.
[57] A. Quint. Scalable vector graphics. IEEE Multimedia, 10(3):99–102, July 2003.
[58] M. Amielh and S. Devillers. Bitstream Syntax Description Language: Application of XML-Schema to Multimedia Content. In Proc. 11th International World Wide Web Conference, Honolulu, Hawaii, May 2002.
[59] W. De Neve, D. Van Deursen, D. De Schrijver, S. Lerouge, K. De Wolf, and R. Van de Walle. BFlavor: A harmonized approach to media resource adaptation, inspired by MPEG-21 BSDL and XFlavor. Signal Processing: Image Communication, 21(10):862–889, November 2006.
[60] D. Van Deursen, W. De Neve, D. De Schrijver, and R. Van de Walle. Automatic generation of generic Bitstream Syntax Descriptions applied to H.264/AVC SVC encoded video streams. In Proc. 14th International Conference on Image Analysis and Processing, Modena, Italy, September 2007.
[61] A. Eleftheriadis. Flavor: A language for media representation. In Proc. ACM Multimedia Conference, Seattle, Washington, November 1997.
[62] ISO/IEC 14496-1:2004 Part 1: Systems. International Standardization Organization, 2004.
[63] J. Cowan and R. Tobin. XML Information Set (Second Edition). Technical report, World Wide Web Consortium, February 2004. W3C Recommendation.
[64] T. Sikora. The MPEG-7 visual standard for content description—an overview. IEEE Transactions on Circuits and Systems for Video Technology, 11(6):696–702, June 2001.
[65] J. Clark and S. DeRose. XML Path Language (XPath) Version 1.0. Technical report, World Wide Web Consortium, November 1999. W3C Recommendation.
[66] H. Schulzrinne, S. Casner, R. Frederick, and V. Jacobson. RTP: A Transport Protocol for Real-Time Applications. Technical report, Internet Engineering Task Force, July 2003. Standard, RFC 3550.
[67] ISO/IEC 14496-14:2003 Part 14: MP4 File Format. International Standardization Organization, 2003.
[68] ISO/IEC 14496-15:2004 Part 15: Advanced Video Coding (AVC) file format. International Standardization Organization, 2004.
[69] ISO/IEC 21000-9:2005 Part 9: MPEG-21 File Format. International Standardization Organization, 2005.
[70] ISO/IEC 15444-3:2007 Part 3: Motion JPEG2000 File Format. International Standardization Organization, 2007.
[71] M. Ransburg, C. Timmerer, and H. Hellwagner. Support for Timed Metadata in the MPEG-21 File Format. International Standardization Organization, April 2005. MPEG Input Contribution.
[72] M. Ransburg, C. Timmerer, and H. Hellwagner. Proposed amendments to 14496-12 and/or 21000-9 for storage of timed metadata. International Standardization Organization, July 2005. MPEG Input Contribution.
[73] S. Harrusi, A. Averbuch, and A. Yehudai. XML Syntax Conscious Compression. In Proc. Data Compression Conference, pages 402–411, March 2006.
[74] M. Cokus and D. Winkowski. XML Sizing and Compression Study For Military Wireless Data. In Proc. XML Conference & Exposition, December 2002.
[75] S.J. Davis and I. Burnett. Efficient Delivery within the MPEG-21 Framework. In Proc. First International Conference on Automated Production of Cross Media Content for Multi-Channel Distribution, pages 205–208, Florence, Italy, November 2005.
[76] R. De Sutter, S. Lerouge, W. De Neve, C. Timmerer, H. Hellwagner, and R. Van de Walle. Comparison of XML serializations: cost benefits versus complexity. Multimedia Systems Journal, 12(2):101–115, 2006.
[77] H. Schwarz, D. Marpe, and T. Wiegand. Overview of the Scalable H.264/MPEG4-AVC Extension. In Proc. International Conference on Image Processing, October 2006.
[78] H. Schulzrinne, A. Rao, and R. Lanphier. Real Time Streaming Protocol (RTSP). Technical report, Internet Engineering Task Force, April 1998. Proposed Standard, RFC 2326.
[79] M. Handley, V. Jacobson, and C. Perkins. SDP: Session Description Protocol. Technical report, Internet Engineering Task Force, July 2006. Proposed Standard, RFC 4566.
[80] R. Russell. Linux 2.4 Packet Filtering HOWTO. Technical report, January 2002. http://www.netfilter.org/documentation/HOWTO/packet-filtering-HOWTO.html.
[81] IEEE Std 1003.1. IEEE, 2004.
[82] N. Chomsky. Three models for the description of language. IRE Transactions on Information Theory, 2(3):113–124, 1956.
[83] S. Kepser. A Simple Proof for the Turing-Completeness of XSLT and XQuery. In Proc. Extreme Markup Languages, Montreal, Quebec, August 2004.
[84] J. Postel. Internet Protocol. Technical report, Internet Engineering Task Force, December 1981. Standard, RFC 791.
[85] S. Blake, F. Baker, and D. Black. Definition of the Differentiated Services Field (DS Field) in the IPv4 and IPv6 Headers. Technical report, Internet Engineering Task Force, December 1998. Proposed Standard, RFC 2474.
[86] B. Braden, D. Clark, J. Crowcroft, B. Davie, S. Deering, D. Estrin, S. Floyd, V. Jacobson, G. Minshall, C. Partridge, L. Peterson, K. Ramakrishnan, S. Shenker, J. Wroclawski, and L. Zhang. Recommendations on Queue Management and Congestion Avoidance in the Internet. Technical report, Internet Engineering Task Force, April 1998. Internet Draft.
[87] C. Timmerer, T. Frank, and H. Hellwagner. Efficient processing of MPEG-21 metadata in the binary domain. In Proc. SPIE International Symposium ITCom 2005 on Multimedia Systems and Applications VIII, Boston, Massachusetts, October 2005.
[88] C. Timmerer, P. Lederer, and H. Kosch. Transforming MPEG-21 generic Bitstream Syntax Descriptions within the Binary Domain. In Proc. Fourth International Workshop on Content-Based Multimedia Indexing, Riga, Latvia, June 2005.