Design, Implementation and Evaluation of a Variable Bit-Rate Continuous Media File Server

by

Dwight J. Makaroff
M.Sc., University of Saskatchewan, 1988

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF
Doctor of Philosophy
in
THE FACULTY OF GRADUATE STUDIES
(Department of Computer Science)

We accept this thesis as conforming to the required standard

The University of British Columbia
September 1998

© Dwight Makaroff, 1998
Abstract

A Continuous Media File Server (CMFS) is a computer system that stores and retrieves data that is intended to be presented to a client application continuously over time. The primary examples of this kind of data are audio and video, although any other type of time-dependent media can be included (closed-caption text, presentation slides, etc.). The presentation of these media types must be performed in real-time and with low latency for user satisfaction. This dissertation describes the design, implementation and performance analysis of a file server for variable-bit-rate (VBR) continuous media. A CMFS has been implemented on a variety of hardware platforms and tested within a high-speed network environment. The server is designed to be used in a heterogeneous environment and is linearly scalable. A significant aspect of the design of the system is the detailed consideration of the variable bit-rate profile of each data stream in performing admission control for the disk and for the network. The disk admission control algorithm simulates reading data blocks early and storing them in memory buffers at the server, achieving read-ahead and smoothing out peaks in the bandwidth requirements of individual streams. The network algorithm attempts to send data early and reserves bandwidth only for the time that it is required. The algorithms are sensitive to the variability in the bandwidth requirements, but can provide system utilization that approaches 100% of the disk bandwidth achievable for medium-length video streams in the test hardware environment.
Contents

Abstract
Contents
List of Tables
List of Figures
Acknowledgements
Dedication

1 Introduction
  1.1 Motivation
  1.2 Current Issues in CMFS Design
  1.3 Thesis Statement
  1.4 Research Contributions
  1.5 Thesis Organization and Summary of Results

2 System Model
  2.1 Overall System Design Model
  2.2 Design Objectives
  2.3 Admission Control
  2.4 System Architecture
  2.5 Data Delivery
  2.6 Design Limitations
  2.7 Stream Characterization

3 System Design and Implementation
  3.1 System Initialization and Configuration
  3.2 User Interface to CMFS Facilities
  3.3 Slot Size Implications
  3.4 Data Delivery and Flow Control
  3.5 Real-Time Writing
  3.6 Implementation
    3.6.1 Environment and Calibration
    3.6.2 Implementation Environment
    3.6.3 Transport Protocol Implementations
    3.6.4 Server Memory Requirements
  3.7 Example Client Application

4 Disk Admission Control
  4.1 Admission Control Algorithm Design
    4.1.1 Experimental Setup and System Measurements
    4.1.2 Simple Maximum
    4.1.3 Instantaneous Maximum
    4.1.4 Average
    4.1.5 vbrSim
    4.1.6 Optimal Algorithm
  4.2 Analytical Evaluation of Disk Performance
  4.3 Disk Admission Algorithm Execution Performance
  4.4 Performance Experiments
    4.4.1 Scenario Descriptions
    4.4.2 Simple Maximum
    4.4.3 Instantaneous Maximum
    4.4.4 Average
    4.4.5 vbrSim
    4.4.6 Summary
  4.5 Full-length Streams
  4.6 Analytical Extrapolation

5 Network Admission Control and Transmission
  5.1 Server Transmission Mechanism and Flow Control
  5.2 Network Admission Algorithm Design
  5.3 Network Bandwidth Schedule Creation
  5.4 Network Bandwidth Schedule Creation Performance
  5.5 Stream Variability Effects
  5.6 Network Slot Granularity
  5.7 Network Admission and Scalability

6 Related Work
  6.1 Complete System Models
  6.2 Synchronization of Media Streams
  6.3 User Interaction Models
  6.4 Scalability
  6.5 Real-Time Scheduling/Guarantees
  6.6 Encoding Format
  6.7 Data Layout Issues
  6.8 Disk Admission Algorithms
  6.9 Network Admission Control Algorithms
  6.10 Network Transmission
  6.11 Summary

7 Conclusions and Future Work
  7.1 Conclusions
  7.2 Future Work
    7.2.1 Long Streams
    7.2.2 Disk and Network Configurations
    7.2.3 Relaxing the value of minRead
    7.2.4 Variants of the Average Algorithm
    7.2.5 Reordering Requests

Bibliography

Appendix A CMFS Application Programmer's Interface
  A.1 Object Manipulation
    A.1.1 CMFS Storage API
    A.1.2 Moving and Deleting
  A.2 Stream Delivery and Connection Management
    A.2.1 Stream Control
  A.3 Meta Data Management
  A.4 Directory Service
  A.5 Miscellaneous
    A.5.1 Conversions and Stream Display Information

Appendix B Stream Scenarios
  B.1 Stream Groupings
  B.2 Scenario Selection
    B.2.1 Algorithm Comparison
    B.2.2 All Remaining Comparisons
List of Tables

2.1 Stream Characteristics
2.2 Stream Sources
3.1 CMFS Interface Procedures
4.1 Block Schedule Creation Timings (msec)
4.2 Admission Control Timings (msec)
4.3 Stream Groupings
4.4 Selection of Stream Scenarios
4.5 Short Streams - Admission Results - Staggered Arrivals
4.6 Long Streams - Admission Results - Staggered Arrivals
5.1 Network Bandwidth Characterization Summary
5.2 Network Admission Performance: Simultaneous Arrivals (% of Network)
5.3 Network Admission Performance: Staggered Arrivals (% of Network)
5.4 Network Admission Performance: Simultaneous Arrivals (% of Network)
5.5 Network Admission Performance: Staggered Arrivals (% of Network)
5.6 Network Bandwidth Schedule Summary for Different Slot Lengths
5.7 Network Admission Granularity: Simultaneous Arrivals (% of Network)
5.8 Network Admission Granularity: Staggered Arrivals (% of Network)
6.1 Research Summary
B.1 Disk Admission Control Stream Groupings
B.2 Network Admission Control Stream Groupings - MIXED
B.3 Network Admission Control Stream Groupings - LOW
B.4 Network Admission Control Stream Groupings - HIGH
B.5 Stream Selection into Scenarios (First Tests)
B.6 Stream Selection into Scenarios (Remaining Tests)
B.7 Selection of Extra Scenarios: First 22 Streams
B.8 Selection of Extra Scenarios: Last 22 Streams
List of Figures

2.1 Communication Model
2.2 Organization of System
3.1 Software Structure of Server Node
3.2 Prepare Timings
3.3 First Read Operation
4.1 Typical Stream Block Schedule
4.2 Stream Block Schedule (Entire Object)
4.3 Server Schedule During Admission
4.4 Server Block Schedule
4.5 Example of Server Schedule and Buffer Allocation Vectors
4.6 Admissions Control Algorithm
4.7 Modified Admissions Control Algorithm
4.8 Buffer Reclamation
4.9 Streams Accepted by Admission Algorithms
4.10 Simultaneous Requests - Invalid Scenario
4.11 Acceptance Rate - Simple Maximum
4.12 Acceptance Rate - Instantaneous Maximum
4.13 Acceptance Rate - Average
4.14 Acceptance Rate - vbrSim
4.15 Algorithm Performance Comparison - Simultaneous Arrivals
4.16 Algorithm Performance Comparison - 5 Second Stagger
4.17 Algorithm Performance Comparison - 10 Second Stagger
4.18 Stream Variability: Acceptance Rates for Simultaneous Arrivals
4.19 Stream Variability: Acceptance Rates for Stagger = 5 Seconds
4.20 Stream Variability: Acceptance Rates for Stagger = 10 Seconds
4.21 Observed Disk Performance: Stagger = 10 Seconds
4.22 Buffer Space Analysis Technique
4.23 Buffer Space Requirements: Simultaneous Arrivals
4.24 Buffer Space Requirements: Stagger = 5 Seconds
4.25 Buffer Space Requirements: Stagger = 10 Seconds
4.26 Short Stream Scenario
4.27 Looped Scenario
4.28 Short Stream Excerpt
5.1 Network Admissions Control Algorithm
5.2 Network Bandwidth Schedule - Original (Minimum Client Buffer Space)
5.3 Network Bandwidth Schedule - Smoothed (Minimum Client Buffer Space)
5.4 Simultaneous Arrivals: Network Admission
5.5 Staggered Arrivals: Network Admission
5.6 Network Slot Characterization
Acknowledgements

There are many people who deserve thanks and credit for contributing to this part of my life. Their roles are many and varied, and I'm sure to miss someone. I'd like to thank the remainder of the CMFS design team, most notably my supervisor, Dr. Gerald Neufeld, and Dr. Norman Hutchinson, who clarified both the big picture and the several small pictures of the design issues and the mechanisms for implementing these within the CMFS. Along with their own contributions in this area, the support of David Finkelstein and Roland Mechler with respect to the implementation of network-level and client-level code was invaluable. The technical management and assistance of Mark McCutcheon was also greatly appreciated. The personal encouragement, support and steadying influence of many friends has helped me persevere. In no particular order, I'd like to mention a number of them: Peggy Logan, Margaret Petrus, Christina Chan, Kristin Janz, Bill Thomlinson, Alistair Veitch, Peter Smith, Tarik Ono-Tesfaye, Elisa Baniassad, Christoph Kern, and Holly Mitchell. I'd also like to acknowledge the particular editing skills of Nigel Todd, Kristin Janz, Peter Lawrance, Sharon McLeish, and Sarah Walker. Finally, I'd like to thank my family, most notably my mother and my sister, for encouraging and believing in me through this project.

Dwight J. Makaroff
The University of British Columbia September 1998
In memory of my father, Albert James Makaroff.
Chapter 1
Introduction

1.1 Motivation

Within the last 100 years, communication technology has undergone several significant revolutions. Beginning with the telegraph and telephone, ideas expressed in text and audio could be transmitted almost instantaneously across vast distances. In subsequent years, radio and television technology has enabled instantaneous mass electronic communication. The mode of this communication has traditionally been broadcast, where one signal sent on a particular frequency is received by many listeners or viewers. The content of the data has therefore been controlled by organizations with the transmission capability. Although the technology necessary to communicate point-to-point using video has existed for several years, it has only been widely used in very specific, high-end systems, such as video conferencing for large corporations and government organizations. One of the difficulties related to point-to-point video communication has been the high bandwidths and other associated costs related to the transmission of analog signals. This produced a broadcast system where the receivers have very little direct control over what content they receive and when they receive it.

Another technology enabling the long-distance sharing of ideas expressed in audio and/or video is the recording of analog signals for later playback, which began with the phonograph and continued with tape and disk recording devices for both audio and video. These devices enable a user to control when s/he views or listens to this data, but the choice of content is still limited by the size of a video/audio library.

Concurrent with these developments in continuous media (defined as media which must be presented continuously over time) has been the use of computer systems for data communications and high-quality graphics display. High-speed networks and efficient compression techniques have permitted data to be transferred at extremely high rates over long distances. It has become feasible to transmit continuous media in a digital format between computer systems. As well, the processing power of today's computers and the development of specialized video encoder/decoder cards permits the conversion of digital audio/video data into pictures and sound for a human user in real-time. Software has been developed for presentation of continuous media by computer systems from local storage (magnetic disk or CD-ROM) in the form of encyclopedia CD-ROMs and high-resolution video games. The content is still limited by the size of an individual's or organization's media library.

These concepts were merged by the development of continuous media server technology. With a specialized server (known as a CMFS, for Continuous Media File Server), a service provider with large resources for storing a wide variety of media content can provide access to this data to clients in an individualized manner or on demand. This media content is stored on the server as presentation objects. A presentation object is a video clip, a feature-length movie, an audio sound-track, closed-captioned text or any time-sensitive data, i.e. any continuous media object that is stored in the CMFS as a separate entity. The terms audio object and video object, etc. are used to denote presentation objects of a specific media type.

The infrastructure for this technology has been provided by independent technologies, and much interest has been generated by the potential for a low-cost approach to demand-driven delivery of continuous media data. Two of these technologies previously mentioned are high-speed networks and video compression. Without compression, the bandwidth of television-quality video is enormous. Lossy compression techniques have been developed which can reduce the amount of data necessary to store a video by factors of up to 100 to 1, with acceptable levels of degradation in picture quality. Similar techniques have been used with audio. Even with compression, transmission of a single video object in real time consumes a significant portion of the capacity of first generation local area networks. Thus, for a system to be capable of supporting a large number of simultaneous users, bandwidth in excess of 100 Megabits per second (Mbps) is required by the server.

In this restricted environment, there are still significant technical problems to be overcome in the design and implementation of a CMFS. Even with Moore's law in effect for several decades, which brought the computer industry ever increasing processing speeds at steadily decreasing costs, handling continuous media by computers has remained a formidable task. The reason for this is that the basic design of computers and their operating systems has always favoured the efficient use of resources, which essentially has been achieved by dynamic allocation of CPU cycles, memory and input/output capabilities to processes. This does not mean that designers do not know how to design hard real-time systems, but they must resort to static allocation of critical resources if deadlines imposed on processes have to be met.

The applications of CMFS technology are wide ranging. Video-On-Demand [70, 75, 89] has been an application whose time may never come, but still remains interesting from a technical point of view. The ability to provide consumer-oriented, mainstream entertainment such as movies, concerts and sporting events retains the allure of convenience and flexibility for content providers and customers alike. Although previous experiments in Video-On-Demand trials have been spectacularly unsuccessful for a variety of reasons, research continues.

Perhaps a more promising form of on-demand continuous media technology is News-On-Demand, where short audio/video objects can be combined to form a "video newspaper" environment tailored to an individual's preferences. Specialized servers for distance education or corporate training can be deployed on a smaller scale with a narrower choice of content and a smaller user population. All of these technologies enable ideas, news, and entertainment to be made available to a community in ways that are more sensitive to the needs of that community.

This dissertation investigates the technical challenges of developing a CMFS in the context of a high-speed network. Such a server must be capable of delivering multiple high-quality audio and video objects simultaneously to multiple client applications in an efficient and flexible manner. The delivery of live video and/or audio objects is outside the scope of this dissertation, as there is no storing or retrieval required in such a system. Many of the concepts can be modified or extended to be applicable to live presentations.
1.2 Current Issues in CMFS Design

A CMFS is a computer system that stores and retrieves data that is intended to be presented to a client application continuously over time. The primary examples of this kind of data are audio and video, although any other type of time-dependent media can be included (closed-caption text, presentation slides, etc.). The presentation of these media types must be performed in real-time and with low latency for user satisfaction. Such a file server shares many characteristics with traditional file servers, but has added requirements which are unique to continuous media systems. All file servers need file access primitives to allow the creation of files and the storing of information, as well as for reading information out of these files. A CMFS also requires primitives to allow access via virtual-VCR functions (i.e. play, record, stop, fast-forward, slow-scan, and rewind), which are more natural methods for interacting with this type of media.

A file server is implemented in the context of a computer network with multiple "client" machines that request data from the server. Appropriate network protocols must be provided to ensure the proper delivery of request messages and raw data. In a continuous media environment, request messages must be delivered reliably and with very little delay. An efficient protocol which ensures correctness of data is required. In contrast, the delivery of continuous media data has different performance priorities than the request messages. Retrieval for playback/presentation to a user must be performed in real-time in the presence of deadlines. In this case missing or corrupted data can be tolerated more easily than large latencies. The transmission of continuous media for storage at the server has large bandwidth requirements and real-time performance is desirable, but the media data must be delivered reliably, so corruption or loss cannot be tolerated.

There have been several research and commercial applications of continuous media technology employed in recent years. These continuous media servers have used parallel computing technology to create systems which retrieve large amounts of video data in relatively straightforward ways. Such systems do not necessarily incorporate intelligent reservation mechanisms or attempt to maximize the performance and/or utilization level of scarce system resources such as disk and network bandwidth.

Two methods of delivering continuous media are possible: store-and-display and streaming. Store-and-display requires that the complete object be copied to the client machine before display, whereas streaming allows presentation to begin as soon as enough data has been received that the continuity of the presentation will not be disrupted due to a lack of data during the remainder of the presentation. The former method may be suitable when the media objects are relatively small or the network bandwidth is inadequate for streaming, but the latter is preferable because the delay between the time the user requests the object and the time the object can be presented is minimal (1 to 2 seconds). With store-and-display, this delay can be minutes for a stream which is only minutes in playback duration. Because data is transferred sequentially in real-time, the term "stream" is often used to identify continuous media objects.

The size of video data in uncompressed form is extremely large. At a resolution comparable to current North American (NTSC) television standards (640x480 colour pixels at 20 bits/pixel), a single image contains 6,144,000 bits. One second of full-motion video (30 frames per second) thus requires 184,320,000 bits. Computer networks are incapable of transmitting digital data at that rate for more than one object simultaneously. Additionally, the amount of storage for video objects of moderate length cannot be provided on conventional storage media. Thus, compression of the data is required.
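For concreteness, the uncompressed figures quoted above follow directly from the frame geometry and the frame rate:

\[ 640 \times 480 \ \textrm{pixels/frame} \times 20 \ \textrm{bits/pixel} = 6{,}144{,}000 \ \textrm{bits/frame} \]
\[ 6{,}144{,}000 \ \textrm{bits/frame} \times 30 \ \textrm{frames/second} = 184{,}320{,}000 \ \textrm{bits/second} \approx 184 \ \textrm{Mbps} \]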
One of two approaches is taken in all existing compression techniques: constant bit-rate (CBR) encoding with varying quality (when compared with the original signal), or variable bit-rate (VBR) encoding with consistent quality. CBR streams are easy for computer systems to handle: the bandwidth required from the disk system and the network can be simply calculated and allocated, but user satisfaction is compromised. The benefits of consistent-quality VBR streams are obvious from a human user's point of view, but variable bit-rates introduce complexities for the computer system that delivers and displays the data. The algorithms for disk retrieval and network data transmission must either make conservative estimates of resource usage, or be made explicitly aware of the time-varying bandwidth needs of each object.

The combination of the large size of continuous media objects and the capacity limitations of conventional disk technology restricts the amount of video data that can be stored on a single disk. Even with current compression technology, television-quality, full-motion video consumes approximately 3 to 7 Megabits/sec, depending on the complexity of the images. A 9 GByte disk can thus store between 10,500 and 24,000 seconds of video (less than 7 hours). The bandwidth that can be achieved off a single disk is less than 40 Mbps, which limits the number of independent users to between 6 and 12. From these numbers, it can be seen that a CMFS for a reasonably-sized user population with a moderate library of video objects must comprise multiple server components. A scalable server design permits the capacity of the server to increase by adding components. This is only possible when the model incorporates such component integration.

The most common mode of interaction with a CMFS is the transfer of very large volumes of video/audio data in a time-dependent continuous flow from the server to the client. In order to guarantee the continuity of the flow of data, the allocation of network resources such as bandwidth must be guaranteed. The availability of other resources at the server, such as processor cycles, RAM, and disk bandwidth, must also be guaranteed to properly service the client. The understanding of how these resources are guaranteed has been a fruitful area of research for many years.

The process of reserving bandwidth at the server is known as admission control. Bandwidth must be reserved for both the disk and the network to ensure that resources exist to deliver the data to the client application in time for presentation to the user. An estimate of the bandwidth requirements is required for every stream; these requests are summed in some manner and presented to the admission control algorithm. If a new request results in fewer resources required than available, the request is accepted; otherwise it is denied. The choice of admission control algorithms at the server greatly influences the system's ability to maximize performance as measured by the cumulative bandwidth of accepted streams. Thus, provision of efficient and accurate admission control algorithms is one of the most significant problems to solve in CMFS design. Methods that provide deterministic guarantees wherever possible result in conservative usage of server resources, but the results of this dissertation show that a system designed with deterministic guarantees and variability of bit-rates in mind is capable of utilizing the resources in a near-optimal fashion. Much research into admission control methods has been done in recent years, but most of the work provides results via simulation and has not been integrated into real systems.

A CMFS should also be able to accommodate media of many encoding types (including any defined in the future), be capable of running in different software and hardware environments, and be able to handle varying data rates of objects in order to efficiently use system resources. As a result, an abstract design is implemented in this dissertation which has limited dependence on particular hardware performance characteristics. In this way, the design can remain constant when increases in disk access speed and improvements in compression techniques are encountered in the future.

One of the characteristics of continuous media that helps alleviate the problems of variability is that some loss of data can be tolerated by an application without degrading the presentation enough to be detectable by a human user. One of the most compelling design decisions/dilemmas is how to limit the effect of data loss on the human user. The most conservative solution is to reserve bandwidth at both the disk interface and the network interface so that transient overload or other resource contention is not possible. Simplistic implementations of this policy under-utilize the system resources because they must account for peaks in the requirements of the data streams. Even this conservative reservation does not absolutely guarantee delivery to the client application, as there are various network hardware components (routers and switches) between the server and client machines which could drop or delay packets for reasons beyond the control of the server. The analysis of compressed VBR video transmission over lossy networks has been covered extensively [12, 26, 27, 34, 35, 59], and is beyond the scope of this dissertation.

One of the benefits of having CMFS technology is that the user is able to request the media content on an individualized basis. This should extend to the ability to request only portions of a stream, and to view in fast-motion or slow-motion in either forward or reverse. As well, the choice of audio accompaniment should be available so that multiple languages can be supported. This requires flexible user interface primitives that independently access portions of video and audio objects. One method of providing fast-motion is to skip the reading and transmission of some of the object to provide the illusion of fast motion, significantly increasing the speed of playback without significantly increasing the bandwidth usage (i.e. delivering every other video frame, or every other second of video footage).
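To make the general form of the acceptance test described above concrete, the sketch below sums the per-slot bandwidth requirements of the already-admitted streams and a new request against a fixed capacity. This is only an illustration under assumed names and a flat per-slot schedule representation; it is not the CMFS code, and the disk and network algorithms actually used are developed in Chapters 4 and 5.

/* Hypothetical sketch of a bandwidth-summing admission test.
 * Each stream is described by a per-slot bandwidth requirement
 * (bits per slot); 'capacity' is the guaranteed disk or network
 * bandwidth per slot.  Names are illustrative, not the CMFS API. */
#include <stddef.h>

typedef struct {
    size_t nslots;        /* number of slots in the stream's schedule */
    const long *need;     /* bits required in each slot of the schedule */
    size_t start_slot;    /* slot at which delivery would begin */
} StreamSchedule;

/* Returns 1 if adding 'new_req' to the already-admitted streams never
 * exceeds 'capacity' in any slot of the horizon, 0 otherwise. */
int admit(const StreamSchedule *admitted, size_t nadmitted,
          const StreamSchedule *new_req, long capacity, size_t horizon)
{
    for (size_t slot = 0; slot < horizon; slot++) {
        long total = 0;
        for (size_t i = 0; i <= nadmitted; i++) {
            const StreamSchedule *s = (i < nadmitted) ? &admitted[i] : new_req;
            if (slot >= s->start_slot && slot - s->start_slot < s->nslots)
                total += s->need[slot - s->start_slot];
        }
        if (total > capacity)
            return 0;     /* reject: requirements exceed the guarantee */
    }
    return 1;             /* accept: the reservation fits in every slot */
}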
1.3 Thesis Statement

The development of continuous media file server technology has made straightforward use of high-performance computer components and sophisticated methods for storage and retrieval of media data possible. These techniques have most often been implemented in isolation from each other, in the sense that simple procedures have been used in systems with large, parallel delivery mechanisms, whereas the elegant techniques for maximizing performance of particular components have been evaluated primarily via simulation only. A complete model and implementation of a CMFS can greatly improve the understanding of realistic performance and design issues in the use of continuous media in a distributed environment. Therefore, the thesis of this dissertation is:

An efficient, scalable, real-time Continuous Media File Server (CMFS) can be implemented that is based on an abstract model of the disk and network performance characteristics, which explicitly considers the variable bit-rate nature of compressed continuous media, accommodates heterogeneous data format syntax and permits arbitrary client synchronization of multiple streams, using heterogeneous hardware components and a parallel, distributed software architecture.

The model upon which the system is based includes all aspects of system design, from a flexible client interface to admission control algorithms and resource management of both the disk and the network. An efficient server provides admissions decisions with a minimal amount of overhead (in terms of execution time for the decision) and maximizes the use of physical resources, such as disk bandwidth, server and client buffer space, and network bandwidth, subject to the constraint that promised performance is guaranteed. A scalable system has the capability to add components and increase performance in a linear fashion as measured by the number of simultaneous users and the cumulative data transfer rate. A system is real-time if it can guarantee delivery of continuous media to a client workstation such that each client can maintain the continuity of presentation throughout the entire playback duration.

An abstract model of the performance of the primary system resources (disk and network bandwidth) is one that is free from detailed dependence on hardware characteristics. For the disk, this means precise data layout and knowledge of the mechanical characteristics are not incorporated into the model, but are subsumed by summary performance measures. Likewise, for the network, details of the transport layer are hidden by a simple performance metric.

A heterogeneous system permits differences in data format, as well as in the software and hardware platforms upon which the system can be installed. Variable bit-rate data (data rate heterogeneity) is accommodated by acknowledging that the sizes of presentation units of continuous media (video frames/audio samples) may vary considerably both within a short amount of time and on larger time scales, and by explicitly incorporating this variability into the resource allocation (admission) and resource utilization (retrieval and transmission) strategies. Various standards exist for encoding digital audio and video. A general purpose server is capable of storing different data formats (data format heterogeneity) on the same storage devices. Ideally, a server would be composed of commercially available workstation computer hardware to reduce the incremental costs associated with this new functionality. It would also be capable of being configured and installed on various architectures (heterogeneous hardware) with minimal changes to performance parameter settings. The continuous media data sent across the network can be displayed by client systems of equally varied flavors, subject to the client application's ability to perform the appropriate decoding.

A flexible CMFS allows a client application to retrieve streams of continuous media (possibly from different locations) in a manner that easily permits synchronous playback of any combination of video, audio, continuous text (i.e. closed captioning), or other continuous media. In particular, the choice of differing qualities/encoding formats of video, and of different language audio or text streams, is provided to the client application and provided transparently by the server. This is a substantial enhancement to the functionality of a video server in that it permits access to the system resources in a less constrained manner. Audio and video streams may be requested independently, possibly from different servers that could be in different locations, for the presentation of a single multi-media document.
1.4 Research Contributions

Previous research has considered many of the issues involved in the creation of a CMFS in isolation. Very few systems have been built that consider the issues of scalability, heterogeneity, and the variable bit-rate nature of the data itself. The majority of the analysis of variable bit-rate data has been in the context of trace-driven or statistical model-driven simulations that have not been integrated into a working system. The development of a CMFS and its associated admission control schemes provides the three main contributions of this dissertation:

1. A comprehensive CMFS model is developed and implemented on several hardware architectures, verifying the feasibility of the design objectives. To aid in achieving scalability, the system is designed in a distributed fashion, allowing the components to reside on different computers and thereby to achieve parallel execution wherever possible.

2. A new disk admission strategy is designed and analyzed. This strategy examines the detailed bit-rate profile of a continuous media object and the available server disk bandwidth and buffer space to determine if sufficient resources exist to deliver the object from the server in a guaranteed fashion. This algorithm simulates the future behaviour of the disk, accounting for variability in the requirements of the streams, and is thus named the vbrSim algorithm (a schematic sketch of the read-ahead idea is given after this list).

3. A network bandwidth characterization scheme is developed and integrated with a network admission algorithm. The network admission algorithm is a relatively straightforward extension of algorithms presented in other research, but which have not been sufficiently integrated into a comprehensive CMFS.
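The read-ahead idea behind vbrSim can be pictured with the schematic check below. This is only a sketch: the flat per-slot demand array, block-sized units, and the single guaranteed per-slot read rate (the quantity referred to as minRead later in the dissertation) are assumptions made for illustration, and the actual algorithm's server schedule and buffer bookkeeping are developed in Chapter 4. Each simulated slot, the disk is assumed to read at least its guaranteed rate, blocks read early are held in server buffers, and a request is rejected only if some slot's requirement could still not be met.

/* Schematic sketch of a "simulate the disk" read-ahead admission check.
 * demand[t]  : blocks that must be delivered to clients in slot t,
 *              summed over all admitted streams plus the new request.
 * min_read   : blocks the disk is guaranteed to read in one slot.
 * buf_blocks : server buffer capacity, in blocks.
 * Names, units and the flat 'demand' representation are assumptions for
 * illustration; the real vbrSim bookkeeping is described in Chapter 4. */
#include <stddef.h>

int vbrsim_like_check(const long *demand, size_t nslots,
                      long min_read, long buf_blocks)
{
    long buffered = 0;                 /* blocks read ahead of their deadline */
    long remaining = 0;                /* total blocks still to be read */
    for (size_t t = 0; t < nslots; t++)
        remaining += demand[t];

    for (size_t t = 0; t < nslots; t++) {
        long read = min_read;          /* guaranteed read this slot */
        if (read > remaining)
            read = remaining;          /* nothing left to read ahead */
        if (read > buf_blocks - buffered + demand[t])
            read = buf_blocks - buffered + demand[t]; /* respect buffer space */
        buffered += read;
        remaining -= read;
        if (buffered < demand[t])
            return 0;                  /* a deadline would be missed: reject */
        buffered -= demand[t];         /* blocks consumed (sent) in slot t */
    }
    return 1;                          /* all deadlines met: accept */
}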
1.5 Thesis Organization and Summary of Results

The remainder of this dissertation is organized as follows. The system model and motivation for the specific features of the CMFS are given in Chapter 2. This includes a description of the scope of the intended application domain and the type of continuous media streams that comprise the testing environment.

The design details of the server components are described in Chapter 3. The user interface is also presented, along with the manner in which the model of interaction influenced the design of the system. The major functional components of data delivery (retrieval), server flow control, and storage of data are discussed in order. Finally, the implementation environment is introduced, which includes the hardware and software context for the testing and validation of the unique VBR-sensitive algorithms of the CMFS.

Disk admission control is the focus of Chapter 4. Several possible approaches to disk admission are presented and compared with the vbrSim algorithm. The vbrSim algorithm is shown to be both efficient to execute and superior to the alternatives in terms of admission performance. This is done both analytically and via performance experiments. Analytically, the vbrSim algorithm approaches the performance of an Optimal algorithm as the variability in disk performance decreases. A large number of performance experiments explore the effect of differing request patterns for different types of streams in terms of the number of simultaneous users and the amount of sustainable bandwidth that a single-disk server can support. In the performance experiments, a set of requests presented to the server as a group is defined to be a scenario. The results show that the disk system can admit scenarios with larger cumulative bandwidth when the streams have lower variability than when the streams have higher variability. As well, the introduction of stagger into the arrival pattern permits contiguous reading, which allows the admission control to accept more simultaneous streams. Groups of requests which utilize nearly 100% of the available disk bandwidth can be accepted for the short to medium length video objects considered. The addition of client buffer space and increased stagger between requests allows a marginal increase in supportable request bandwidth, but only for the shorter streams (i.e. less than 3 minutes in length).

The network admission strategy is discussed in Chapter 5. The first aspect of network admission control is the development of an appropriate network bandwidth characterization. A simple smoothing technique which takes advantage of available client buffer space is able to reduce the overall bandwidth reservation required as well as the variability in the bandwidth profile. This characterization is then integrated with the network admission algorithm. The network admission algorithm is able to accept requests that use up to 90% of the network interface bandwidth. Performance tests also show that simultaneous requests for streams of low variability can successfully use more network resources than high variability streams. A staggered arrival pattern, however, improves admission performance for the high variability streams more than for low variability streams. Network admission uses a larger time granularity than disk admission. A comparison of network slot sizes shows that relatively small network slots provide the best admission performance for the type of video streams being studied.

A survey and evaluation of related work in distributed continuous media servers is given in Chapter 6. The contributions of the dissertation are summarized in Chapter 7, along with discussion of directions for future work.
Chapter 2
System Model

Continuous media file servers must operate in the context of a network environment that is capable of sustaining the high bit-rates associated with high-resolution video. At one side of the network are the server components, which contain presentation objects, while at the other side are the client workstations (or set-top boxes) which request delivery of presentation objects. The first major contribution of this dissertation is the development of an end-to-end system model that incorporates all aspects of design, from the client application's interaction with the server (and the human user's interaction) to the server composition and connection. This model is then verified by the design and construction of a CMFS that conforms to the model.

The purpose of this chapter is to define the scope of the system design and describe the guidelines behind the design of the system. This model specifically includes the application programmer's interface (API) available to client applications and the desired levels of abstraction provided for storage/retrieval of the media data. The details of the data delivery in the network itself, that is, between the server's network interface and the client's network interface, are outside the scope of this dissertation except where they influence the manner in which data is sent from the server.
2.1 Overall System Design Model

The highest level of system description is shown in Figure 2.1. The client first communicates with a database server (1) to obtain information about the presentation objects stored in the CMFS. It is not strictly necessary to have a separate database server, as there is some amount of database functionality in the CMFS itself. This is provided by an attribute facility for descriptions and annotations of presentation objects. Some attributes are necessary in order to retrieve the object, such as data rates, frame sizes, and disk locations, whereas others provide cataloging and description functions only, such as date of creation, service provider (CBC, Global, Paramount, etc.), or genre (drama, news story, or music video). Each presentation object has a unique object identifier (hereafter called a UOI) that client applications require in order to access the object or any of its metadata. The UOI is unique over space and time.
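As an illustration only, the per-object information implied by this description might be collected in a record like the following. The field names, sizes, and types are assumptions for the sketch, not the CMFS metadata format; the actual attribute facility and API are listed in Appendix A.

/* Illustrative sketch of per-object metadata implied by the discussion:
 * a globally unique object identifier (UOI) plus attributes used for
 * retrieval (rates, frame sizes) and for cataloging (provider, genre).
 * Field names and types are assumptions, not the CMFS on-disk format. */
typedef struct {
    unsigned char uoi[16];      /* unique over space and time; encoding not specified here */
    char          title[128];   /* e.g. "CBC News 01/01/1998" */
    char          provider[32]; /* e.g. "CBC", "Global", "Paramount" */
    char          genre[32];    /* e.g. "news story", "music video" */
    long          avg_bitrate;  /* bits per second, averaged over the object */
    long          peak_bitrate; /* largest per-slot requirement, in bits per second */
    long          nunits;       /* number of presentation units in the object */
    /* the per-slot bandwidth profile and disk locations would follow */
} PresentationObjectInfo;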
[Figure 2.1 (Communication Model): the client communicates with a database server (1) and with the Continuous Media File Server via a connection request (2), a real-time data connection (3), and out-of-band VCR requests (4).]

When a client application wishes to initiate the presentation of an object, it sends a request to the server (2) to establish a real-time data connection between the server and the client (3). If the connection is successfully established, separate out-of-band requests (4) are given by the client for the server to perform virtual VCR functions. Interactions (2) and (4) use a reliable request-response protocol,
whereas the delivery of the continuous media data (3) is unidirectional from the server to the client with no reliability guarantees. In continuous media applications, the timeliness of data retrieval is more important than precise fidelity [19], and methods of transmission requiring retransmission of lost or corrupt data introduce unacceptable worst-case latencies. Storing continuous media has the same high-bandwidth requirements as retrieval, but correctness of the data must be preserved in this situation, so a reliable connection is established in the reverse direction from interaction (3). The storage of data in such a file server can be concurrent with retrieval, but it is unreasonable in most cases to have real-time guarantees, since even if the pattern of data traffic to be sent can be completely specified, a server cannot require the client application to push the continuous media data to the server in real-time. The data can be sent over a real-time connection as long as it is the client's responsibility to fill the channel with data. A modified version of TCP could be utilized that involves selective retransmission of lost or corrupted packets. Selective acknowledgments or negative acknowledgments from the server are necessary in such a protocol to let the client know when to release data that has been successfully stored at the server. Selective acknowledgment is absolutely necessary for "live" video, as client resources must be reserved until the data is safely on disk at the server. For stored media at the client, it may be sufficient to simply re-read the blocks associated with the missing data. On a highly-reliable network, most of the data is transferred in real-time.

Most continuous media servers divide time into intervals called slots or rounds, during which, for each active stream, sufficient blocks of data are read off the disk and/or transmitted across the network to allow continuous playback at the client application. A reasonable length for such a slot is 500 milliseconds, as it provides a fine level of granularity while limiting the amount of overhead required for the operation of the server. The slot is the fundamental unit of time which drives the software design of the CMFS. It is a unit of time which can be configured differently for different servers, if desired. Section 3.3 analyzes some of the major implications on system resource usage that are influenced by the choice of slot size.

Continuous media clients can tolerate some loss of data and still provide an acceptable presentation to the user. In some cases, this loss is caused by corruption of the network packets themselves, while in other cases, overflows at the network interfaces of either the client or the server, or at intermediate points in the network between the server and the client, result in dropped packets. The design of the CMFS must have appropriate methods to deal with disk and network overload. There are a number of possible approaches, ranging from ignoring potential overload to providing deterministic guarantees that the server will always be capable of reading the required data from the disk system and will never overflow the network. Enforcing the actual delivery of the data at the client is beyond the scope of server design because it involves network components over which the server has no control.

The server can determine at what points in time the requested bandwidth exceeds the capacity guarantee, since the bandwidth requirements of each object are known when the user requests real-time delivery. If the predicted overloads are short in duration, a reasonable server policy could be to accept the series of requests and not send all the data during overflow conditions. The loss would appear indistinguishable from a loss caused by network problems. The overload may be in the disk system or the network system, or both. Potential problems in the disk system could be ignored if it is expected that the disk is likely to achieve bandwidth in excess of the guarantee. In this case, the disk can read early and store data in server buffers, thus reducing and perhaps eliminating the amount of overload time. This is a vague prediction, since neither the number of buffers needed nor the future bandwidth can be accurately estimated. If the problem is in the network, then both the disk and the network systems must adapt. The server does not send all the data during overflow rounds. The disk does not need to read data which cannot be sent, so the reading schedule should also be altered when overflow does occur. It is a complicated process for the server to determine which blocks of data should not be retrieved and/or not delivered so as to cause as little disruption as possible, but progress is being made in some research efforts [82, 87]. In particular, knowledge of the encoding format could allow a server to neglect inter-coded frames (i.e. B and P frames for an MPEG-encoded object) while maintaining priority on intra-coded frames (I frames). Unfortunately, if this knowledge is required by the server to deliver only some data, it limits the flexibility of the server in its ability to be useful for many heterogeneous encoding formats simultaneously. It is possible to achieve this adaptability by storing the different types of frames as separate streams and combining them at the client. When loss occurs due to network overload, the client could simply drop the request for the least important of the streams (i.e. the B-frames).

This unpredictability and dependence on encoding format can be avoided by a deterministic guarantee that all the data will be sent from the server such that it will arrive at the client by the time it is due. Bandwidth reservation prevents overload, but also results in conservative resource utilization, since it is based on worst-case estimates. This is the approach taken in this dissertation. The server incorporates read-ahead and send-ahead policies to increase the effective resource utilization without causing the system to become overloaded.
2.2 Design Objectives

In order to achieve the design goals of scalability, heterogeneity, and an abstract performance model of hardware resources in a natural and efficient manner, a distributed system model was chosen. Implementing the logical components of the server in a manner that allows them to be executed on separate hardware platforms enables scalability and enhances the opportunities to support heterogeneity efficiently. The disk devices and network interfaces have bandwidth limitations that prevent centralized servers with a single network interface from supporting more than a few dozen high bandwidth streams. It should be possible to store different media types or encoding formats on different (possibly heterogeneous) server components, potentially in different geographic locations. The user interaction model to support synchronization of multiple streams is complicated somewhat by having a distributed model, but has added flexibility. A distributed design allows network latencies between servers and clients to be reduced by strategic placement of the server components within the network. For instance, it is possible to consider a server for audio data in French located nearer the French-speaking population, with the Chinese audio server in another location. The video servers could be located in a central location to serve users of all languages. The specific design decisions made to achieve each of the objectives are described in the remainder of this section.
Heterogeneity. Several types of heterogeneity in uence the system design. The
rst aspect of heterogeneity is the data encoding format. Compression is necessary both to reduce the data rate required for real-time video data transmission and to enable the storage of reasonably long streams (1-hour in duration) on conventional disks. Several standards have been proposed (MPEG [65, 38], MJPEG [1], H.261 [92], and H.263 [39] to name a few). It is likely that the continuing work on data compression will lead to the development of additional standards with better compression speeds and higher quality (i.e. less lossy) results. To accommodate data format heterogeneity, the CMFS ignores any details of data encoding with respect to data placement, retrieval, or transmission. To achieve this independence, the CMFS uses two fundamental units of storage: the presentation unit and the sequence. A presentation unit could be a video frame, a second's worth of audio samples, a PostScript version of a slide, closed captioned text, or any other time-sensitive media. Presentation units are grouped 20
into units of storage called a sequence. The number of presentation units per sequence is de ned by the client application that stores the data to the CMFS and can vary within a stream according to the anticipated needs of client applications that present data in real-time. This implies a co-operation between display clients and storage clients. All encoding and decoding is performed at the client side, allowing the server to be used by many dierent types of clients. It is important to note that each monomedia object may be encoded and stored independently, or where the encoding permits, as a system stream. System streams have audio, video, and text as a single object, which simpli es admission control at the server as well as display at the client. The capability of separating the media types allows the server to retrieve each stream independently and for the client to combine them in various ways. In particular, various audio objects could be retrieved with the same video object where the audio is in dierent languages or from dierent providers (as in dierent news services from dierent networks). Some video encoding algorithms are capable of producing objects which can be decoded at diering resolutions [18, 22, 25]. The resulting encoded object consists of a \base stream" and one or more \enhancement streams". The CMFS treats each resolution as an independent presentation object and does not in uence the manner in which client applications can retrieve them or store them. A client with lower network bandwidth and decoding capabilities could provide a lower-quality display by requesting fewer enhancement streams. In order to provide constant quality video and audio, the compression techniques produce media streams which have variable bit-rates. The resource requirements of the streams exhibit both long-term and short-term variability [37, 96] and this variability adds to the heterogeneity of the system. Many components of the CMFS give explicit consideration to the VBR nature of the data. It has been noted that VBR streams tend to exhibit the burstiness characteristics of regular le trac [54]. This burstiness is precisely predictable, given knowledge of the size of each 21
presentation unit in every stream. The method of disk block allocation to stream data aects the heterogeneity that can be accommodated in the system. Previous work has focussed on careful disk layout [73], often at the expense of heterogeneity in data encoding format. When VBR data streams in varying encoding formats are placed on the same server, any attempt at using a layout policy optimized for a single data format has questionable validity. This becomes a particular problem for data encoded in MPEG format. If a disk layout method optimized the allocation based on an MPEG group of pictures (GOP) as a unit of storage, this may provide poor performance for video objects encoded in MJPEG, which has no concept of GOPs. For example, video streams could be stored on a system that is con gured as a redundant array of inexpensive disks (RAID). If 9 frames comprise a GOP in a 30 frame per second video, one choice could be to allocate blocks based on 9 frames, with consecutive GOPs on dierent stripes of a RAID system. Thus, the stripe size would correspond to 9-frames. An MJPEG video of the same average bit-rate would be constrained to use the same stripe size. A server may choose to implement fast motion based on retrieving alternating GOPs. While this makes sense for the MPEG objects, there is no logical reason for grouping MJPEG video in 9-frame units. The allocation design for MPEG unnecessarily restricts the way in which MJPEG video can be accessed. Thus, the CMFS makes no layout decisions based on sequential or striped access for a single stream or a group of streams. Some systems use striping to achieve higher disk bandwidth [15, 86, 56, 70]. Such a system must choose a stripe size appropriate for the common disk access patterns. If the streams have similar average data rates, the stripe size can be chosen so that the average number of seeks per stream per slot is very close to 1. In this case, the greatest amount of speed up is achieved. Even with variable bit-rate data, a system with sucient buering can provide good performance by reading in a striped fashion. If the average data rates vary widely, the desired stripe 22
size for the high-bandwidth streams will conflict with the desired stripe size for the low-bandwidth streams.
Scalability. Disk bandwidth performance limits the number of streams that can
be simultaneously retrieved from a system with a single disk [76, 87, 89]. Consider moderate quality video streams with an average bit-rate of between 3 and 5 Mbps and a disk device capable of transferring data at 40 Mbps. No more than a dozen moderate quality video streams can be delivered simultaneously from a single disk. Depending on the intended user community, a CMFS should be capable of supporting hundreds or even thousands of simultaneous users [55, 89]. This requires a server with tens to hundreds of disks. For distance education or corporate training programs, smaller systems may be viable because the variety of content provided and the user populations are constrained. A CMFS intended to provide News-On-Demand or Video-On-Demand services to a large user community must be capable of significantly higher bandwidth. Simulations in [20] and [89] are but two of the studies which show the relative performance of differently configured systems with hundreds of disks and thousands of users, but give no support for their claim that such systems can be built. CMFS components are independent and can be executed on separate CPU and disk platforms to provide scalability. No particular hardware resource can be an ultimate performance bottleneck in a point-to-point network environment. As long as high-speed real-time connections can be established between the server and clients, the size of the system can be incrementally expanded. A shared network topology cannot provide this scalability, since the physical medium has a maximum bandwidth which is divided among the connected components according to their data transfer patterns. A small amount of administrative state is necessary to coordinate the various server components. Thus, more users can be supported by adding server nodes until the cumulative disk bandwidth exceeds the network's capacity to route continuous media traffic to client destinations. The storage capacity
for administrative state is proportional to the total number of objects stored in the CMFS. Metadata storage is also required, which is proportional to the total display time of the objects, but this is only a small portion of the total data storage needs. Since bandwidth limitations of a single disk restrict the number of simultaneous retrievals to a small number (i.e. less than 10 for moderate bit-rate video streams), a method of increasing the disk bandwidth from the server is necessary, especially for popular video objects. There are at least two ways of providing this increase in bandwidth: striping and replication. The method chosen in the CMFS for bandwidth enhancement for individual objects is replication, because it is completely scalable. Multiple copies of objects can be stored on a single server, or on different servers that are configured to share information. Thus, each instance of an object is treated as a separate presentation object for the purposes of bandwidth allocation. It serves no purpose to instantiate more than one copy of an object on the same disk, so replication is done on different disks, thus providing a minimal level of fault-tolerance as well. Although fault-tolerance is not a major goal of this dissertation, the extra reliability is an added benefit from a system with replication. When a client requests an object, the copy that is most appropriate is selected. Striping is used in several other systems [9, 40, 70], but in these cases, allowing additional users to access an individual stream involves a complicated scheduling process whereby delivery of a new stream request is delayed until it can fit in with the current retrieval pattern. Since multiple copies of an object may exist, a method of distinguishing between the copies is necessary while at the same time maintaining a record that the copies contain the same data. This is done by providing two types of unique identifiers for each object: a high-level UOI (HLUOI) and a low-level UOI (LLUOI). An HLUOI identifies the content of a presentation object (i.e. CBC News 01/01/1998, or Star Wars MPEG-2) and is independent of the object's location. An LLUOI refers to an instance of an object and identifies the location of raw data for the
object. Within an individual server, it may be advantageous to perform replication for the popular objects, as well as migration to perform load balancing activities. Migration facilities are discussed further in Chapter 3. When replication is used in a system, a method of choosing a particular copy is needed. If clients keep track of the individual copies, they may directly request the LLUOI of an object. Alternatively, this selection can be accomplished by a directory service, which can be implemented in several manners, including manual transcription. A directory service is also required when multiple self-contained servers are located in the same network environment and have been configured to be able to share their data. One implementation of a directory service is a computer system known as a Location Server [46]. A Location Server contains metadata about presentation objects. It operates by providing the appropriate mapping between an object's HLUOI and the associated LLUOIs. The Location Server can choose a copy for the client, or provide a list of copies and locations that allows a client to choose which copy to retrieve, based on factors that may include geographic distance and current load at the servers. The details of location service functionality which has been integrated into the CMFS are described in Chapter 3 as they relate to the design of the CMFS. The full details are given in Kraemer [46] and are outside the scope of this dissertation.
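To make the HLUOI/LLUOI distinction concrete, the following sketch shows the kind of mapping a directory service might maintain. The structure and function names are illustrative assumptions, not the Location Server interface of Kraemer [46]; the sketch simply associates one HLUOI with the set of LLUOIs for its replicas and picks the least-loaded copy.

/* Illustrative sketch only; the real Location Server interface is described in [46]. */
#include <stddef.h>
#include <string.h>

#define MAX_COPIES 8

struct copy_info {            /* one stored instance of an object       */
    char lluoi[64];           /* low-level UOI: identifies the raw data */
    char server_node[64];     /* node (or server) holding this copy     */
    int  current_load;        /* load metric reported by that node      */
};

struct object_entry {         /* one presentation object                */
    char hluoi[64];           /* high-level UOI: identifies the content */
    int  ncopies;
    struct copy_info copies[MAX_COPIES];
};

/* Return the least-loaded copy of the object named by hluoi, or NULL. */
const struct copy_info *
choose_copy(const struct object_entry *table, size_t n, const char *hluoi)
{
    const struct copy_info *best = NULL;
    for (size_t i = 0; i < n; i++) {
        if (strcmp(table[i].hluoi, hluoi) != 0)
            continue;
        for (int c = 0; c < table[i].ncopies; c++)
            if (best == NULL || table[i].copies[c].current_load < best->current_load)
                best = &table[i].copies[c];
    }
    return best;
}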
Abstract Disk/Network Performance Model. In the CMFS, the only significant parameter in the disk subsystem is the number of I/O operations guaranteed per slot (hereafter referred to as minRead). This "number of reads per slot" value is constant for each disk configuration, and is determined by running a calibration program on a logical disk device (depending on the implementation environment, this could be a UNIX file, a raw disk, or a set of RAID disks) to determine the largest number of blocks that can be guaranteed to be read off the disk given the worst possible selection of disk
block locations from which to read. This is done by uniformly spacing the blocks across the disk, thus maximizing the seek times (assuming a SCAN algorithm [76]). This value most accurately reflects the actual capacity of the server since it includes all transfer delays (through I/O bus to memory) as well as server software overhead. With respect to network transmission, a fixed maximum bandwidth exists. The number of blocks that can be transmitted during a slot is defined to be maxXmit, because it represents the maximum amount of data that can be sent out from the server across its interfaces in a specified amount of time. The value of maxXmit can be calculated in the same manner as the value of minRead: by running a calibration test and observing the guaranteed number of packets that can be transmitted. This value depends on the network connection bandwidth as well as the packet size chosen. As far as the server is concerned, this is an upper bound. No more data can be sent in any slot, regardless of the location of the data or other factors.
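The calibration idea can be illustrated with a small sketch. The block size, slot length, trial count, and use of pread below are assumptions made for illustration; this is not the calibration program used for the CMFS. It times uniformly spaced block reads (the worst case under a SCAN schedule) and derives the number of reads that can be guaranteed in one slot from the slowest trial observed.

/* A minimal calibration sketch, assuming 64 KB blocks and 500 ms slots. */
#define _XOPEN_SOURCE 500
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/time.h>

#define BLOCK_SIZE (64 * 1024)
#define SLOT_MS    500.0
#define NBLOCKS    256            /* blocks read per trial */

static double now_ms(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1000.0 + tv.tv_usec / 1000.0;
}

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <device-or-file> <size-in-blocks>\n", argv[0]);
        return 1;
    }
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }
    long total_blocks = atol(argv[2]);
    char *buf = malloc(BLOCK_SIZE);
    double worst_ms_per_block = 0.0;

    for (int trial = 0; trial < 5; trial++) {
        double start = now_ms();
        for (int i = 0; i < NBLOCKS; i++) {
            /* uniformly spaced blocks: maximal seek distance under SCAN */
            off_t blk = (off_t)i * (total_blocks / NBLOCKS);
            pread(fd, buf, BLOCK_SIZE, blk * (off_t)BLOCK_SIZE);
        }
        double per_block = (now_ms() - start) / NBLOCKS;
        if (per_block > worst_ms_per_block)
            worst_ms_per_block = per_block;
    }
    /* minRead: blocks guaranteed readable in one slot, from the worst case seen */
    printf("minRead = %d blocks/slot\n", (int)(SLOT_MS / worst_ms_per_block));
    free(buf);
    close(fd);
    return 0;
}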
Synchronization Support. The CMFS is designed to accommodate flexible user
access. One aspect of this is availability of the server. For many applications, it is undesirable to have the server operate in two exclusive modes: playback (retrieve) and record (store). A potential application for the CMFS is News-On-Demand, which requires frequent updating of relevant stories. It must therefore be possible to store information concurrently with playback. The CMFS provides primitive operations to store objects during the normal operation of the server. The CMFS is capable of storing mono-media streams independently to facilitate the most flexible method of client access. The simplest mechanism for the client is to store all media for a presentation as one object. The client application would then need only to send appropriate data to the peripheral devices as it is received without the need for synchronization mechanisms. In many cases it is desirable to decouple the audio from the video. The interface provided allows clients to synchronize the multiple objects in a way that does not require server knowledge of that relationship. It also eliminates correlated
data loss resulting from combined audio/video streams. Another benefit of having separate streams at the server for synchronization at the client is the support of scalable video, which enhances the support for heterogeneity. An example which highlights the synchronization support provided by the API is the viewing of lectures in a distance learning environment. Several independent continuous media objects could be created from this type of event. They include: video of the speaker, text of the presentation slides, postscript versions of the slides, audio of the speaker, multiple language translation of the audio, translation of the text, video of the audience, and even audio of the audience during question periods. These can all be obtained independently and stored in the CMFS with appropriate timing information for presentation to the user. A client application may combine these objects to view the lecture with two video windows (each of which may have more than one substream associated with the video), a text window, and an audio device (with the ability to switch between audio streams at any moment). Either of the video windows could be in "full-screen" resolution with the other in a "thumbnail version". The design of the CMFS allows for such a complex viewing scenario to be supported in a straightforward manner. In Wong et al. [90], such a scenario is presented which utilizes the CMFS as one component of the technology in that system. A detailed example of client synchronization is given in Section 3.7.
2.3 Admission Control
To ensure that the server has the necessary resources to deliver a requested presentation object, it performs admission control. This process examines the bandwidth required by the object and compares it to the system's remaining capacity. The disk admission control algorithm uses minRead as its only system-dependent parameter. It is therefore independent of the mechanism used to lay out blocks on the disk or any other disk management technique such as striping. The approach does
not preclude using disk layout algorithms or striping as an independent method of increasing the bandwidth of the disk. A highly optimized disk management system that uses such techniques may have a higher value for minRead. In that case a higher level of service can be guaranteed which should be able to accept and schedule more simultaneous clients. The admissions scheme itself, however, is not affected by the details of these optimizations. The disk admission control algorithm addresses the allocation of the disk resource among client requests. At least two other resources have the potential to be in short supply at the server: processor cycles and network bandwidth. CPU scheduling for real-time applications, including continuous media, has been explored by many other researchers ([93] as one example), and their results are generally applicable to either dedicated continuous media servers or general workstation operating system environments. Since the rate at which processor speeds are increasing is faster than the rate at which disk access times are decreasing, a system can easily be configured so that the system is not CPU bound. The major tasks of the CMFS that are performed at the server node are the admission control calculations and the protocol processing for the network packets. The performance analysis of Chapters 4 and 5 shows under what specific circumstances the CPU has extensive work to do. In general, for medium length streams, the CPU requirements for admission control are moderate, if requests for delivery are relatively infrequent. With respect to protocol processing, it is most often the network card that restricts the processing, but the CPU cycles required are proportional to the number of packets and total bandwidth transmitted. With regard to the network bandwidth, an admission control algorithm is also required to ensure that the accepted connections do not over-utilize the capacity of the network. The situation is slightly different from the disk bandwidth situation in that the receiving end of the data transmission is beyond the control of the server and so the server can only ensure that it limits the amount of data sent from the
network interface. It has no direct influence on the rate at which the client can receive the data. Server memory is also a limited resource which may be the cause of a performance bottleneck. The requirements of server memory for a particular scenario of streams is directly related to the disk and network bandwidth. If the disk system is permitted to read ahead, server buffers can be used to store the data that is read earlier than required. Any admission control algorithm which incorporates the ability to read data early, thus reducing the bandwidth needed to service the data for the remaining streams, can take advantage of extra server memory for disk buffers. This is because more streams can be accepted as the total future disk requirements at admission time are less than they would be in a system where no read-ahead was achieved. This can be further enhanced if client buffer space and network bandwidth permit data to be sent early as well. The client buffer space is not a resource that is directly under the server's control, so it is not given a great deal of consideration. It will be shown later that the server buffer space is indirectly managed and accounted for by the disk admission control algorithm used in the CMFS. When a client stores an object, a presentation unit vector (or playout vector) is created which contains the number of bytes for each presentation unit of the stream. Because the stream may be VBR-encoded, the values for the vector may vary considerably. This playout vector is used to provide a bandwidth characterization of the stream, either in fine detail or in summary form. The bandwidth required does not necessarily match the playout vector, since different data is transferred if an alternate speed of delivery is requested or sequences are to be skipped. The exact bandwidth schedule (known as the stream schedule) is calculated on demand when the request for playback is received.
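A minimal sketch of how a per-slot block schedule might be derived from a playout vector is shown below. The block size and slot length match those used later in the experiments, but the number of presentation units per slot and the rounding policy are assumptions; the real stream schedule calculation also accounts for speed, skip, and sequence boundaries.

/* Sketch: derive blocks-per-slot from a playout vector, under stated assumptions. */
#include <stdio.h>

#define BLOCK_SIZE (64 * 1024)

/* playout[i] = bytes of presentation unit i; returns the number of slots used */
int build_schedule(const long *playout, int nunits, int units_per_slot,
                   int *blocks_per_slot, int max_slots)
{
    int slot = 0;
    for (int i = 0; i < nunits && slot < max_slots; i += units_per_slot) {
        long bytes = 0;
        for (int j = i; j < i + units_per_slot && j < nunits; j++)
            bytes += playout[j];
        /* round up: all data for a slot must be present before playback of the slot */
        blocks_per_slot[slot++] = (int)((bytes + BLOCK_SIZE - 1) / BLOCK_SIZE);
    }
    return slot;
}

int main(void)
{
    /* tiny artificial playout vector: 30 frames of varying size, 15 units per 500 ms slot */
    long playout[30];
    for (int i = 0; i < 30; i++)
        playout[i] = (i % 10 < 3) ? 60000 : 25000;   /* bursty frames */
    int sched[4];
    int n = build_schedule(playout, 30, 15, sched, 4);
    for (int s = 0; s < n; s++)
        printf("slot %d: %d blocks\n", s, sched[s]);
    return 0;
}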
2.4 System Architecture
The design of the file server is based on an administrator node and a set of server nodes, each with a processor and disk storage on multiple local I/O buses. Each node is connected to a high-performance network (we currently use both ATM and Fast Ethernet) for delivering continuous media data to the client systems (see Figure 2.2).
[Figure 2.2 (Organization of System): one or more server nodes, each with a processor, disk controllers, readers, and network interfaces, and an administrator node with a writer, controllers, network interfaces, and an attribute database on the server side, connected to each other by a dedicated network or I/O bus and to the client applications across the delivery network.]
Disk drives are attached to the I/O buses in a configuration that will provide the bandwidth sufficient to fully utilize the network interface. Multiple server nodes can be configured together to increase the capacity of the server. Since the operation of each server node is independent of the other server nodes, any number of nodes can be added dynamically (without requiring the other nodes to be reset) subject only to the capacity of the network switch configuration. Thus, a server can be made up of different types of server nodes, ranging from powerful computers with
many large disks, possibly in RAID configurations, to smaller computers with 2 or 3 disks and somewhat slower network interfaces. Configurations can even consist of server nodes which are simple processor cards interconnected via an I/O bus such as PCI. In the latter case, the nodes can communicate over the I/O bus rather than a dedicated network. This network does not need to be high-speed, since the traffic between the administrator node and the server nodes is control information, which is comparatively tiny in size, so an Ethernet is sufficient. Depending on the physical configuration of the server components, control information could be transmitted on the same network as is used for the continuous media data itself. A similar architecture is used in other scalable, high-performance video servers [6, 47]. When a client wishes to retrieve a presentation object, the initial client open request goes from the client to the administrator node. This node determines which of the server nodes has the requested object and forwards the request to the appropriate server node. The node obtains the playout vector and other attributes necessary for presentation from the administrator before accepting the open request. Communication from then on takes place directly between a particular server node and the client. The server restricts each object to be totally contained on a single server node, although it may span several disks. If a very large object exceeds the capacity of a single server node, then it is up to the client application which stores the object to define an appropriate division into multiple objects. From the server's point of view, there is no relationship between objects which have been split in this fashion.
2.5 Data Delivery
Continuous media streams must be delivered before real-time deadlines to preserve the continuity of presentation to the user. Data delivery can be controlled by the client requesting data packets (pull model), or by the server sending packets to the client as it has resources (push model). In the push model, if the transport layer
is incapable of receiving data at the rate it is being sent, a flow control mechanism (such as in TCP/IP) is often implemented to prevent the sender from flooding the receiver. In a continuous media server using the simplest push model, the server transmits bits at a constant negotiated rate and trusts the receiver to decode and present them to the user at the appropriate time. This is unacceptable for VBR streams because the client's requirements vary over time and the server would be unaware of when the client had resources (buffers) available to accept the data. If the server sent at the maximum bit rate allowed, then the client buffer utilization would grow over time until all the data had been sent. This is because when the server sends at the maximum transmission rate, for every k bits sent by the server, at most k bits (almost always fewer) are freed by displaying. Alternatively, sending data at the average bit rate could result in starvation of a stream whenever the cumulative bit rate necessary for display is above the average. A larger transfer rate than the precise average is necessary to avoid starvation. This rate can be easily computed. If this larger rate is used, the problem of buffer build-up would occur again, because for the portion of the stream following the time when the peak rate is required, buffers are released by the client at a slower rate than they are received across the network. Some of these issues are discussed in greater detail by Feng [28]. A receiver-based flow control model is equally undesirable since the round trip delay in sending the request for more data may result in an underflow at the client. If the client correctly anticipates its needs for data and has sufficient buffering capabilities, it could request data early, avoiding this problem. This requires the server to be ahead in both reading and sending in order to be able to respond to the client's requests. As well, the request traffic in the reverse direction could be significant if sufficiently detailed granularity is desired. The design of the CMFS utilizes an alternative method that provides flow control in the sense that the server never sends data faster than the client can handle
it, but does not require explicit client requests. The server has knowledge of the exact presentation requirements from the playout vector stored at the administrator node. Thus, it can send data at precisely the rate needed every second. This information, plus knowledge of the client buffering capabilities and the rate at which the client can handle incoming packets, permits the server to send data in order to keep client buffers from starvation or overflow. Details of this flow control mechanism are provided in Section 3.4.
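The following sketch illustrates the spirit of this approach under simplifying assumptions: the server knows the client's per-slot consumption (from the playout vector) and the client's total buffer space, and in each slot sends as much as possible without overflowing that buffer. The byte counts and the per-slot network limit are invented for illustration; the real mechanism of Section 3.4 also accounts for packet rates and the credit issued by the network manager.

/* Sketch of the send-ahead decision, with made-up numbers. */
#include <stdio.h>

long send_budget(long client_buffer,   /* total client buffer, bytes            */
                 long in_client,       /* bytes already buffered at the client  */
                 long net_limit)       /* max bytes sendable to this client now */
{
    long room = client_buffer - in_client;      /* never overflow the client */
    return room < net_limit ? room : net_limit;
}

int main(void)
{
    long consume[6] = { 200000, 350000, 150000, 400000, 100000, 250000 };
    long client_buffer = 600000, in_client = 0, net_limit = 300000;

    for (int slot = 0; slot < 6; slot++) {
        long sent = send_budget(client_buffer, in_client, net_limit);
        in_client += sent;
        if (in_client < consume[slot])
            printf("slot %d: starvation!\n", slot);   /* admission control should prevent this */
        in_client -= consume[slot];
        if (in_client < 0) in_client = 0;
        printf("slot %d: sent %ld bytes, client buffer now holds %ld\n",
               slot, sent, in_client);
    }
    return 0;
}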
2.6 Design Limitations
In addition to the previous categories, design decision trade-offs were identified in the design of the CMFS. The decisions made in these particular cases are described in the remainder of this section.
The time between the client request for delivery and the return of control to the client for the initiation of presentation has an upper limit. This is the longest amount of time needed to read and transmit the first slot of data, plus round-trip message latencies. Other servers delay requests by a variable amount of time, ranging from several seconds to minutes, in order to optimize server resource usage. One of these resources is disk bandwidth, which may be saved by pre-fetching data at a slower rate than required and beginning to send only when a sufficient amount has been buffered so that constant rate retrieval and transmission will satisfy the client's requirements [37, 60]. As well, requests may be batched so that multiple requests for the same object that arrive within a small window of time are treated as one object from the disk retrieval point of view. It is further advantageous to wait in the hope that more disk bandwidth will be available in the near future. The bounded response time of the CMFS, however, is also an advantage. For one, limiting the response time for the delivery request allows playback to begin almost instantaneously. As well,
a precise knowledge of the time at which data will arrive permits a client to request streams from different server nodes, or different servers, which possibly have differing delay latencies, and present them to the user in synchrony. If the response from one of the streams can be delayed an unknown amount of time, this mode of interaction becomes impossible.
All data for each stream connection is treated independently, and multiple users requesting the same stream concurrently could create a situation in which multiple buffers containing the same data block may exist at the server. Other work on optimizing disk access [43] is tightly integrated into the server design.
2.7 Stream Characterization
A video-on-demand service can provide audio and video streams of many differing characteristics, such as average bit-rates, stream playback durations, and peak rate of data transmission. Typical environments for such video servers can range from movie-on-demand, where the average length of a stream is 100 minutes, to news-on-demand, where many stories may be quite short in duration (i.e. less than one minute), and nearly all will be shorter than 10 or 15 minutes. Between these extremes may be environments such as video juke-boxes, with typical stream playback durations of between 2 and 5 minutes, distance learning servers with lessons or seminars that could range from several minutes to an hour in length, or live transmissions with indefinite duration. The quality of encoding can also differ greatly, depending on the intended viewing environment. Full-motion, high-resolution video would almost certainly be required for movies, while viewers of news may accept half-motion or lower resolution, or both. While it is possible for a general server to handle all of these types of streams at once, the admission algorithms and resource usage will be more efficient if tailored to a particular type of viewing environment. The types of streams that have been
chosen as the test bed for the admission algorithms developed for the CMFS are full-motion, medium-to-high quality VBR video streams. Most audio streams are constant-bit-rate if encoded with standard pulse-code-modulation (PCM) encoding techniques. Voice quality audio objects are also reasonably low bit-rate. Audio can also be encoded at a variable bit rate, at varying levels of quality, but does not typically provide bit rates of similar size. A large number of reasonably short video streams were digitized and compressed using a Parallax MJPEG encoder/decoder card attached to a SUN Sparc 10. The bandwidth requirements of the streams are summarized in Table 2.1. This card is capable of capturing VHS video at 30 frames per second at a resolution of 640x480. After compression, the average bit rates of the streams ranged from 1.9 Mbps to 7.28 Mbps. Some streams were in black and white, but most were in colour and the frame rates for the video were either 20, 24, or 30 frames per second. Multiple audio formats were available, but their low bit-rate and the fact that most encoders produce constant bit-rate streams makes them uninteresting from a performance point of view. Some streams were also encoded using an Ultimotion MJPEG encoder card installed in an IBM RS/6000 running the AIX operating system. The version of the card available was not capable of encoding at full resolution at 30 frames per second, so it was used sparingly.
Stream                       Frames  FPS  Time (s)   Bandwidth Required (Blocks/Slot)
                                                      Min  Max   Ave    Stdev   CoV
Ads - MM                       3596   30   119.867      2    9   3.82   1.01    0.26
Aerosmith - MM                11682   30   389.4        1    6   3.6    0.851   0.24
Akira                          3822   30   127.4        2    8   4.745  1.24    0.26
Annie Hall                     4503   30   150.1        3    6   3.711  0.84    0.225
Aretha Franklin - BB          12535   30   417.833      2    7   3.71   0.68    0.184
The Arrow - CBC                4446   30   148.2        2    7   3.31   0.933   0.28
Baseball - SI                  2570   30   85.667       1    7   3.65   1.1     0.3
Basketball - SI                4072   30   135.733      3   13   6.69   2.23    0.354
Beach Boys - MM                4039   30   134.633      2    6   3.93   0.66    0.17
Beatles - MM                   6448   30   214.933      1    5   3.19   0.7     0.22
Bengals - SI                  10498   30   349.933      2    7   4.62   0.98    0.212
Boxing - SI                   10775   30   359.167      2   11   5.17   1.3     0.25
Buddy Holly - MM               5887   30   196.233      2    5   3.92   0.68    0.18
The Cars - MM                  6473   30   215.767      2    7   3.83   1.12    0.29
Cartoon Trailers               1791   20   89.55        2    9   5.26   1.59    0.302
Cat In The Hat                 1040   20   52           1    6   4.664  1.067   0.223
Clinton - CBC                  3710   30   123.67       2    6   4.4    0.829   0.186
Country Music - CBC            4718   30   157.267      3    9   3.75   0.84    0.224
Chases - SW                    9924   30   330.8        1    9   4.14   1.03    0.249
Coaches - SI                  11023   30   367.43       2    8   4.95   1       0.202
Criswell - Plan-9              1517   20   75.85        1    3   1.71   0.49    0.287
Dallas Cowboys - SI            3805   30   126.83       2   11   5.62   1.9     0.34
Due South                     16824   30   560.8        1    4   2.63   0.53    0.2
Eric Clapton                   6246   30   208.2        3    6   4.43   0.59    0.13
Evacuation - Empire           13888   30   462.93       1    7   3.33   0.77    0.23
FBI Men - Raiders              5201   30   173.36       2    6   4.45   0.896   0.201
Fender Anniversary - MM        7637   30   254.57       2    5   2.94   0.6     0.2
Fires - CBC                    4663   30   155.433      2    8   4.34   0.973   0.224
Fleetwood Mac - MM             6447   30   214.9        2    7   3.19   0.74    0.23
Forever Rivals - CBC          12177   30   405.9        2    8   3.67   1.18    0.32
George of the Jungle           1192   20   59.6         2    9   5.95   1.32    0.222
Island of Whales               2798   20   139.9        1    5   2.86   0.776   0.272
Joe Greene - SI                8804   30   293.47       3    8   5.09   0.94    0.185
John Elway - SI                3117   30   103.9        3   10   5.93   1.77    0.299
The Kinks - MM                 3994   30   133.13       2    5   2.9    0.541   0.19
Christian Laettner - SI        9973   30   332.43       2   11   5.944  1.86    0.312
John Major - CBC               4306   30   143.53       2    6   3.66   0.821   0.224
Maproom - Raiders             10843   30   361.43       1    9   4.24   1.46    0.34
Minnesota Twins - SI           5476   30   182.53       3   14   6.03   2.11    0.35
Moody Blues                    6565   30   218.83       2    5   2.96   0.575   0.19
Moriarty - Star Trek          10906   30   363.533      2    5   2.73   0.578   0.212
Montreal Canadiens - SI        3490   30   116.33       3    9   5.86   1.36    0.23
Mr. White - RD                 2086   20   104.3        1    3   2.163  0.51    0.237
NFL Deception - SI            17245   30   574.83       2    8   4.62   1.08    0.235
NFL Football - SI             13332   30   444.4        2   11   5.6    1.64    0.29
1998 Olympics                  9123   30   304.1        2    8   3.59   0.79    0.221
Pink Floyd - MM                5922   30   197.4        1    7   3.5    1.06    0.3
Plan-9                         3186   20   159.3        2    4   2.285  0.459   0.20
Raiders - Raiders             12048   20   602.4        1    6   2.9    0.75    0.258
Ray Charles - BB               8491   30   283.033      5    9   7.28   0.86    0.119
Bloopers 93 - SI              10810   30   360.33       2    9   4.52   1.29    0.287
Intro - SI                    10560   30   352          2    9   4.74   1.35    0.285
Spinal Tap - MM               10199   30   339.967      2    9   5.41   1.3     0.241
Sports Page Hilites            4520   30   150.667      2    8   4.42   1.16    0.26
Star Trek - Voyager            3129   30   104.3        1    4   2.5    0.621   0.25
Death Star - SW               18455   30   615.167      2    9   4.17   0.93    0.22
Princess Leia - SW            18774   30   625.8        1    6   3.55   0.736   0.207
Rescue - SW                    9299   30   309.967      2    6   3.38   0.71    0.209
Snowstorm - Empire             7037   30   234.57       1    9   3.38   1.46    0.43
Summit Series 1972            14450   30   481.667      2    7   3.81   1.175   0.308
Super Chicken                   599   20   29.95        2    6   3.86   0.99    0.258
Tenchi Muyo                    2853   30   95.1         3   11   5.89   1.29    0.22
Tom Connors - MM               4097   30   136.567      3    7   4.63   0.715   0.154
Toronto Blue Jays - SI         4245   30   141.5        2    7   3.92   1.3     0.33
X-Files                       14798   30   493.267      1    5   2.39   0.63    0.26
Yes Concert (24 fps) - MM     10253   24   427.21       1    5   2.54   0.701   0.275
Yes Concert (30 fps) - MM     13077   30   435.9        2    8   3.75   0.944   0.251
Table 2.1: Stream Characteristics

Variability in the bit-rate of the compressed video signal comes from differences in motion within a scene or complexity differences between images. Since MJPEG encoding produces only intra-coded frames, no motion detection is performed. The MPEG compression standard incorporates motion detection, and can thus achieve higher compression in scenes with very little motion (i.e. newscaster footage). The variability in the data rate of Motion-JPEG video objects comes from the difference in the complexity of each frame, and by extrapolation, each scene. To achieve streams with substantial variability, clips were chosen that contained alternating scenes of complex action and primarily solid-colour low motion. Even though the server is designed to accommodate multiple heterogeneous formats, hardware limitations restricted the choice of formats readily available for performance testing. Only the two MJPEG hardware encoders were available. A software encoder for MPEG was available, but the performance in the existing systems made it unsuitable for encoding reasonable length video streams. Short clips of Quicktime, MPEG-1, and MPEG-2 video are available on the Internet, but they were not used in the experiments
because either the bit-rate was too low, the playback was too short, or the streams typically lacked audio. All of these formats have been used successfully in the CMFS for demonstration purposes, validating the heterogeneity of the system. The video streams were chosen to be representative of a News-On-Demand environment. Thus, many clips have alternating scenes of narration, followed by news footage with higher complexity. Other clips are short scenes from movies, or music videos that exhibit similar scene changes, although some of them are not as dramatic. The most significant variability is found in the sports highlight clips and movie scenes. The bandwidth requirements in Table 2.1 are stated in blocks per slot. In the version of the server used for the performance experiments of this dissertation, blocks are 64 KBytes in size and a slot is 500 msec in length. Thus, 1 block/slot is equivalent to 1 Megabit per second (64 KBytes per 500 msec is 1,048,576 bits per second), which is just slightly more than 1 Mbps. Various abbreviations are used in identifying the source of the streams. These are listed in Table 2.2.
Abbreviation   Source
BB             Blues Brothers
CBC            Canadian Broadcasting Corporation
MM             Much Music
RD             Reservoir Dogs
Raiders        Raiders of the Lost Ark
SI             Sports Illustrated
SW             Star Wars
Empire         The Empire Strikes Back
Table 2.2: Stream Sources

The length of the video streams ranges from 30 seconds (a cartoon theme song) to 10:25 (a scene from the movie Star Wars). The number of blocks required in any one slot ranges from a low of 1 (in many streams) to a high of 14 (in the
sports highlight clip, Minnesota Twins). Variability can be measured in three ways: the standard deviation of the block schedule, the coefficient of variation (which normalizes the variability) of the block schedule, and the ratio of peak to average bit-rates. These give different rankings for the streams, so the coefficient of variation was used when dividing streams according to their variability. This measure captures the long-term variability and is not biased in favour of the large bandwidth streams. Streams with these characteristics were chosen because they are typical of a News-On-Demand environment: moderately high-quality video objects with a typical playback duration of less than 15 minutes. The average bit-rates are common for full-motion, 30 frame per second MJPEG video and moderate quality MPEG-2 video. The variability within the streams themselves covers a wide range. Again, this would be reasonable in a news-on-demand environment, with some stories consisting of only newscaster footage (low variability), while others may have rapid scene changes of varying complexity (high variability).
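As a concrete illustration of these measures, the short sketch below computes the average, standard deviation, coefficient of variation, and peak-to-average ratio of a made-up block schedule. Whether the population or sample form of the standard deviation was used for Table 2.1 is not specified; the population form is shown here.

/* Sketch: variability measures over a per-slot block schedule (invented numbers). */
#include <stdio.h>
#include <math.h>

int main(void)
{
    double sched[8] = { 3, 5, 2, 7, 4, 3, 6, 2 };   /* blocks per slot */
    int n = 8;
    double sum = 0, peak = 0;
    for (int i = 0; i < n; i++) {
        sum += sched[i];
        if (sched[i] > peak) peak = sched[i];
    }
    double mean = sum / n, var = 0;
    for (int i = 0; i < n; i++)
        var += (sched[i] - mean) * (sched[i] - mean);
    double stdev = sqrt(var / n);

    printf("average         = %.3f blocks/slot\n", mean);
    printf("std deviation   = %.3f\n", stdev);
    printf("coeff. of var.  = %.3f\n", stdev / mean);   /* normalized variability */
    printf("peak / average  = %.3f\n", peak / mean);
    return 0;
}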
Chapter 3
System Design and Implementation
The system model defined in the previous chapter provides the context in which the Continuous Media File Server was built. The purpose of this chapter is to describe the particular design decisions made in the construction of the CMFS and thereby to validate the feasibility of the system model. The structure of the server components is first described, with respect to client-server and server node - administrator communication. The client interface to the server is presented which enables the facilities of the server to be tested in a practical environment. The description of the interface includes the corresponding server action that is triggered by each particular request, providing more insight into the organization of the server components and their utilization of system resources. Issues relating to the delivery of data from both the retrieval and the storage points of view are discussed along with the implications for the mechanisms/protocols necessary to implement the designed functionality. Finally, an example of how the server's flexible interface permits a complex client application to access presentation objects is provided, complete with code fragments. Evaluation of a real server enables more directly applicable and convincing results than simulation modeling alone. The hardware configurations available
limited the extent to which testing could be performed on an actual server, but the design objectives were verified in the construction of the server.
3.1 System Initialization and Configuration
As mentioned in Section 2.4, the server is designed as a distributed system, consisting of at least two components with different network identities: an administrator node and a server node. Server nodes register and de-register with particular administrator nodes on a particular network interface address. The network interface address is identified by an IP address/port number pair. The server node must identify which network address it wishes to use to communicate with continuous media clients. It is possible to use a different interface to communicate with the administrator. On the other side of the network are the client applications. The basic structure of the software on a typical client and the organization of the server is shown in Figure 3.1. The operations performed and the relationships between the tasks are described in the remainder of this section and in Section 3.2. The administrator node's primary function is to receive messages that request the establishment of the real-time connection between a server node and a client. The administrator also functions as the centralized database server for metadata/attributes associated with presentation objects, some of which are necessary for retrieval and admission scheduling. The administrator waits for requests from its clients, which include server nodes. The first message that must be received prior to the opening of any connections is the registering of server nodes. Each server node must register with the administrator to enable the administrator node to forward the appropriate open connection requests. When a server node begins its processing, it must allocate main memory buffer space for each of its disks. Memory also must be allocated for connection management, and the disk allocation vector (i.e. superblock) for each disk which is retrieved from the administrator. A permanent copy of the superblock for each disk is kept in the administrator database to prevent a disk crash or other node failure from leaving the server node with an inconsistent view of what data is stored on its disks.
[Figure 3.1: Software Structure of Server Node. The client application (Client Writer, Network Reader, Client Manager) exchanges control messages such as Open, Delete, Create, PutAttr, GetAttr, Prepare, Stop, and Close with the administrator node and the server node. The server node contains the Node Manager, Worker threads, Stream Managers, the Write Manager, the Disk Manager with its Admission Queue and Stream Queue, the Network Manager, the Connection Table, and the Stream Schedules; media data travels to the client over the real-time data connection.]
Then, the server node registers its availability to the administrator, and is ready to accept requests from clients. The organization of tasks and flow of control in each component of the server and client can be modeled by interdependent threads of execution which exchange messages and share resources, such as server buffers and network bandwidth, and coordinate processing based on the availability of those resources. In Figure 3.1, the dotted lines represent message transfer between the client, the administrator and a server node. These messages contain requests for continuous media transfer or attribute data and must be delivered with a reliable protocol. The thin solid lines indicate the manipulation of shared state by the threads at the server node. The
thick solid lines represent real-time transfer of media data, which must be real-time for the retrieval process. Writing objects to the server must utilize a reliable protocol, and may be near real-time. The software components involved in storing and retrieving an object are also shown in Figure 3.1. The major thread at the server node is the node manager, which receives client and administrator requests. For retrieval, the network manager apportions credit for sending data to client applications. Each disk has a disk manager thread which manages the disk buffers allotted to that disk and enqueues blocks for transmission. Finally, each opened stream has a stream manager thread which dequeues buffers that have been read off the disk and sends them on the network connection according to the credit issued by the network manager. When storing an object, the write manager is responsible for receiving the media data from the client and storing it on the disk as well as storing the object attributes at the administrator. More detail on the interactions and activities at the server as the result of user requests is given in the following section.
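The division of labour among these threads can be sketched in terms of the shared state they manipulate. The field names, queue length, and the simple proportional credit policy below are assumptions made only for illustration; the actual structures and the network manager's policy are internal to the CMFS.

/* Illustrative per-connection state and a hypothetical credit policy. */
#include <pthread.h>

#define QUEUE_LEN 64

struct block_queue {                 /* blocks read by the disk manager,    */
    void *blocks[QUEUE_LEN];         /* waiting for the stream manager      */
    int head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t  nonempty;
};

struct connection {                  /* one entry in the connection table   */
    int   cid;                       /* connection identifier                */
    int  *block_schedule;            /* blocks to read per slot (the stream  */
    int   nslots;                    /*   schedule for this connection)      */
    int   send_credit;               /* blocks the network manager currently */
                                     /*   allows the stream manager to send  */
    struct block_queue q;
    pthread_t stream_manager;        /* dequeues blocks, sends per credit    */
};

/* Hypothetical policy: give every stream what it needs when the total fits
 * within maxXmit, otherwise split the slot's capacity proportionally.       */
void apportion_credit(struct connection *conns, int nconns, int slot, int maxXmit)
{
    int demand = 0;
    for (int i = 0; i < nconns; i++)
        if (slot < conns[i].nslots)
            demand += conns[i].block_schedule[slot];

    for (int i = 0; i < nconns; i++) {
        int want = (slot < conns[i].nslots) ? conns[i].block_schedule[slot] : 0;
        conns[i].send_credit = (demand <= maxXmit) ? want
                                                   : (want * maxXmit) / demand;
    }
}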
3.2 User Interface to CMFS Facilities
The client interface calls can be categorized as follows: calls which relate to objects, calls which relate to connections for delivery of continuous media, calls which relate to metadata for objects, and calls which involve directory service functionality, as summarized in Table 3.1. The details of the more significant interface calls are provided in the remainder of this section. Appendix A contains a complete description of the API, including the interpretation of each parameter required in each call. The most significant interface call is CmfsPrepare, which results in four major activities at the server: 1) the block schedule is calculated, 2) disk and network admissibility is determined, 3) the respective disk manager is informed of the new stream's disk block schedule, and 4) the transfer of the real-time data is initiated. Control only returns to the client when an initial buffer of data has been sent. The
initial buffer is sufficient to ensure that the client will always have sufficient data for presentation of the object to the user at the requested rate.

Task                      Interface Routines
Object Manipulation       CmfsCreate, CmfsWrite, CmfsComplete,
                          CmfsRemove, CmfsMigrate, CmfsReplicate
Stream Delivery and       CmfsOpen, CmfsClose,
Connection Management     CmfsPrepare, CmfsReprepare, CmfsStop,
                          CmfsRead, CmfsFree
Meta Data Management      CmfsPutAttr, CmfsGetAttr
Directory Service         CmfsLookup, CmfsRegister

Table 3.1: Cmfs Interface Procedures
Object Creation and Removal. Most client interaction with the server is in
retrieval mode. It is necessary, however, to store CM objects in the server before they can be retrieved. Over the course of time, these objects may also be moved, replicated, or deleted in response to user requests or server load-balancing needs. When an object is created, the client application uses the CmfsCreate call. Initially, a message is sent to the administrator to set up the identification and location of the object. A server node on which to place the real-time data is chosen by the administrator. In turn, the server node chooses the disk device(s) for the media data. The client receives a UOI which is to be used in all further queries concerning the object. The server must know about the normal display rate of the presentation object in order to calculate the rates at which data must be transferred to the client. This is one of the parameters provided by the client application in the CmfsCreate call. Since many media types do not have a rate that can be expressed as an integer number of presentation units per second, a ratio of presentation units to milliseconds is used to allow specification of arbitrary display rates. For example, MPEG audio can be encoded at approximately 19.14 frames per second, but the specification for
the encoding is precisely 49 frames per 2560 milliseconds. The interface procedure CmfsWrite stores a sequence of continuous media at the server node. An individual sequence is stored in a contiguous fashion on the disk. Segmenting the object in this manner allows a client application to choose to only retrieve a certain portion of the stream in order to achieve fast-motion display at similar bandwidth levels to that required for full-motion display. This is further elaborated on in the section on CmfsPrepare. One interesting possibility is to store an MPEG video object in the following manner: each I-frame could be a sequence, and all the B and P frames which rely on that I-frame for interpretation could be another sequence. A client requesting every other sequence would then be able to retrieve I-frames only. Another possibility could be storing one video frame per sequence (in a purely intra-coded video object) so that retrieving every other sequence results in perfectly smooth fast forward at twice the normal frame rate. These details only affect the relationship between the client applications which store and retrieve the media data. After the last sequence has been stored, the client issues a CmfsComplete call, which informs the server node that it can commit the changes to the administrator database that are associated with this presentation object. This includes the attributes that are defined by the server node which are necessary for stream retrieval. These are: 1) the location of the raw data, 2) the sequence map (an array of sequence beginning and end points, with associated display unit information), and 3) the presentation unit sizes for the entire object. In addition, the revised copy of the disk block layout (superblock) is stored at the server to ensure consistency in the event of a failure of the server node part-way through storing an object. Objects can also be removed from a server node via CmfsRemove. This could happen as a result of migration or direct removal by a client application. All attributes are removed from the administrator database and space on the disk device is reclaimed.
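A sketch of how a storage client might drive these calls is shown below. The prototypes are assumptions made for illustration only; the actual parameter lists are given in Appendix A. The rate ratio of 49 presentation units per 2560 milliseconds is the MPEG audio example mentioned above.

/* Sketch of a storage client; assumed prototypes, not the Appendix A signatures. */
#include <stdio.h>

typedef char Uoi[64];

int CmfsCreate(const char *admin, int units, int msec, Uoi uoi);     /* assumed */
int CmfsWrite(const Uoi uoi, const void *seq, long len, int nunits); /* assumed */
int CmfsComplete(const Uoi uoi);                                     /* assumed */

int store_clip(const char *admin, const void **seqs, const long *lens,
               const int *units, int nseqs)
{
    Uoi uoi;
    /* 49 presentation units per 2560 ms: arbitrary display rates as a ratio */
    if (CmfsCreate(admin, 49, 2560, uoi) != 0)
        return -1;
    for (int i = 0; i < nseqs; i++)           /* one contiguous sequence per call */
        if (CmfsWrite(uoi, seqs[i], lens[i], units[i]) != 0)
            return -1;
    return CmfsComplete(uoi);                 /* commit attributes at the administrator */
}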
Object Replication and Migration. Each disk device has a limited bandwidth
that restricts the number of independent retrievals of high-quality, full-motion video streams to a relatively small number (i.e. fewer than 10). The access patterns of all types of media, including video rentals [36], show that at any given time, some objects are much more popular than others. If a server is to be capable of supporting dozens or hundreds of simultaneous users requesting presentation objects with a realistic distribution pattern, the bandwidth available for these objects must be greater than that provided by an individual disk. Replication provides the most benefit to a server like the CMFS. Replication can be done on at least three levels: between disks within a server node, between server nodes on an individual server, or between servers. The first two levels of replication increase the number of simultaneous users of an individual object at an individual server. Replicating between nodes in a single server has the added benefit of load-balancing within the server and increases the reliability and availability of objects. The final level of replication also increases the availability of objects should a server failure occur and may reduce the overall cost of retrieving remote objects by copying them closer to the location where they are frequently accessed. The main benefits of migration are load balancing and reducing remote retrieval costs, but it does not increase availability, since the same number of copies of each object exist. Several servers may be installed on a particular wide-area network or throughout an internet. A location service can be added which allows a client application to determine the existence and location of the objects it wishes to present to the user [46]. This location service is independent of the structure of the CMFS. The only enhancement needed is that an administrator node must register with the Location Server if it is willing to export the objects stored on that server via CmfsRegister. If a client wishes to retrieve an object in a system with a location server without consideration of which instance is returned, the client can perform a CmfsLookup request. This call will contact the location service and return the location(s)
of all copies of the object. Replication and Migration are achieved via the CmfsReplicate and the CmfsMigrate interface calls, respectively. These can be initiated manually or performed automatically, based on some threshold of use of a particular stream or a threshold of load on a particular server. An unfortunate consequence of migrating due to heavy load may be that this condition of heavy load is prolonged by the migration process itself. If automatic moving of objects is enabled, a load monitoring facility is activated in each administrator node to determine when to initiate the copy operation. Replication and Migration take place on-line by utilizing server resources which are in excess of those required to perform the delivery for requested streams. During periods of heavy use, this may result in a very slow migration procedure. The analysis and implementation of location service and migration functionality is given a complete discussion in Kraemer [46].
Connection Establishment and Teardown. For connection maintenance, client
applications have two interface calls: CmfsOpen and CmfsClose. CmfsOpen establishes a transport layer connection from the server to the client for delivery of the stream data of an object. The caller provides the UOI for the object that it wishes to receive and sends a message to the administrator node. The request is then forwarded to the server node that contains the object. If the object which is to be opened does not exist in the directory, a corresponding failure status is returned immediately. A connection identifier (cid) is returned for use in all further communication with the server node regarding the object that it has just opened. In this respect, a connection identifier is analogous to a UNIX file descriptor. The other useful information returned from CmfsOpen is an upper bound on the amount of time that a call to prepare a stream for delivery (CmfsPrepare) will take. It is based on the time necessary to perform the admission control and transfer an initial buffer
of data over the network connection. The client application uses this information to coordinate the playback of multiple streams. If the client knows how long the preparation of streams A, B, and C will take, it can then determine the proper times to issue these prepare requests so that the reading and synchronized presentation of these streams can be accomplished with minimal buffering at the client application. The Node Manager initializes an entry in the connection table and creates a Stream Manager thread for the object. This thread actually establishes the transport layer connection to the appropriate port on the client machine. The parameter list for CmfsOpen includes a callBack procedure which is executed at the client before accepting the connection. The callBack procedure evaluates the bandwidth parameters of the connection and the amount of client buffer space to be dedicated to this connection. If the client has more resources than required, it informs the server of this fact, so that the delivery of data can use those extra resources. If the client has fewer resources than necessary, the connection is not established and a failure status is returned. If and only if the client and the server node can accept the connection parameters, the connection request is granted. The granting or refusal of the connection is relayed back to the client via the administrator node. The client must have at least enough buffer space to store the largest two consecutive slots' worth of data for the opened presentation object. This is because the client library performs double buffering. During the playout of the current slot, the next slot is transferred into client memory across the network interface. Due to the variable bit-rate nature of the data to be displayed, all the data for a slot must be present at the client before playback of the slot can be initiated and this space must remain available for decoding for the duration of the slot. The client does not necessarily know exactly how many bytes are strictly necessary before beginning playback to avoid starvation at the client. It is possible that 50% or more of the bytes in a slot are for the first presentation unit, or equivalently, that 50% are for the last presentation unit. In the former case, starvation
would result in jitter within the slot as the next video frame could not be displayed or the audio device would run out of data. In the latter case, a large amount of buffer build-up for the last frame would occur. If this space was needed by the transport layer for the data to be displayed in the next slot, then buffer overflow would result. Therefore, at worst, this requires the maximum amount of data that must be presented in the largest two consecutive slots. When the delivery of data is no longer required for the object, a client application invokes CmfsClose on the connection. All the resources allocated at the server are released and the transport level connection is gracefully torn down. It is possible that a malfunctioning client or disconnected network could result in a lost CmfsClose request. Therefore, the server implements a timeout mechanism that tears down the connection if there has been no traffic for a certain amount of time (this timeout value can be set differently for each system configuration).
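The minimum client buffer rule can be expressed directly: scan the per-slot byte requirements and take the largest sum over any two consecutive slots. The sketch below does exactly that with made-up byte counts.

/* Sketch: minimum client buffer = largest two consecutive slots of the stream. */
#include <stdio.h>

long min_client_buffer(const long *bytes_per_slot, int nslots)
{
    long worst = 0;
    for (int i = 0; i + 1 < nslots; i++) {
        long pair = bytes_per_slot[i] + bytes_per_slot[i + 1];
        if (pair > worst)
            worst = pair;          /* double buffering: current slot plus next slot */
    }
    return worst;
}

int main(void)
{
    long slots[6] = { 196608, 327680, 262144, 458752, 131072, 196608 };
    printf("minimum client buffer: %ld bytes\n", min_client_buffer(slots, 6));
    return 0;
}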
Data Delivery. For stream delivery, the client interface is CmfsPrepare. This
call requests that a certain portion of the stream be delivered at a specific rate, and provides a guarantee that the client will begin consumption of the data within a specific amount of time, given as a delay bound. This constitutes a "contractual obligation" by the client to retrieve the data at the prescribed rate. When a client issues CmfsPrepare, the request is sent from the client directly to the server node. The client request is put on the Admissions Queue for the appropriate Disk Manager thread. CmfsPrepare allows the client application to achieve all the "virtual VCR" support implemented by the server. The four parameters which thus empower the client are: start, stop, speed, and skip. The start and stop positions in the stream are given as sequence identifiers. This defines the portion of the stream to be transferred. If start is later than stop, the stream is delivered in rewind mode. Fast-motion or slow-motion display can be accomplished by the selection of speed and skip parameters. A value of 100 for speed and 0 for skip indicates that the stream
is to be delivered at full speed (same speed at which it was recorded) and that no sequences are to be skipped. Increasing the value of speed to a value greater than 100 implies that more server bandwidth will be necessary to obtain the desired display rate. Fast motion is more easily obtained by altering the skip parameter which will cause the CMFS to only retrieve a subset of the sequences in a stream (i.e. skip=1 indicates that one sequence will be skipped for every one read, skip=2 indicates 2 skipped for every one read, etc). Given the selection of parameters, the appropriate stream schedule is constructed and the request is presented for admission control. The Disk Manager constructs the stream schedule for the requested portion of the object and performs admission control for the request. Details of the disk admission control algorithms for the CMFS are given in Section 4.1. If the object can be scheduled from both the disk and the network point of view, a positive response is sent to the node manager and the schedule is updated. Control returns to the client when a sufficient quantity of data has been sent that the client is guaranteed to not encounter starvation. There is no other provision for start-up latency in the CMFS. If the object cannot be scheduled for immediate transmission, the request for delivery is refused. If the request is accepted, however, the server continues to read and transmit blocks subject to buffering constraints at the server and client. The delivery of data is guaranteed in the sense that the server will always send data ahead of time, or just in time to allow presentation of the data to the user. The correct arrival of this data cannot be guaranteed, but lost data can be compensated for by client applications. Starvation is prevented by sending the first slot of data before returning from the call to prepare a stream. At the server, this requires scheduling the disk reads for the entire stream, completing the disk reads for the first slot, and sending the bytes of data across the network. This is shown in Figure 3.2. On a lightly loaded system, this may happen in a very small amount of time, and prepare could return as early as time T1 (if the scheduling and reading
operation was done so quickly that buffers were available for send ahead at that time), although the data is not guaranteed to arrive until T2 (the end of slot n+2). If the client begins reading at T1, then later in time, the system may become heavily loaded, preventing transmission of data until the end of the guaranteed slot. This results in starvation for the client application. Therefore, the protocol waits until time T2 before returning from CmfsPrepare.
[Figure 3.2: Prepare Timings. A timeline over server slots n to n+3 shows the prepare request for stream S arriving during slot n, the guaranteed reads of slots 0, 1, and 2 and sends of slots 0 and 1 of stream S, the earliest possible return time T1, and the guaranteed return time T2 at the end of slot n+2.]

When control is returned from the CmfsPrepare operation, the client is ready to read and process the media stream. This is done via CmfsRead requests. The first call to CmfsRead informs the server that processing of the stream has begun. This is done via the sending of a "start packet". The start packet tells the server at what time the client began reading. No further communication from the client to the server is necessary, because the server then assumes that the client will continue to consume data at the rate which was specified in the prepare call. There is delay in the transmission of the start packet, so the client sends the local time (assuming synchronized clocks) inside the packet. This allows the server to get an estimate of network delay (Ta - Ts). Additionally, the server calculates the proportion of a slot that has been consumed at the client at the exact time of a slot boundary. On the first timer interrupt after the receipt of the start packet (at Tc1
in Figure 3.3), a fraction of a slot proportional to the time Tc1 - Ts is added to the client buffer capacity and thereafter, complete slots are used. This is known as the Total Client Credit (TCC) schedule, which is calculated as the stream is delivered.
[Figure 3.3: First Read Operation — the client sends the start packet containing its local time Ts; the server receives it at Ta and, at the next timer interrupt Tc1, computes T = Tc1 - Ts and the number of bytes consumed in T; thereafter it counts the bytes consumed in complete slots (Tc1 - Tc0).]
The server utilizes this information in the data delivery flow control mechanism. Every subsequent call to CmfsRead is a local client operation which simply passes the data from the network buffers to the application. Once CmfsPrepare has returned, the client must begin reading within a designated interval of time determined in the prepare request by the buffering allocated at the client. CmfsFree returns that storage to the system when the client application has finished using it. During the delivery of an object, the client application may find it necessary to alter the delivery parameters. The parameters which may be adjusted are speed and skip. This can be accomplished by calling CmfsReprepare. The circumstances and mechanisms for implementing CmfsReprepare are given in Section 3.4. The call to terminate delivery of data is CmfsStop. A request is sent directly to the Node Manager thread at the server node, which causes the Disk Manager to remove the object from its active list, as well as the related disk block requests. Queued server buffers are flushed without sending them across the network. Finally, control is returned to the client. Before returning control to the application, the
client code for CmfsStop also throws away buffers that have been received at the client, but not yet consumed by CmfsRead operations. The identifier of the last sequence successfully processed is returned to the client application, so that display can resume at approximately the same place within the stream.
Metadata storage and retrieval. The administrator node contains information
about each presentation object. This information is written by server nodes or client applications to be retrieved later. CmfsPutAttr stores an attribute, while CmfsGetAttr retrieves an attribute. The server node stores attributes of this nature during the creation of an object. Client applications may also make use of the attribute facility and store arbitrary metadata regarding an object. Some examples of attributes that a client might find useful are: date of creation, copyright owner, and encoding format. Client applications can also utilize the server attributes in a read-only fashion. There are limited directory functions in this simple database that allow a client application to determine which UOIs are stored in the database. It is also possible to view the attributes that are associated with each UOI. Since the attribute values are arbitrary bit strings which can be written by various client applications, some attribute values may not be useful to other client applications. The formats of the attributes written by the server node are known to all applications.
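As an illustration of how a client might use this facility, the fragment below stores and then retrieves an encoding-format attribute for an object. The argument lists shown for CmfsPutAttr and CmfsGetAttr, and the attribute name and objectUOI variable, are assumptions made for this sketch and may not match the actual interface.

/* Hypothetical attribute usage; the CmfsPutAttr/CmfsGetAttr argument lists,
 * the "encoding_format" attribute name, and objectUOI are illustrative only. */
char format[]  = "MJPEG";
char value[64];
int  valueLen  = sizeof(value);

if (CmfsPutAttr(objectUOI, "encoding_format", format, sizeof(format)) != STREAMOK)
    printf("CmfsPutAttr failed\n");

if (CmfsGetAttr(objectUOI, "encoding_format", value, &valueLen) == STREAMOK)
    printf("encoding format: %s\n", value);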
3.3 Slot Size Implications

The choice of slot sizes has many implications for server resource usage. One such resource is the amount of memory needed for disk blocks to be buffered at the server and the client. The server performs at least double buffering of the data for each stream. While the data for slot n is being retrieved, the data for slot n - 1 is transferred to the client. If excess server bandwidth and buffer space exist, slots n + 1, n + 2, ... may be retrieved at the same time, but the minimum amount of
buffer space required for each stream is two slots' worth, because the disk system fills one set of buffers, while the network system empties the other set of buffers. The network system empties the buffers and sends across the network at the negotiated bit rate until all the data it is required or allowed to send has been sent. The same process of buffering is performed at the client, where one set of buffers is used to read data from the network and the other set is used by the display system to decode and present the data. The CMFS has chosen to make the required client buffers the size of the two largest slots. In this case, there is space in which to receive all the data for the largest slot before decoding and presentation to the user as well as space to receive the next slot of data. For variable bit-rate data, it is possible that a large percentage of the data in a slot is required for a particular presentation unit, and so the entire slot's worth of data must be present before decoding, since the data arrives at a constant rate and may not be available in time if decoding starts early. Additionally, if the large amount of data was required for the last presentation unit, buffer space at the client would be still in use for decoding when the data needed to be read for the next slot. This would manifest itself in intra-slot jitter in the former case, and buffer overflow in the latter case. Therefore, a slot size of several seconds would require multiple Megabytes of client buffer space for a moderate bandwidth video stream. For a stream with a peak-rate of 10 Megabits per second (1.2 MBytes/second) and a slot size of 5 seconds, this would be approximately 12 MBytes. For the same stream with a 500 msec slot, the client buffer space would be 10% of this value. The disk system keeps a schedule of the number of read operations which are required for every slot for the active streams on each disk. For a two-hour schedule and 500 msec slots, this is 14,400 slots. Each active stream also has a particular delivery schedule that indicates which bytes within each disk block are to be delivered per slot, since not all the data must be delivered to the client in every case. With small slot sizes, the amount of bookkeeping information that must be
stored at the server is quite significant. Larger slot times are better from a disk performance point of view, since a greater amount of contiguous reading is possible (assuming the data for a stream is stored contiguously on the disk). Smaller slot times may increase the relative amount of time the disk spends seeking, since the read operations for a slot correspond to a shorter playback duration.
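As a rough rule of thumb, the minimum client allocation discussed in this section can be approximated as

    client buffer ≈ 2 × peak rate × slot time,

so the 10 Megabit per second (1.2 MByte/second) example above requires about 2 × 1.2 MBytes/second × 5 seconds = 12 MBytes with 5-second slots, and roughly one tenth of that with 500 msec slots; the factor of two comes from the two-largest-slots rule described above.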
3.4 Data Delivery and Flow Control

A potential problem for the client is that the negotiated bandwidth for the individual connection may not be available for the entire duration of retrieval. If significant loss is experienced in the presentation for some reason (perhaps the network becomes overloaded with unrelated traffic), a human user may become dissatisfied with the presentation. A few options exist for solving this problem. One option is to have the server detect this loss (either directly or by negative acknowledgments from the client), and automatically adjust data delivery to eliminate the network congestion. This assumes that the client can still decode enough of the residual portions of the stream which will be sent and/or that the server can intelligently determine what to send and what to discard. As well, the server would still be reading all the data originally requested, utilizing disk resources for data that cannot be sent. A solution based on this principle is given in [82], where the client and server co-operate on defining the order of the units to be delivered to the client and the server continues to send complete presentation units. Not all units are sent and thus, a lower frame rate for video display is presented to the user. Another possibility is for the server to transcode the data which is read off the disk to provide a lower data rate for the stream. This would require extensive server CPU resources, or hardware support for every encoding format. As well, disk bandwidth is used to extract data from the disk which cannot be sent to the client. In this case, a stream which cannot send all the data it is reading off the disk may
prevent other stream requests from being accepted due to this wasted bandwidth. With point-to-point network connections between the server and the client, it is a better use of resources to accept streams which can be successfully delivered as well as retrieved. In keeping with the design philosophy of the CMFS, which disregards stream encoding details, the best place to handle the degradation of the quality is in the client application. The client can issue a request to prepare the stream with different delivery parameters, while maintaining as much continuity of presentation as possible. The interface to this facility is CmfsReprepare. If the server is able to support the new request, the new block schedule is used and buffers belonging to the original prepare request may be flushed and/or sent to ensure the continuity of presentation. Whether a block that is buffered at the server is sent or discarded depends on when it is required at the client. If a stream with low bandwidth is re-prepared, it would be more appropriate to discard blocks which are queued at the server, since it is conceivable that many seconds of data could have been read ahead, and these buffers no longer correspond to requested data. It would be very awkward to adjust the offsets in existing queued data so that only the bytes appropriate for the new request are transmitted. In the case of video, if only a small number of seconds of video are buffered, then continuing to send them would be more appropriate. The client must also have some way to determine when resources have been freed up so that the quality of transmission can be resumed. An exponential backoff timing mechanism could be used to incrementally request more bandwidth. The design of the CMFS makes this entirely a client issue. The client is responsible for determining when to issue a CmfsReprepare with an increase in bandwidth requirements. The server translates this into a pseudo-CmfsPrepare request where the "stream" to be admitted is simply the difference between the new schedule for the stream and the existing schedule. If that stream can be accepted, then CmfsReprepare
returns successfully with increased presentation quality. Otherwise, delivery continues according to the previous schedule. To ensure that client buffers do not overflow, the CMFS implements a mechanism for flow control based on a credit. Credit is defined to be the number of bytes that the server is allowed to send to the client during the current slot. This value is determined based on the knowledge of client buffer space and network bandwidth associated with both the connection and the entire node. When there is ample buffer space at the client, credit is issued so that the server can send at the full rate of the connection to fill client buffers and reduce future network utilization. Once the client buffer has been filled, credit is issued based solely on the number of bytes of data that have been presented to the user in that slot time, and thus freed at the client. The mechanism is implemented by the Network Manager thread. This thread knows the rate of each connection and the amount of buffer space at each client as well as the amount of data to be displayed per slot. Without flow control of some kind, the Stream Manager would send as fast as the network would allow or as fast as the disk could read, causing overflow at one or more of the following locations: 1) network buffers at the server, 2) buffers in the network, or 3) buffers at the client. The flow control prevents overflow or starvation by having the Stream Manager wait for credit from the Network Manager before sending data across the network. Buffers are queued between the disk and the Stream Manager until the system runs out of buffer space. No network communication from the client is required once the start packet from CmfsRead has been received. Further details of the implementation of this mechanism can be found in Section 5.1.
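A minimal sketch of the per-slot credit computation described above follows. The structure fields and function name are assumptions made for illustration; the actual Network Manager also accounts for the network bandwidth available to the entire node.

/* Hypothetical per-connection state kept by the Network Manager;
 * all field and function names are illustrative only. */
typedef struct {
    long maxRateBytesPerSlot;    /* negotiated connection rate, per slot time   */
    long clientBufBytes;         /* total buffer space allocated at the client  */
    long bytesInClientBuf;       /* data sent ahead but not yet consumed        */
    long bytesConsumedLastSlot;  /* data presented to the user in the last slot */
} Connection;

/* Credit: bytes the Stream Manager may send on this connection this slot. */
long slotCredit(const Connection *c)
{
    long freeSpace = c->clientBufBytes - c->bytesInClientBuf;

    /* Ample client buffer space: issue credit for the full connection rate
     * so that client buffers fill and future network utilization is reduced. */
    if (freeSpace >= c->maxRateBytesPerSlot)
        return c->maxRateBytesPerSlot;

    /* Client buffers are (nearly) full: only the bytes consumed in the last
     * slot have been freed, so issue that much, capped by the free space. */
    return (c->bytesConsumedLastSlot < freeSpace) ? c->bytesConsumedLastSlot
                                                  : freeSpace;
}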
3.5 Real-Time Writing

The server allows reading and writing to be performed at the same time. When the server is able to achieve a read bandwidth greater than minRead, the extra
bandwidth can be used for additional read-ahead, subject to buffer availability. This portion of the retrieval read-ahead is not guaranteed by the server, so the server can postpone this read-ahead in favour of writing an object to the server. The size of this "bonus" read-ahead varies according to the block location and seek activity on the disk for retrieving the requests of the accepted streams. If the amount of reading required in the current slot is less than minRead, the remaining bandwidth could also be used for writing. Designing the system so that real-time writing can take place is possible. This reservation may be in vain, however, because the server cannot require the client to provide the network packets at the appropriate time in order to keep up with the promised rate of writing to the disk. Reservation of bandwidth could be done in a similar manner to that done in a CmfsPrepare request, if the size of each presentation unit is known ahead of time. In this respect, read and write operations are exact inverses of each other. If the client is slower than the reserved rate, then server resources are allocated which are not being used properly. Retrieval may be denied when resources are actually available. The major reason that real-time writing is infeasible for continuous media is that the writing of data must be done reliably. This requires retransmissions of lost or corrupt data. The round-trip latencies involved prevent guaranteed real-time delivery.
3.6 Implementation

3.6.1 Environment and Calibration

The CMFS has been implemented and tested on several hardware and software platforms. Most of these are UNIX-based workstation environments. In particular, versions of the server exist for IBM RS/6000 computers using AIX, SUN Sparcstations running SUN OS 4.1 or Solaris 2.5, and Pentium-based PCs running Linux, FreeBSD, Solaris, or Windows NT.
Client applications have been written on all of these platforms as well as Windows 95. The client applications range from simple directory listing programs, to complete writing utilities and several display clients. The display client for the SUN Sparc architecture utilizes a Parallax MJPEG decoder card, while the IBM AIX client utilizes the Ultimotion MJPEG decoder card. Both of these clients request independent audio and video streams and synchronize them at the client. The Parallax decoder is capable of displaying NTSC quality video (640 x 480) at 30 frames per second, as is the Ultimotion decoder card. Unfortunately, as mentioned previously, the Ultimotion card is not capable of encoding at 30 frames per second. Client applications that decode MPEG video in software on UNIX have also been written, but sustain much lower frame rates and resolutions. A Windows client uses a Real Magic MPEG decoding card as well as software decoding. As well, various dummy clients have been implemented and used for stress testing the server. These clients discard all the data and are used to keep various statistics on the delivery of the data. The network environment for initial testing consisted of a Newbridge ATM network switch connecting the clients, the administrator, and the server nodes via 100 Mbps multi-mode fibre. This provided a small scale CMFS with one administrator and as many as two nodes. Another server environment has been established on a 100 Mbps switched Ethernet network. Several Pentium based machines are connected to this network. Currently, the hardware platforms on which the CMFS has been implemented utilize the raw disk interface as provided in the UNIX operating system in configurations that have dedicated disks. These disks are attached by a SCSI 2 Fast/Wide adapter providing a bandwidth of 20 MBytes/second. The configuration of such nodes contains four disks with 2 GBytes capacity each. The low-level I/O facilities of some versions of UNIX provide an asynchronous mechanism for reading and writing of data blocks. This feature is utilized wherever possible. The server node
issues requests in groups so that the disk controller (typically SCSI) and lower-level software/firmware can reorder the requests for the best performance. When buffer space is available, minRead requests are issued simultaneously. They are guaranteed to complete within a slot time. When fewer than minRead buffers are available, requests are made simultaneously for the number of available buffers, as there is no point in delaying the disk requests unnecessarily so that the asynchronous parallelism can be achieved. The initial calibration of the disk utilized the raw interface for AIX connected to a Seagate Barracuda model ST32550W on an IBM RS/6000 model 250. A bandwidth of 40 blocks per second was achieved in every test of the calibration program, suggesting 20 as the value for minRead. The read requests were each for one 64 KByte block when the blocks were spaced evenly across the surface of the disk. When asynchronous facilities were used in the same test, 23 blocks was the largest number of evenly spaced requests that could be satisfied within 500 msec. The worst case read time for 23 blocks was 508 msec. Given timing granularities and the fact that this is a worst case example, 23 is a more accurate value for minRead. A third method of calibration utilized the CMFS to calibrate the disk performance. Simultaneous requests for several CBR streams were submitted to the server to determine the worst case disk performance. The server was capable of supporting 23 streams which were spread out across the entire surface of a single disk and required an average of 1 block per slot, so the level of seek activity was high. During this calibration phase, an anomaly regarding disk performance was observed. One of the disks was capable of reading 26 blocks per slot if it occupied a certain position on the SCSI chain. If the disks were physically rearranged, then it was able to only achieve 23 blocks per slot. Some of the examples in Chapter 4 use 26 as a value for minRead, as some of the initial experimental work was carried out on that disk. Such anomalies highlight the importance of using a calibration
program to calculate minRead rather than static analysis based on the disk drive technical characteristics.
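The first calibration method can be approximated by a small stand-alone program along the following lines. The device path, disk size, and the use of synchronous reads are assumptions made for this sketch; the actual calibration also exercised the asynchronous read interface and ran repeatedly to capture the worst case.

#include <fcntl.h>
#include <stdio.h>
#include <sys/time.h>
#include <unistd.h>

#define BLOCK_SIZE   (64 * 1024)
#define TOTAL_BLOCKS 32768            /* 2 GByte disk / 64 KByte blocks */

static char buf[BLOCK_SIZE];

/* Read n blocks spaced evenly across the raw device; return elapsed msec. */
static long readEvenlySpaced(int fd, int n)
{
    struct timeval start, end;
    long stride = TOTAL_BLOCKS / n;
    int i;

    gettimeofday(&start, NULL);
    for (i = 0; i < n; i++) {
        lseek(fd, (off_t)i * stride * BLOCK_SIZE, SEEK_SET);
        if (read(fd, buf, BLOCK_SIZE) != BLOCK_SIZE) {
            perror("read");
            return -1;
        }
    }
    gettimeofday(&end, NULL);
    return (end.tv_sec - start.tv_sec) * 1000 +
           (end.tv_usec - start.tv_usec) / 1000;
}

int main(void)
{
    int fd = open("/dev/rsd1c", O_RDONLY);    /* example raw device name */
    int n;

    if (fd < 0) {
        perror("open");
        return 1;
    }
    /* minRead is the largest n whose worst-case time fits in a 500 msec slot. */
    for (n = 18; n <= 30; n++)
        printf("%d evenly spaced blocks: %ld msec\n", n, readEvenlySpaced(fd, n));
    close(fd);
    return 0;
}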
3.6.2 Implementation Environment

The CMFS is implemented in C and utilizes a user-level threads library (RT Threads [62]), developed at UBC to support real-time distributed software systems. RT Threads provides mechanisms for co-ordinating access to shared data within a UNIX process via semaphores and uses a Send/Receive/Reply message passing mechanism that can be used between threads in the same address space or between threads in different address spaces. On operating systems (such as Windows NT and Solaris) which are already threads-based, or have system level threads available, an RT Threads application can run within a single thread or RT Threads can be mapped one-to-one with system-level threads. Detailed performance analysis in Mechler [62] shows that in nearly all cases, RT Threads performance is comparable to that provided by host operating systems. In some cases, the primitives provided by RT Threads significantly outperform those of the host operating system. In particular, the real-time features of Solaris Threads and AIX Threads either require special privileges (such as running as root) in order to achieve real-time performance or create situations in which overloading the system makes the entire machine unusable [69].
3.6.3 Transport Protocol Implementations

The details of the transport protocol are beyond the scope of this dissertation as the low-level delivery of data does not influence the internal design of the server. All the server requires is: 1) request/response messages and continuous media storage messages must be delivered reliably, using a bounded quantity of server resources, and 2) continuous media data must be delivered to client applications on time. The overhead associated with protocol stack processing does affect the amount of
remaining processor resources at the server node and client, however, so a few words are in order. The minimum requirements of a transport protocol with respect to the CMFS are that: 1) retransmissions of lost CM data do not affect the timing of delivery at the presentation device (i.e. the client application is unaware of retransmissions), 2) lost data can be detected, and 3) quality of service parameters can be specified for the connection. Retransmissions need not be excluded, but should only use excess bandwidth and then only to send packets that will be delivered before the application's deadline [82]. In other words, for the CMFS, if it is possible for the transport layer to detect missing data for a stream that has several seconds of data enqueued at the client, retransmitting that packet will be invisible to the client (in terms of delay) as long as the packet arrives before the application requests it. The server must make some decision as to whether or not it has the bandwidth to resend the data. In order to be capable of performing retransmissions, the server must retain packets previously sent on a connection for some length of time in anticipation of retransmission requests. The server must also decide how much buffer space it can devote to these packets. In some cases, the packet may have expired at the server before the retransmission request is received. Since many existing hardware environments do not efficiently support quality of service specification, it is possible to relax the third condition in some cases. Where quality of service is not supported in the network, the server can be installed so as to provide service for low to medium bandwidth requests that are below the bandwidth of the entire network. The network bandwidth itself becomes the limiting performance factor and neither the disk bandwidth nor the network interface limits can be reached. A server configured in this manner can still be used effectively for these types of requests (such as audio or lecture slides, etc.). The first choice for a transport level protocol for both messages and raw data was XTP [83]. It provides the ability to have reliable or non-reliable data flows on
either side of a connection along with other QoS parameters. Initial experiences indicated that the implementation of XTP in our environment incurred significant queuing and processing overhead which limited the amount of throughput and thus, the number of simultaneous users. Another protocol was then developed for raw data transfer which utilized the basic features of UDP/IP with some added sequence number checking. TCP/IP was used to implement the reliable message transfer protocol. The data transfer protocol is called MT (Media Transport) and is described in [63]. In certain network environments, the round trip time for messages sent via TCP/IP was often unacceptably high. Thus, a UDP-based reliable request/response protocol was introduced for the transmission of reliable messages which fit into a single UDP packet. Additionally, another protocol which has been used for data transfer is RTP (Real-time Transport Protocol) [79]. This protocol provides timestamps which a client application may utilize in sending data to decoders and/or display devices. RTP operates on top of MT. MT provides sequence numbers for detecting holes and just-in-time retransmission. There are at least two advantages to the client for using RTP. First, the latency in receiving timing information for the stream is eliminated, since this information is placed in the stream. Previous client applications obtained this timing information from attributes stored in the administrator database. For low-bandwidth clients (such as those accessing data across a modem link), this delay is unacceptable. The timing information placed in an RTP packet is the display time of the first presentation unit contained within the packet. Timing information about subsequent presentation units in the packet can then be determined either by parsing the data itself (depending on the format) or by using an RTP payload type which includes such information. The second major advantage is that using a standard RTP payload type to transmit CMFS data allows non-CMFS client applications to be the ultimate
recipient of the data, without the need for parsing the CMFS header information on-the-fly at some intermediate location. The control of the transmission would need to be performed by some proxy client, but the raw data could be sent directly to a different client. This discussion of protocol implementations has been restricted to unicast point-to-point communication. This can be extended to multi-cast transmission [21]. It is quite possible that several clients could request the same object at approximately the same time. A simple enhancement to the server would be to treat this as a single request. It could do this by creating a multi-cast group for the receivers, and retrieving only one copy of the object, sending it out on the multi-cast address. It is also possible to have a proxy client represent the multiple recipients, so this functionality could be provided in a manner transparent to the server.
3.6.4 Server Memory Requirements

One of the uses of memory at the server is to keep state information. Large data structures are needed at two levels: per-disk and per-stream. Each of the prepared connections has a significant amount of storage dedicated to recording the precise blocks which must be read and the bytes which must be delivered. If a large number of simultaneous streams are permitted, this memory usage may be very large. The first data structure stored is the schedule for each disk. This is used in admission control and contains one integer for each slot. This is 14,400*4 = 57,600 bytes per disk for a two-hour circular schedule. The more resource-intensive data structure is the specific block list and corresponding offsets into each block that must be stored for each prepared stream. This is also influenced by the size of the sequences used in storing the object and whether or not the sequences to be delivered are stored contiguously. If some of the stream is to be skipped, there will be discontiguities in the disk block locations for a stream. For each stream in each slot, there is a blockDescriptor structure. This
contains the starting block number, the number of blocks to be read and the ending offset byte pointer, as well as some other counters and flags. For large bandwidth streams that are stored contiguously on the disk, this amounts to 28 bytes per slot. For a 10 minute stream, this is 33,600 bytes. A server which is capable of supporting 100 streams of this length requires 3.3 MBytes just for the block descriptor array. For 100 minute streams, this would be 33 MBytes. For the network system to properly apportion credit to the connection for each stream, the playout vector is stored as part of the connection state, so this adds 4 bytes per slot as well. The situation is worse if small sequences are used and a non-zero value of skip is provided in the prepare request. A linked list of fragment descriptors is kept for each sequence that must be retrieved within a slot. This could add as much as 12*14=168 bytes per slot for the 14 extra sequences with 500 msec slots, 30 frame per second video objects and a sequence size of 1 frame. In this worst case, approximately 200 bytes are required per slot and this would be 2,400,000 bytes for a 100 minute stream. This consideration in scalability must be taken into account when configuring the CMFS. It should be noted that in order to support such a large number of streams from a single server node, either the number of disks must be large, or the individual stream bandwidth must be small. In the former case, server memory for disk buffers would also be great and the total memory requirements would be very large. In the latter case, it could be that less buffer space is needed to support the variable bit-rate streams at lower bit-rates. Thus, more of the server memory could be allocated to connection state management. Another use of memory at the server is for buffer structures (QueueBuffers) which are manipulated by the network sending process. A pointer to the disk buffer along with offsets is stored for every block that is read off the disk. Thus, this amount of memory has a lower bound. If a buffer is to be shared for more than one slot, then a new QueueBuffer structure is created to indicate the starting and
ending offset into the block, since only part of the block is to be sent during that particular slot time. Some obvious optimizations could be made which would reduce the total memory usage, but not by a significant factor. A single disk server node supporting up to 10 large bandwidth video streams and 10 associated audio streams (of playback length not more than 10 minutes) would require a modest 672,000 bytes, at a minimum. If small sequences were used, the memory needed for schedule and connection state management could reach as high as 5 MBytes. Although the server is capable of handling many different types of streams at differing bit-rates, it may be advantageous, purely from a performance/cost point of view, to configure systems differently for low bit-rate audio streams of low variability (or constant bit-rates) than for highly variable, high-quality video streams. A server that must store both types of streams could use a hybrid approach. The CBR stream server could have more memory associated with connection state, since fewer buffers are needed to smooth out peaks, while the highly-variable video streams could limit the number of simultaneous connections thereby freeing more space for use as disk buffers.
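For concreteness, the two per-stream bookkeeping structures discussed in this section might be laid out roughly as follows. The field names, and the particular combination of counters and flags that yields 28 bytes per blockDescriptor, are assumptions made for this sketch.

/* Hypothetical layouts matching the sizes quoted above (4-byte integers);
 * the actual field names and contents are not given in the text. */
typedef struct blockDescriptor {
    int startBlock;          /* first disk block to read for this slot        */
    int numBlocks;           /* number of contiguous blocks to read           */
    int endOffset;           /* ending offset byte pointer in the last block  */
    int flags;               /* contiguity/skip flags                         */
    int counters[3];         /* other counters: 28 bytes in total             */
} blockDescriptor;           /* one per prepared stream per slot              */

typedef struct QueueBuffer {
    char *diskBuffer;        /* the 64 KByte block read off the disk          */
    int   startOffset;       /* first byte of the block to send in this slot  */
    int   endOffset;         /* last byte of the block to send in this slot   */
    struct QueueBuffer *next;
} QueueBuffer;               /* one per block queued for the network sender   */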
3.7 Example Client Application

The flexible API of the CMFS allows a client application to arbitrarily combine multiple streams for presentation to the user. In this section, an example is developed which illustrates this flexibility. Some of the details involved in accessing data from different servers are omitted. Consider a video display client with associated audio and text players. The user requests a particular copy of the audio, namely the Japanese language with classical music background, the close captioned text in English, and a 30 frame per second video object of the news story. The user does not care which copy of the video or text object is retrieved. Sample code fragments in the C programming
language are shown in the following description for each major interaction with the server. The first operation is to identify the server node to which the client first wishes to make contact. Since the client knows the identity of the server with the Japanese audio, this is the first server contacted.

CmfsInit(jap_admin_addr, ADMINPORT);
Then, the objects associated with the presentation must each have a connection opened for them. Some method of determining the UOIs is required and then the following open calls are performed.

if (CmfsOpen(LL_audioUOI, myCallback, &audioPrepBound, &a_cid,
             localIp, 0) != STREAMOK) {
    printf("CmfsOpen failed for Audio Stream\n");
    return (-1);
}
if (CmfsOpen(HL_videoUOI, myCallback, &videoPrepBound, &v_cid,
             localIp, 0) != STREAMOK) {
    printf("CmfsOpen failed for Video Stream\n");
    return (-1);
}
if (CmfsOpen(HL_textUOI, txt_Callback, &textPrepBound, &t_cid,
             localIp, 0) != STREAMOK) {
    printf("CmfsOpen failed for Text Stream\n");
    return (-1);
}
If the CmfsOpen call fails because the object does not exist on that server, then a call to CmfsLookup could be performed. The next operation is the prepare of each stream. Assume that it is possible to structure the application as a set of threads which independently request transfer of the presentation object. If vPrepBound is 2.3 seconds and aPrepBound is 1.6 seconds and tPrepBound is 0.7 seconds, then the following code fragments could be used as the bodies of each mono-media player. Each thread waits a different amount
of time before issuing CmfsPrepare so that it can more easily schedule the CmfsRead operations at the proper time.

Audio Thread:

WakeUpAt(now + 0.7 /* seconds */);
if ((status = CmfsPrepare(a_cid, scheduletime, STARTOFSTREAM, ENDOFSTREAM,
                          100, SkipFactor, delay)) != STREAMOK) {
    fprintf(stderr, "CmfsPrepare failed, status = %d\n", status);
    return (-1);
}
do {
    status = CmfsRead(a_cid, (void **)&buf, (int *)&numRead);
    switch (status) {
        /* .... handle status, put data into device queue ..... */
    }
    /* free data buffer if all data is accounted for */
    if (allDataUsed)
        CmfsFree(buf);
} while (!done);
Video Thread:

if ((status = CmfsPrepare(v_cid, scheduletime, STARTOFSTREAM, ENDOFSTREAM,
                          100, SkipFactor, delay)) != STREAMOK) {
    fprintf(stderr, "CmfsPrepare failed, status = %d\n", status);
    return (-1);
}
do {
    status = CmfsRead(v_cid, (void **)&buf, (int *)&numRead);
    switch (status) {
        /* .... handle status, put data into device queue ..... */
    }
    if (frameIsComplete)
        sendDataToDisplayDevice();
    if (allDataUsed)
        CmfsFree(buf);
} while (!done);
Text Thread:

WakeUpAt(now + 1.6 /* seconds */);
if ((status = CmfsPrepare(t_cid, scheduletime, STARTOFSTREAM, ENDOFSTREAM,
                          100, SkipFactor, delay)) != STREAMOK) {
    fprintf(stderr, "CmfsPrepare failed, status = %d\n", status);
    return (-1);
}
do {
    status = CmfsRead(t_cid, (void **)&buf, (int *)&numRead);
    switch (status) {
        /* .... handle status, put data into device queue ..... */
    }
    /* free data buffer if all data is accounted for */
    if (allDataUsed)
        CmfsFree(buf);
} while (!done);
The synchronization of these streams could be performed by another group of threads which wait on a barrier, or other synchronization primitive, and then display the appropriate data on the respective device. The primary display client (the Parallax SUN MJPEG client), which has been used for demonstration purposes and for some performance testing, uses the audio device as the master and re-synchronizes the streams once a second. It is important to note that in rewind mode, the sequences are sent in reverse order, but the data in each sequence is sent forwards. This allows some of the contiguity of the placement to be used in disk retrieval. It also permits the server to be unaware of presentation unit boundaries. Although this information is present at the server, the effort involved to retrieve and send the presentation units in reverse order was not considered a wise use of processor time. In the case of MPEG video data, this would completely confuse any decoder, because it would require the video frames in forward order for proper decoding of the inter-coded frames. It is the client's responsibility to determine in what order to present the video frames to the user. The Parallax video client places the decoded video frames on a software stack as they are received across the network and then pops the stack once an entire sequence has been received.
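The reordering performed by the Parallax client in rewind mode can be sketched as follows; Frame, framesInSequence, nextNetworkBuffer, decodeFrame, and displayFrame are hypothetical names standing in for the client's actual decode and display routines.

/* Rewind-mode reordering sketch: sequences arrive in reverse order, frames
 * within a sequence arrive forwards, so decoded frames are pushed onto a
 * stack and popped for display once the entire sequence has been received.
 * All identifiers in this fragment are illustrative. */
#define MAX_SEQ_FRAMES 32

Frame *frameStack[MAX_SEQ_FRAMES];
int top = 0;
int i;

for (i = 0; i < framesInSequence && top < MAX_SEQ_FRAMES; i++)
    frameStack[top++] = decodeFrame(nextNetworkBuffer());

while (top > 0)
    displayFrame(frameStack[--top]);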
Chapter 4
Disk Admission Control

The second major contribution of this dissertation is the development of a detailed disk admission control algorithm that explicitly considers the variability in the bandwidth requirements of each presentation object. This algorithm examines both the raw disk bandwidth and the server buffer space available when determining if the server has enough disk and memory resources to retrieve the data stream associated with each new request. This is in contrast to other approaches which consider one of these two resources in isolation, or provide a coarse-grained characterization of the stream bandwidth over time. The disk admission algorithm emulates/simulates the disk reading of all the blocks required for the set of streams presented to the server when a new request arrives, and so it is called the vbrSim algorithm [68]. This is a somewhat misleading name, as the algorithm does not perform a simulation, but rather a worst case emulation of which disk reads would be performed during each slot time. The purpose of this chapter is to examine all aspects of the disk admission control question. Several alternative approaches to disk admission control are presented. They are compared with the vbrSim algorithm in terms of complexity and accuracy of admission results. Next, a series of performance tests are analyzed which show that the algorithm performs well on real data that is representative
of a News-On-Demand environment with high-quality, full-motion video streams. These experiments expand on the initial findings reported in Makaroff et al. [58]. If a request pattern has a stagger between requests of several seconds, enough streams can be accepted such that requests for nearly all of the disk bandwidth can be reserved at the same time. Even in situations where requests arrive simultaneously, the vbrSim algorithm accepts stream requests for up to 20% more bandwidth than the next best deterministic-guarantee algorithm. As a conclusion to the chapter, an analytical discussion of the asymptotic behaviour of the vbrSim algorithm is presented. This shows that the admission performance compared with an optimal algorithm degrades linearly as the estimate of disk performance (minRead) differs from the observed rate of disk performance.
4.1 Admission Control Algorithm Design

There are several possible approaches to disk admission control. They can be deterministic-guarantee algorithms or statistical-guarantee algorithms. Providing a deterministic guarantee ensures that there will be no loss of continuity because the server's requirements were over-subscribed. Such admission control algorithms may be too conservative and admit too few streams, thereby under-utilizing the available resources. On the other hand, statistical-guarantee admission policies can typically admit more streams, resulting in better utilization. It is possible that such an algorithm admits too many streams, resulting in over-utilization, which manifests itself as delay or loss of data at the client. Although probabilistic methods exist to amortize the cost to the clients of this failure [7, 87], this is undesirable in general. This is a tradeoff that must be evaluated when designing a CMFS, and indeed any system that provides quality of service guarantees. Deterministic-guarantee algorithms tend to consider peak bandwidth requirements to prevent overload situations, while the aggressive algorithms use average requirements and summary characterizations. In this section, five distinct approaches
to VBR disk admission algorithms are considered. Only four of these can be implemented. Each algorithm represents a class of admission approaches which provide generally similar results. They are examined analytically and quantitatively in terms of admission performance and buffer utilization for a realistic set of stream requests. Three deterministic-guarantee algorithms are considered: Simple Maximum, Instantaneous Maximum, and vbrSim. One algorithm provides a statistical guarantee: Average. The results will show that the vbrSim algorithm can efficiently make correct admission decisions and that its admission performance approaches that of an optimal algorithm. Of the three deterministic algorithms, vbrSim is provably the best in admission performance. In order to accept more streams, significant buffer space is required to accommodate read-ahead. It will also be shown that under realistic server conditions, vbrSim also outperforms the Average algorithm.
4.1.1 Experimental Setup and System Measurements

A set of stream requests submitted to a CMFS as a unit is defined as a scenario. Scenarios can consist of simultaneous request arrivals, in which case all acceptance decisions are made during the same slot time. They may also consist of staggered arrivals, modeling a more realistic workload for a single disk in a CMFS. To analyze the disk admission algorithms, all scenarios considered in this chapter are for streams located on the same disk. The scenarios which are used for testing are described in detail in Appendix B. When requests are staggered, a uniform stagger is used. This method was chosen partly because it was easy to implement and enforce with the client application software available, but also because it provided the best performance. The benefit of staggered arrivals comes from contiguous reading and, with even amounts of time reading each stream, the results show that the system is able to read from one stream only until all streams are active. Enough read-ahead is achieved during the start-up time for each stream that no more blocks are needed while the next
stream is attempting to catch up. The worst case would be if most of the streams arrived together with some arriving a long time later. This situation would suffer the seek penalty of having multiple streams start at approximately the same time. It could also have the effect of having a larger amount of read-ahead for a single stream if the delay between the first and the second stream was long. Although it is unclear what the exact effect of a non-uniform stagger would be on read-ahead, these tests assume that a uniform stagger of n seconds would not be significantly different than staggers randomly or normally distributed with a mean value of n seconds. To understand the relevant differences between the disk admission algorithms, three resources are measured: the CPU cycles used in determining admissibility, the disk read bandwidth, and the number of buffers available. The number of machine instructions required to execute the admission control algorithm is important, because a very accurate algorithm that cannot make a decision in a timely manner is not useful in a real-time system such as the CMFS. The bandwidth measure has three components itself: the first is the bandwidth guaranteed by the system (previously defined as minRead). This estimate is used by all algorithms as the capacity of the disk subsystem. The second component is the bandwidth requested by the set of stream requests (via CmfsPrepare) that comprise a typical workload submitted to the server. This is measured as the sum of the average bandwidths of each stream, and represents a more realistic measure of the service provided than simply the number of simultaneous users. The third component is the actual bandwidth achieved in the delivery of a scenario. An algorithm is considered to perform well if it can accept a scenario with average requirements that approach or exceed minRead and approach the actual bandwidth achievable. The number of buffers available to the algorithms is limited by the amount of main memory at the server. An algorithm which makes use of significant buffer space is more costly than one which does not, and will reject streams if the buffer requirements exceed the available capacity even
though the disk bandwidth limit may not impose a restriction. Before the differences in each algorithm are described, the common activities within each approach will be identified. In some algorithms, the result of one or more of the steps described may be precomputed off-line and the results stored for use at admission time. Whenever a client makes a prepare request for a portion of a media stream at a particular display rate, a block schedule for the stream is created that contains one entry per slot for the duration of the stream playout. Each entry in the schedule is the number of disk blocks (in this instance, 64 KByte blocks) that must be read and delivered for the stream in that slot to ensure continuous client playout. These values are influenced by the speed and skip parameters of the prepare request. The input for the block schedule calculation is the playout vector that was stored when the stream was written to the server as well as the start, stop, speed, and skip parameters. For a constant bit rate stream, each value in the block schedule would be the same (modulo disk block granularity). The values would vary for VBR streams in a manner dependent on the encoding. For instance, Figure 4.1 presents an excerpt of the block schedule from one of our sample streams. The number of blocks in the schedule may actually provide more data than required for a particular slot, because blocks are always read in their entirety to maximize performance. A specific block schedule for an entire stream (Maproom - Raiders) is shown in Figure 4.2. This particular schedule is from a six-minute scene from the movie Raiders of the Lost Ark.
Slot     1   2   3   4   5   6   7   8   9   10   11   12
Blocks   2   3   6   6   6   7   6   6   7    7    6    8

Figure 4.1: Typical Stream Block Schedule
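As an illustration of how such a block schedule can be derived, the following sketch converts a per-slot playout vector (bytes to be delivered in each slot) into whole 64 KByte blocks. It ignores the speed and skip parameters and assumes the stream is stored contiguously, so it only approximates the calculation the Disk Manager actually performs.

#define BLOCK_SIZE (64 * 1024)

/* playout[i]: bytes that must be delivered for slot i (from the playout
 * vector stored with the object); schedule[i]: whole blocks to read in slot i.
 * Blocks are read in their entirety, so a block straddling a slot boundary
 * is charged to the earlier slot. */
void buildBlockSchedule(const long *playout, int *schedule, int numSlots)
{
    long cumBytes = 0;      /* total bytes required through slot i   */
    long cumBlocks = 0;     /* total blocks scheduled through slot i */
    int  i;

    for (i = 0; i < numSlots; i++) {
        long needBlocks;
        cumBytes  += playout[i];
        needBlocks = (cumBytes + BLOCK_SIZE - 1) / BLOCK_SIZE;   /* ceiling */
        schedule[i] = (int)(needBlocks - cumBlocks);
        cumBlocks   = needBlocks;
    }
}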
[Figure 4.2: Stream Block Schedule (Entire Object) — the MapRoom block schedule: blocks required in each disk slot (slots 1 through 721) for the six-minute Raiders scene.]
As blocks are read from the disk, they are stored into buffers which are then passed to the network for transmission to the clients. The speed at which buffers are filled is dependent on how fast the server reads blocks from the disk (it will be at least as fast as minRead). The speed at which the buffers are freed depends on how quickly the network can transmit the data to the client. This latter speed is itself dependent on the speed of the network and the number of buffers that the client has allocated to receive the data. The network management system is assumed to transmit data only as fast as the client can consume the data. The cumulative block schedules are combined into a server block schedule. All server block schedules that have the disk in a state where the requirements do not
exceed the resources are said to be valid schedules, corresponding to valid scenarios. This characteristic is independent of whether any of the algorithms admits all the streams in a scenario.
4.1.2 Simple Maximum

The most straightforward characterization of a stream is to reduce the bandwidth requirement description to a single number. In the Simple Maximum algorithm, the maximum number of reads required in any slot is chosen. This is referred to as peak allocation [24, 89] in other research. If the sum of this maximum value for the new stream plus the current sum of the values for the accepted set of streams is greater than minRead, the new stream must be rejected. Using the block schedule in Figure 4.1, for example, 8 would be chosen as the value for the stream. If the current sum was 17 and minRead equaled 23, the new sum of 25 would result in a rejection. A clear advantage of this algorithm is its simplicity. If the variation in the stream's block schedule is small, then this is a reasonable algorithm. In fact, it has been used in several CBR file systems [32, 73, 74]. Another advantage of this algorithm is that it produces deterministic guarantees for reading from the disk. Unfortunately, it significantly under-utilizes the resources as block schedule variation increases, rejecting streams which could be delivered. If the peak is twice the average, as is the case in most of the streams digitized for the performance tests, no requests for total bandwidth greater than 50% of minRead could be accepted. In one particular study [8], twelve video samples were used where the peak to mean ratio ranged from 6.6 to 13.4. In such an environment, Simple Maximum would accept a very small number of streams and waste a large amount of bandwidth.
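In code, the Simple Maximum test amounts to a peak comparison, as in the following sketch (the names are illustrative):

/* Simple Maximum: characterize the new stream by the largest number of
 * blocks it needs in any slot; admit only if the peak-allocation sum of all
 * streams (including this one) stays within minRead. */
int simpleMaxAdmit(const int *blockSchedule, int numSlots,
                   int currentPeakSum, int minRead)
{
    int peak = 0, i;

    for (i = 0; i < numSlots; i++)
        if (blockSchedule[i] > peak)
            peak = blockSchedule[i];

    return (currentPeakSum + peak <= minRead);   /* 1 = admit, 0 = reject */
}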
4.1.3 Instantaneous Maximum

The next admission control algorithm considered keeps the sum of all of the currently admitted stream block schedules in a vector called the server block schedule. When a new stream is to be admitted, its block schedule is added to the current server block schedule. If the value in any slot in the resulting schedule is greater than minRead, the new stream is rejected; otherwise it is accepted. A variation of this algorithm is described in Chang and Zakhor [16]. The following example illustrates this method. Again, assume that minRead = 23. Figure 4.3 shows the current server block schedule and the new stream's block schedule. The entire block schedule for an individual stream is combined with the server block schedule for streams which are already admitted. In this case, slot i + 2 would have a value of 26, which is higher than the minimum number of blocks the server can read (23). The new stream must be rejected.
[Figure 4.3: Server Schedule During Admission — the current server block schedule, the new stream's block schedule, and the combined server schedule around slots i-1 through i+2; the combined requirement of 26 blocks in slot i+2 exceeds minRead = 23.]
It is possible that delaying the acceptance of the new stream by shifting the
block schedule for the new stream into the future by a small number of slots could eliminate peaks that caused a rejection. It is equally likely that this shift could produce a peak of equal or greater magnitude than without shifting. Regardless of the effect on the shape of the resulting server block schedule, allowing such shifting prevents the server node from guaranteeing a bound on the time that CmfsPrepare can take and on when the client can begin reading. It also increases the worst case execution time of the admission algorithm by a constant factor, namely the number of slots into the future that the user is willing to wait. This analysis assumes that this number is zero. A complete server block schedule is shown in Figure 4.4. This scenario was a simultaneous request for six of the streams from Table 2.1. This is scenario 101 with the initial set of streams from Appendix B. It is clear that there are many peaks in bandwidth above the average of approximately 22 blocks per slot. For the Instantaneous Maximum algorithm to accept the scenario, a minRead value of 32 is required. This algorithm also provides a deterministic guarantee of delivery, and it can do no worse than Simple Maximum since it performs a more fine-grained evaluation of the schedules. It is still rather conservative and also may reject streams the disk system could deliver.
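A sketch of the Instantaneous Maximum check is shown below; it performs the slot-by-slot addition and comparison described above, ignores the wrap-around of the circular two-hour schedule, and uses illustrative names:

/* Instantaneous Maximum: tentatively add the new stream's block schedule to
 * the server block schedule, starting at the current slot; reject if any
 * slot would exceed minRead, otherwise commit the addition. */
int instMaxAdmit(int *serverSchedule, const int *blockSchedule,
                 int startSlot, int numSlots, int minRead)
{
    int i;

    for (i = 0; i < numSlots; i++)
        if (serverSchedule[startSlot + i] + blockSchedule[i] > minRead)
            return 0;                            /* reject */

    for (i = 0; i < numSlots; i++)
        serverSchedule[startSlot + i] += blockSchedule[i];
    return 1;                                    /* admit */
}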
Theorem 1 The set of streams accepted by Simple Maximum is a subset of the
set of streams accepted by Instantaneous Maximum.
Proof. Let the blocks required by stream S in slot k be defined as Blocks_{S,k}. Assume there is some scenario X which is accepted by Simple Maximum, but is rejected by Instantaneous Maximum. Thus, there must be some slot j in which

    \sum_{s=S_0}^{S_n} Blocks_{s,j} > minRead.        (4.1)

Since Simple Maximum accepts the stream scenario, it must be the case that
[Figure 4.4: Server Block Schedule — blocks required per disk slot for the six-stream scenario, showing the combined Instantaneous Maximum schedule and the cumulative average bit rate.]
    \sum_{s=S_0}^{S_n} B_s \le minRead
[Figure 4.20: Stream Variability: Acceptance Rates for Stagger = 10 Seconds — acceptance results plotted against the percentage of disk bandwidth requested.]

only have shown that the disk cannot read this fast. Thus, this number measured the number of disk reads completed during the slot, although some of them were initiated in the previous slot. Thus, the cumulative average bandwidth measures disk performance more accurately, but has its drawbacks as a performance measure as well. Figure 4.21 shows this in detail for one particular scenario. It can be seen that the cumulative average bandwidth steadily decreases over time, after an initial adjustment.2 There are two factors contributing to the decrease. First, as more streams become active, the amount of seek activity increases, reducing the number of blocks that can be read. Second, the disk bandwidth decreases as the blocks requested are closer to the centre of the drive. Since the large video streams are most often stored contiguously on disks, the bandwidth is smaller for the later portions of the streams because the blocks are closer to the inside of the disk for all streams. The bandwidth necessary for the earlier part of the scenario is available, and later, when fewer streams are reading, less bandwidth is achievable due to this factor. Less bandwidth is needed, however, to keep up with the requirements because of the large amount of read-ahead previously achieved.

2 This is due to measuring inaccuracies, whereby the very first slot may be shorter in duration than all remaining slots.
[Figure 4.21: Observed Disk Performance: Stagger = 10 Seconds — cumulative blocks read per slot over the course of the scenario, showing the cumulative bandwidth and the average bandwidth achieved versus disk slot number.]
When long staggers are used, the variability of individual streams appears to have only a minute effect on the acceptance rate of scenarios. This is because a lot of smoothing takes place with buffering many slots' worth of data for the earlier streams. Thus, they contribute only a small amount to the remaining streams. This must, by definition, require a significantly larger amount of buffer space. The
detailed analysis of buffer space is considered in the next experiment.
Buffer Space Utilization. In order to take advantage of the read-ahead in any
of the preceding scenarios, there must be sufficient buffer space at the server. The buffer space needed is expected to be greater for the high-variability streams in order to smooth out peaks which are above minRead. Additional buffers are used for blocks which are read at a faster rate than minRead, but since all streams have already been admitted, this does not affect the admission process. As the previous results show, a small amount of stagger with moderately short video streams is enough to increase the accepted bandwidth to nearly the level of actual disk bandwidth when the server is modeled with unlimited buffer space. Most of the increase in bandwidth is due to contiguous reading when only one stream is actively reading, during the catch-up phase immediately after stream acceptance. In this situation, there are no seeks required and bandwidth is very high. Some tests showed that the bandwidth was more than 2 × minRead. When the next stream request arrived, the schedule for the existing streams was reduced to 0 for several slots into the future. This reduction served to smooth peaks in the remainder of the schedule. The amount of buffer space required to accept a scenario was calculated by a static examination of the schedule. The largest contiguous area of the scenario's requirements above minRead is found. The blocks referred to by that area in the scenario schedule above minRead must be in server buffers. Otherwise, the server cannot guarantee the delivery of the blocks to the clients, because the disk can only be guaranteed to read at minRead. This can be seen in Figure 4.22, which is a small portion of one particular scenario (Scenario 90 with high-variability streams). The rectangle below minRead when the bandwidth requirement is above minRead accounts for the blocks which can be guaranteed to be read during those slots. The blocks above the rectangle must be transmitted as well. If they cannot be guaranteed to come from the current set of disk reads, they must have been read earlier and
the transmission is satisfied from the read-ahead buffers.
[Figure 4.22: Buffer Space Analysis Technique — blocks required per disk slot for a small portion of the sample scenario, with the minRead = 23 level marked; the area above minRead must be covered by server buffers.]
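The static analysis itself can be sketched as a single pass over the combined scenario schedule. This version implements only the basic rule described above; it does not model the higher transient read rate used later for staggered arrivals, and the names are illustrative.

/* Find the largest contiguous run of slots whose requirements exceed
 * minRead; the excess blocks in that run must already be in server buffers,
 * since the disk can only be guaranteed to read minRead blocks per slot. */
int buffersRequired(const int *scenario, int numSlots, int minRead)
{
    int maxExcess = 0, excess = 0, i;

    for (i = 0; i < numSlots; i++) {
        if (scenario[i] > minRead) {
            excess += scenario[i] - minRead;     /* accumulate within the run  */
            if (excess > maxExcess)
                maxExcess = excess;
        } else {
            excess = 0;                          /* the run of peaks has ended */
        }
    }
    return maxExcess;                            /* in 64 KByte buffers */
}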
The buffer space required for the simultaneous arrivals of low-variability streams and high-variability streams is shown in Figure 4.23. For requests of bandwidth that were significantly below minRead, a small amount of buffer space was needed. Even larger requests needed only slightly larger amounts of buffer space. The largest amount of buffer space needed by low-variability streams was 75 buffers (5 MBytes), when an average of 22 Mbps (or 97% of minRead) was requested. For scenarios of high-variability streams, the largest buffer request required 90 buffers (6 MBytes) and had an average bandwidth of approximately 20 Mbps. Some requests for larger bandwidth required fewer buffers, due to the shape of the request. The
buffer space requirements are not significant when requests arrive simultaneously, as admission into the system is limited by the instantaneous bandwidth required at slots early in the schedule.
[Figure 4.23: Buffer Space Requirements: Simultaneous Arrivals — buffers required versus bandwidth requested for high-variability and low-variability streams, with minRead = 23.]
In the case of staggered arrivals, the pattern of buffer usage is much different, because the order of reading is significantly changed. Buffers are needed to accommodate the high bandwidth achieved when only one stream is actively being read off the disk. The same static analysis procedure can be used for these scenarios, but it must be adjusted in some cases. The analysis assumes a constant disk reading rate of minRead blocks per slot. With staggered requests, the vbrSim algorithm accounts for the extra blocks read during times when the bandwidth has been more than was
guaranteed. Thus, a larger amount of read-ahead is achieved and all the buffers are filled by the time they are needed. The analysis technique cannot model the effect of this increased bandwidth on buffer allocation without knowing the exact number of buffers read in each slot. An approximation can be performed, because the bandwidth achieved in the early part of the scenario is substantially above minRead. Thus, a value of 4 × minRead is used in the simulation when only one or two streams are actively reading. This is at a much higher rate than the disk can read, but does not simulate reading into buffers which do not exist in the particular server configuration. It simply ensures that all the necessary read-ahead is available before buffers are required for the bandwidth peaks. The buffer usage for staggered arrivals is shown in Figures 4.24 and 4.25. The figures show that most requests for bandwidth below minRead use a very modest amount of buffer space. As the request bandwidth increases, the buffer space required increases somewhat linearly. Requests with bandwidth greater than minRead use steadily more buffer space, with the maximum buffer space needed being 4491 buffers (287 MBytes) for a scenario which had a staggered arrival interval of 10 seconds and requested 33.5 Mbps (104% of the achieved bandwidth and 150% of minRead). This is an enormous amount of memory. This scenario was comprised of the 7 longest streams from the high-variability streams, and there was a long substantial peak in the bandwidth required. Since this scenario was accepted, this peak occurred late in the scenario when a great deal of read-ahead had been achieved. Most of the requests with a 5 second stagger can be satisfied with fewer than 1500 buffers. These requests use close to 100% of the disk bandwidth. For the requests with 10 second stagger, 3000 buffers is enough for nearly all the scenarios which can be accepted. Again, these scenarios are all those that request less than 100% of the disk bandwidth. The largest buffer requirement for a scenario that requested less than 100% of the disk bandwidth was 2699 buffers. This scenario requested 99% of the measured disk bandwidth. The scenario which required 4491
Figure 4.24: Buffer Space Requirements: Stagger = 5 seconds (buffers required versus bandwidth requested in Mbps, for high-variability, low-variability, and constant bit-rate streams; minRead = 23)

The scenario which required 4491 buffers had a cumulative request of over 103% of the disk bandwidth. The graphs thus indicate that a server with substantial, but not exorbitant, memory for disk buffers can accommodate requests that require a very high percentage of the achievable disk bandwidth. Scenarios with larger values of stagger can achieve more read-ahead and take advantage of the transient higher bandwidth that accompanies contiguous reading of the disk, and can even accept scenarios that request more bandwidth than is nominally available. The scenarios composed of low-variability streams had only slightly different buffer usage patterns than those composed of high-variability streams. All of the accepted requests for bandwidth below minRead required fewer than 200 buffers.
Figure 4.25: Buffer Space Requirements: Stagger = 10 seconds (buffers required versus bandwidth requested in Mbps, for high-variability, low-variability, and constant bit-rate streams; minRead = 23)

For constant bit-rate streams, requests in the same bandwidth range required even fewer buffers, approximately 2 × minRead, which is the minimum necessary for the server's double buffering. The scenarios with high-variability streams required more buffers than the low-variability stream scenarios for both the 5 second stagger and the 10 second stagger situations. Those scenarios that requested below minRead blocks per slot required up to 400 buffers with 5 second stagger and up to 500 buffers with 10 second stagger, approximately twice what the low-variability streams needed. This is because the peaks are larger and may be of longer duration for scenarios with high-variability streams, and thus more buffer space is required to smooth them out.
For requests above minRead, the linear relationship between request size and buffers required continues. Slightly fewer buffers are needed for low-variability streams than for high-variability streams when the request level is just above minRead. There does not appear to be much difference in the buffer requirements between low-variability streams and high-variability streams when the request level approaches the limit of disk bandwidth. As well, the value of stagger does not seem to cause much change in the number of buffers required for requests of the same size. One of the low-variability requests for 27 Mbps at a stagger of 5 seconds requires approximately 1300 buffers, and the requests of similar size with a 10 second stagger require approximately the same number of buffers. A 10 second stagger results in higher achieved disk bandwidth; this enables more streams to be accepted and requires more buffer space for those scenarios. These scenarios would not be accepted at a smaller stagger value.
Client Buffer Space. The next experiment examined the effect of client buffer space on the admission performance. The extra buffer space at the client permits the server to send ahead at the maximum rate for a longer period of time, based on the rate-based sender-side flow control outlined in Section 3.4. The maximum rate is the rate that is established when the connection is opened. Recall that this policy attempts to send data at the maximum rate until the client buffer is full, subject to the availability of bandwidth at the server, and ensures that client buffers do not overflow. In a single-disk server, the network bandwidth is always sufficient. If the server can send at the faster rate for a longer period of time, then more buffers are available at the server for reading the remaining streams. This read-ahead may provide enough smoothing for additional streams to be accepted by the disk. The client buffer sizes were set at two different values. As mentioned in Section 3.2, the smallest allowable client buffer is the number of bytes required to be transmitted in two consecutive slots. Thus, the values chosen were the minimum required and 32 × the minimum. For the medium rate video streams that were being tested, the actual number of bytes in the minimum client buffer space ranged from 750 KBytes to 4.5 MBytes. Client buffer sizes of 32 × the minimum are much larger than can reasonably be provided by the client machines (24 MBytes to 124 MBytes). Reasonably priced client machines, such as set-top boxes, are likely to have memory capacities somewhat smaller than this range, not likely more than 16 MBytes. Therefore, the tests which were performed exercised the limits of reasonable client buffer sizes and beyond. The scenarios were presented to the CMFS with 2 values of stagger: 5 seconds and 10 seconds. Simultaneous arrival scenarios were not tested because any send-ahead in the scenario is achieved after all admission decisions have been made, so send-ahead has no effect on the admission decision. The 143 scenarios of low-variability streams and 143 scenarios of high-variability streams were submitted to a CMFS. Two separate server configurations were used: the first contained 64 MBytes for disk block buffers (1000 buffers), and the second contained 128 MBytes (2000 buffers). The results showed that in every case, exactly the same streams were accepted, regardless of the client buffer configuration. Even very large buffer sizes at the client could not change the acceptance rates. This is because the client buffer is too small to hold a large enough percentage of the stream to make any substantial difference to the server. At a maximum bandwidth rate of 10 blocks per slot, the minimum client buffer is 20 buffers. A client buffer space of 32 × the minimum is 640 buffers (40 MBytes). This is a very large amount of buffer space, but still less than 15% of what would be required for a 10 minute stream of this average bandwidth (4800 blocks).
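The sizing rule for the minimum client buffer can be illustrated with a short sketch; the helper and its per-slot byte schedule are hypothetical, and the rule itself is the one quoted from Section 3.2 (the data of the two largest consecutive disk slots).

    def minimum_client_buffer(slot_bytes):
        # Smallest allowable client buffer: the largest total number of bytes
        # scheduled for transmission in any two consecutive disk slots.
        # Hypothetical helper illustrating the rule above.
        return max(slot_bytes[i] + slot_bytes[i + 1]
                   for i in range(len(slot_bytes) - 1))

    # For a stream peaking at 10 blocks (64 KBytes each) per slot, two consecutive
    # peak slots give 20 blocks, roughly 1.25 MBytes; 32 x that minimum is
    # 640 blocks (40 MBytes), matching the figures quoted above.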
Request Inter-Arrival Time. In this section, the effect of different values of stagger between arrivals of stream requests is evaluated. The previous tests showed that the effective disk bandwidth which could be achieved by the CMFS depended significantly on the amount of contiguous disk reading that could be achieved.
Contiguous reading occurs whenever a new stream is admitted, because the data for the new stream has an earlier deadline than the next blocks of data which must be read for existing streams. Thus, data from only the new stream is read for several seconds after acceptance. Data for existing streams is already in buffers at the server as a result of read-ahead. The benefit of having a longer time between stream arrivals is that the steady state of the disk is reached before a new arrival occurs. In the steady state, all streams are equally read ahead by a significant amount of time (between 5 and 20 seconds) for a reasonably configured system servicing moderate bandwidth video streams. For example, Section 4.4 shows that a server equipped with 128 MBytes of buffer space per disk can support approximately 6 or 7 streams at an average rate of 4 Mbps each, depending on the shape of the server schedule. That translates into approximately 30 to 40 seconds of read-ahead per stream. The configuration and arrival pattern that would allow the disk to always read at the maximum transfer rate would be a server with unlimited buffer space and arrivals such that the entire stream was read into server buffers before the next request arrived. The disk would never have to perform a seek operation during a slot. As a concrete example, consider video streams with a bit rate which averages 4 Mbps. On a disk which can read at 5 MBytes per second, the transfer rate is 10 times the required rate of playback. Thus a 5 minute stream can be read off disk in 30 seconds, totaling 150 MBytes of data. It would occupy buffer space corresponding to the remaining 270 seconds of playback. This 270 seconds would require 135 MBytes, because the data for the first 30 seconds of presentation would have been transmitted during that period of time. A system would be capable of supporting 10 simultaneous streams (the theoretical maximum) with arrival staggers of 30 seconds. The buffer space required would be at its maximum immediately after all 10 streams have been read; in this state, the reading rate is limited by the number of buffers returned to the system as a result of transmission. Since the earlier streams will have transmitted most of their data,
the occupancy can be calculated by multiplying 15 MBytes by the number of 30 second intervals remaining in the transmission for each stream. The total buffer space required to support such a retrieval pattern is 675 MBytes. This is more than 30% of the capacity of the disk used in this server configuration. It is not reasonable to devote so much memory to server buffers. Since buffer space is likely to be limited to a much smaller fraction of the disk capacity, the disk system will reach the steady state because moderate length streams cannot be stored in their entirety in server buffers. If steady state is reached very quickly and each stream still has most of its data resident on disk, then all the benefit of read-ahead with limited memory at the server can be achieved with a small value of stagger. In this case, increasing the arrival stagger will not enable more streams to be admitted. If steady state is achieved more slowly, then an increased stagger will allow reading to continue based only on the limitations caused by seek activity. More streams may be accepted with longer stagger. The next set of experiments used three stagger values: 5 seconds, 10 seconds, and 20 seconds. These values were considered to be reasonable because they provide a substantial amount of time between user requests. Smaller values of stagger were not considered because they would allow fewer than 10 slots during which read-ahead could take place. Longer stagger values were not considered because, according to the previous analysis, a 5 minute stream can be read in 30 seconds, but requires over 128 MBytes of buffer space. Since this is more than the memory that was available in the hardware configuration, it seems certain that steady state will occur before 30 seconds of contiguous reading. Several server configurations were used as well in the experiments, in an attempt to see how the total buffer space at the server affected the ability of stagger to influence acceptance decisions. The streams comprising the scenarios were grouped according to the length of playback time to see if shorter streams were able to take advantage of the increase in stagger more than long streams.
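The buffer-occupancy arithmetic for the 10-stream example above can be checked directly. The numbers below are the example's idealized assumptions (4 Mbps playback, 5 MBytes per second disk, 5 minute streams, 30 second staggers), not measurements.

    # Each 30 second interval of playback at 4 Mbps is 15 MBytes of data.
    MBYTES_PER_INTERVAL = 15

    # Immediately after the 10th stream has been read, stream i (i = 1 being the
    # oldest) still has (i - 1) intervals of its presentation left, and that much
    # data must sit in server buffers.
    peak_occupancy = sum(MBYTES_PER_INTERVAL * (i - 1) for i in range(1, 11))
    print(peak_occupancy)   # 675 MBytes, matching the total computed in the text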
The short streams ranged from 50 seconds to 3 minutes in length, while the long streams ranged from 6 minutes to 10 minutes in length. Ten randomly selected scenarios were used from the short streams as well as ten scenarios from the long streams. Tables 4.5 and 4.6 show the admission results for the short streams and the long streams respectively. The results show that, for the short streams, increasing the length of the stagger allowed more scenarios to be accepted. With minimal client buffer and 64 MBytes at the server, moving from a 5 second stagger to a 20 second stagger allowed every scenario to have at least one more stream accepted. With 128 MBytes at the server, only 3 of the 10 scenarios could be accepted at a stagger interval of 5 seconds, while with 20 second staggers, all scenarios were accepted.
                           64 MB Server Buffer                      128 MB Server Buffer
Scen  Stms  B/W     B/W Accepted        Streams Accepted    B/W Accepted        Streams Accepted
      Req   Req     5s    10s   20s     5s   10s  20s       5s    10s   20s     5s   10s  20s
  1    7    34.5    24.8  24.8  34.5    5    5    7         24.8  29.2  34.5    5    6    7
  2    7    37.6    27.1  33    33      5    6    6         33    37.6  37.6    6    7    7
  3    7    37.4    26.9  32.7  32.7    5    6    6         26.9  37.4  37.4    5    7    7
  4    6    33      27.1  27.1  27.1    5    5    5         27.1  33    33      5    6    6
  5    6    31.7    27    31.7  31.7    5    6    6         31.7  31.7  31.7    6    6    6
  6    6    31.6    26.9  26.9  31.6    5    5    6         31.6  31.6  31.6    6    6    6
  7    6    31.7    26.1  26.1  26.1    5    5    5         26.1  31.7  31.7    5    6    6
  8    5    27.5    27.5  27.5  27.5    5    5    5         27.5  27.5  27.5    5    5    5
  9    4    21.2    21.2  21.2  21.2    4    4    4         21.2  21.2  21.2    4    4    4
 10    7    34      29.3  29.3  34      6    6    7         29.3  29.3  34      6    6    7

Table 4.5: Short Streams - Admission Results - Staggered Arrivals (bandwidth in Mbps; 5s, 10s and 20s denote the stagger between arrivals)
                           64 MB Server Buffer                      128 MB Server Buffer
Scen  Stms  B/W     B/W Accepted        Streams Accepted    B/W Accepted        Streams Accepted
      Req   Req     5s    10s   20s     5s   10s  20s       5s    10s   20s     5s   10s  20s
  1    7    29.1    24.2  24.2  24.2    6    6    6         24.2  24.2  24.2    6    6    6
  2    7    31.4    24.1  24.1  24.1    5    5    5         28.2  28.2  28.2    6    6    6
  3    7    25.9    22.2  22.2  22.2    6    6    6         22.2  22.2  22.2    6    6    6
  4    6    25.8    22.3  22.3  22.3    5    5    5         22.3  22.3  22.3    5    5    5
  5    6    25.8    22.1  22.1  22.1    5    5    5         22.1  22.1  22.1    5    5    5
  6    6    22.1    22.1  22.1  22.1    6    6    6         22.1  22.1  22.1    6    6    6
  7    6    25.6    20.0  20.0  20.0    5    5    5         20.0  20.0  20.0    5    5    5
  8    5    21.1    21.1  21.1  21.1    5    5    5         21.1  21.1  21.1    5    5    5
  9    4    14.2    14.2  14.2  14.2    4    4    4         14.2  14.2  14.2    4    4    4
 10    7    30.1    25.9  25.9  25.9    5    5    5         25.9  25.9  25.9    5    5    5

Table 4.6: Long Streams - Admission Results - Staggered Arrivals (bandwidth in Mbps; 5s, 10s and 20s denote the stagger between arrivals)

For long streams, there was no difference in acceptance in any of the scenarios. This is not particularly surprising. A 128 MByte server can hold approximately 4 minutes of video data if the display rate averages 4 Mbps, and this can be read in 24 seconds, so a stagger of more than 20 seconds would almost certainly not influence admission. When a second stream of similar bandwidth profile is added, the buffer space will be split among the streams, giving 2 minutes to each stream. Thus, the second stream can read at the maximum rate for 12 seconds, stealing buffers from the original stream to do so, until both streams have been read ahead 2 minutes. This process of sharing the server buffer space continues as additional streams are added. When 6 streams are active in the steady state, between 30 and 40 seconds of data are stored in server buffers per stream. This amount can be read in between 3 and 4 seconds at maximum transfer rates. If a new stream were accepted in this state, it would catch up to the existing streams in terms of read-ahead in about the same amount of time. From then on, the reading rate would be limited by buffer space considerations. For a 64 MByte server, the playback lengths of data which can be stored at the server must be divided by 2, so steady state is achieved much earlier. Why then is there a performance difference when short streams are submitted? This is a case where a large percentage of the stream can be read contiguously and stored at the server in buffers. In a 128 MByte server, all of a 3 minute stream can be stored. It takes approximately 18 seconds to read the entire stream. Thus, a
stagger value of greater than 18 seconds would be superfluous, because there would be no more data to read before the next request arrived, during the time the server is lightly loaded. When two short streams (i.e., 3 minute streams) share the server buffer space, 2 minutes of each stream can be stored, as in the previous case. So far, the analysis is identical to the previous case, and a reasonable amount of data is left to be read for each stream. It changes, however, when 5 or 6 streams are being read and a short stream request (less than 1 minute in duration) is submitted late in the scenario. The data for a 60 second stream can be read in 6 seconds. A scenario with 6 existing streams can buffer up to 40 seconds' worth of data per stream. With a small stagger of 5 seconds, all 6 streams are introduced in 25 seconds and occupy all server buffer space. The new stream would arrive after 30 seconds. The new stream can be guaranteed to read at approximately 6 times the playout rate (23 ≈ 6 × 4 blocks per slot), so it would take about 7 seconds to read 40 seconds' worth of data and thereby catch up to the existing streams. At this point, all 7 streams occupy the server buffer space, and all 7 streams have a considerable amount of data left to read. The first 3-minute stream has read at most 70 seconds' worth of data, 30 of which have been transmitted and 40 of which are buffered; there are still 110 seconds of data left to read. It may be the case that the 7th stream cannot be accepted due to the overall bandwidth still required in the future of the scenario. If the stagger is increased to 10 seconds, then the first 6 streams are not all completely active until 50 seconds into the scenario. The earlier streams have transmitted approximately twice as much data by the time the 7th stream arrives at 60 seconds into the scenario. While this is insignificant for a 10 minute stream, for a 3 minute stream it means that over half of the first stream has been read and/or transmitted, as only 80 seconds remain to be read. The data in the remainder of the schedule is much less than in the case where the stagger was 5 seconds. Shorter streams have an even greater percentage of their data already read off disk. This affects the size and duration of the remaining peaks, so that in more of the scenarios, the 7th stream can be accepted.
4.4.6 Summary

In this section, quantitative performance differences between the algorithms were identified. As well, the effects of different traffic patterns on the admission performance and buffer space requirements were carefully examined with respect to the vbrSim algorithm. The difference between the performance of the deterministic-guarantee algorithms depends on the characteristics of the streams which are submitted to the CMFS. Performance tests on real streams have shown clear quantitative differences. Streams with very low variability which are requested simultaneously produce bandwidth schedules which are devoid of significant peaks. In this situation, with simultaneous arrivals, all three algorithms accept almost the same scenarios. Since the disk is capable of higher performance, all give conservative decisions compared with the optimal algorithm. With variability in the stream bandwidth profiles, there is a substantial difference in acceptance behaviour. The vbrSim algorithm accepts approximately 20% more bandwidth than the Instantaneous Maximum algorithm for mixed-variability streams. Valid scenarios were rejected by the Average algorithm, due to the amount of contiguous reading that was achieved by the disk system. When the average achieved bandwidth was much greater than minRead, Average provided conservative admission decisions, while vbrSim could accept scenarios that requested substantially above minRead blocks per slot. With 10 second staggers between arrivals, vbrSim accepted scenarios that requested nearly 150% of minRead, since the read-ahead achieved in the past was incorporated by the algorithm. A reasonable percentage of valid scenarios that request below minRead in total bandwidth are nevertheless rejected by vbrSim, especially when the requests arrive simultaneously. If the Average algorithm is based on minRead, then the disk system can support such scenarios, because disk performance is greater than minRead when relatively few seeks are required. There is no read-ahead achieved above minRead between arrivals for vbrSim, so the upper bound on acceptance is minRead. The Average algorithm accepted all scenarios with a request below minRead. With simultaneous arrivals, the most bandwidth could be sustained with CBR stream requests, and the admission performance degraded as the variability increased. When stagger was introduced to the arrival pattern, the difference in admission performance between stream types became negligible. Buffer space requirements grew linearly with the size of the request for scenarios that requested more than minRead blocks per slot. For requests slightly below minRead, the high-variability streams needed approximately double the buffer space that low-variability streams required. The experiments designed to test the effect of client buffer space showed no effect on admission performance. Increasing the inter-arrival time permitted more short streams to be accepted, thus elevating the sustainable accepted bandwidth from the disk system.
4.5 Full-length streams

All of the tests performed in this chapter were performed with short streams which are typical of a News-On-Demand environment. It is reasonable to assume that a Video-On-Demand environment, with feature-length video streams of an hour or more in duration, would place somewhat different demands on a continuous media file server. The general shape of the bandwidth schedule of each individual stream would likely be the same, regardless of clip length. When long streams are combined into scenarios, the scenarios have the same general shape. Figure 4.26 shows a selection of 3 minutes of a bandwidth schedule of a scenario with simultaneous arrivals. There are a number of peaks that are of moderate size and duration.
Figure 4.26: Short Stream Scenario (blocks required per disk slot for scenario 30)

In Figure 4.27, a 2000 second (33 minute) scenario is presented which is comprised of the same streams as in Figure 4.26, but concatenated together. A three minute selection of the bandwidth requirements is shown in Figure 4.28. The peaks are of relatively the same size and duration. The three-minute selection taken from the full-length streams was analyzed for buffer space requirements. This segment would require 2462 buffers to smooth out the peaks, which is similar to what is required without repetition. Unfortunately, from the buffer space point of view, the scenario does not end at this point. In particular, the entire half hour of the scenario required 36527 buffers (2.22 GBytes). Since the graph of the entire schedule does not seem to have many large peaks, it must be the case that the peaks are of longer duration.
This scenario indicates that, while the bandwidth is not significantly different between long scenarios and short scenarios, the buffer space requirements may be extremely large for scenarios that have requirements somewhat above minRead, but possibly below the actual performance of the disk. This scenario could not be executed on the server, due to the size limitation on the disk and the lack of a testing client
flexible enough to masquerade this loop as a single request. Therefore, it is unclear whether the scenario could be supported, but the average bandwidth of the scenario is approximately 30 blocks per slot, and most scenarios achieved close to 30 blocks per slot. This one data point is not enough to provide conclusive evidence regarding the characteristics of long streams in general.
Figure 4.27: Looped Scenario (blocks required per disk slot for scenario 30, looped)
Figure 4.28: Short Stream Excerpt (blocks required per disk slot, three-minute excerpt of the looped scenario)

In the moderate length streams, several tests showed that an overall request pattern that is above minRead can be accepted if enough stagger is introduced to the arrival pattern so that the remaining portion of the schedule can be smoothed. Most of the slots during which all streams are active will have bandwidth requirements above minRead for only a portion of the scenario length. The read-ahead achieved during the start-up of the scenario may often be enough to smooth out peaks for a schedule of 10 minutes or less. With long streams, a scenario that has a cumulative bandwidth request above minRead may have bandwidth over minRead for many minutes. This would result in a need for a massive number of buffers to supply the long period of oversubscription. The simple extrapolation from this one scenario indicates that the
ability to read scenarios with cumulative requests above minRead is not sustainable for long scenarios. Admission performance testing with long streams has not been done, due to limitations in the capacity of the disks. Storing a one-hour video stream would use more than an entire 2 GByte disk at the bit rates of the streams being studied. Simulations can give some intuition, but an enhanced hardware environment is necessary to provide more definite conclusions. The behaviour of long streams is included as a potential area of further study.
4.6 Analytical Extrapolation

The vbrSim algorithm has exactly the same performance as the optimal algorithm when the observed disk performance (actualRead) is exactly equal to minRead for every slot. If disk performance is not constant, then the admission decisions are conservative, because the actual disk bandwidth is greater than the worst-case estimate. As the difference between minRead and actualRead increases, the admission decisions of vbrSim diverge from those provided by the Optimal algorithm. The purpose of this section is to provide some analytical discussion of the relationship between this difference and the admission performance of vbrSim. Note that actualRead varies from scenario to scenario, but is within a defined range for a particular type of stream on the disks used in the experiments. There are 3 cases to consider: 1) actualRead = minRead, 2) minRead approaches 0, and 3) 0 < minRead < actualRead. The first two cases are extreme boundary conditions and would never be true in a real server. They provide the limits for the algorithm.
Case 1: The disk performance can be characterized by a single number, the number of blocks read per slot. Define this number to be N. The vbrSim algorithm simulates the reading of N blocks in every slot k. Let the bandwidth requirements of the disk in each slot be defined as req_k for all k. At the end of every slot, the read-ahead is appropriately adjusted as in Figure 4.6. If the read-ahead value is always greater than 0, the set of streams can be accepted. Since the disk will always read N blocks when buffers are available, the read-ahead prediction is exact and no additional read-ahead can be achieved. If the read-ahead drops below 0, the new stream must be rejected. The Optimal algorithm would, however, make the same decision.
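A minimal sketch of this slot-by-slot check follows. It assumes a constant rate of N blocks per slot and ignores the finite server buffer pool that the real vbrSim implementation also respects; the function name and arguments are illustrative.

    def case1_admissible(req, N, readahead=0):
        # Slot-by-slot simulation for Case 1: the disk reliably reads exactly N
        # blocks per slot.  'req' holds the combined block requirement of all
        # streams (new stream included) for each future slot; 'readahead' is the
        # number of blocks already buffered when the request arrives.
        for needed in req:
            readahead += N - needed     # adjust read-ahead at the end of the slot
            if readahead < 0:           # more data was due than could have been read
                return False            # reject the new stream
        return True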
Case 2: If the value of minRead is sufficiently low, then no streams can be accepted by the vbrSim algorithm. Choose minRead = 0. The simulation of the first slot adjusts the read-ahead by 0 − req_k. This is negative for the first slot where req_k > 0, and would result in rejection of the new stream.
Case 3: 0 < minRead < actualRead. Consider a server which has B buffers for disk blocks and is servicing moderate length, high-bandwidth video streams with an average bandwidth of S_av blocks per slot. The experimental results show admission performance for a situation in which actualRead ≈ 1.5 × minRead.
With simultaneous arrivals, the admission performance is determined by bandwidth alone. No scenarios that request over 80% of the achieved bandwidth are accepted by vbrSim. With minRead = 23, this is approximately 21.5 blocks per slot (or 21.5 Mbps). Since the average bandwidth of the streams (str_av) is approximately 4 Mbps, the system can accept between 4 and 6 streams simultaneously. If actualRead is much larger than minRead, the number of streams accepted would be identical (as would the total bandwidth accepted), but the percentage of disk utilization would decrease linearly. With staggered arrivals, the acceptance decision is determined by a combination of bandwidth and buffer space limitations. The experimental environment
has scenarios where the maximum acceptable bandwidth is equivalent to 100% of the disk bandwidth, given moderately large buffer space. The approximate number of streams acceptable in this situation is actualRead / S_av = n streams. The amount of buffer space per stream is B / n = b buffers. This buffer space contains the data for an average of b / S_av slots. If the observed bandwidth is doubled, but minRead remains the same, there may be a slight change in the number of streams accepted. As achieved bandwidth increases, the amount of time necessary to fill the available buffer space decreases, allowing the same number of streams to be accepted at smaller stagger values. Under what circumstances could an additional stream be accepted? An additional stream could be accepted if the new schedule had an overall bandwidth requirement less than minRead for the remainder of the schedule, that is, totalRequirements − B ≤ minRead × scheduleLengthInSlots. As more streams (of the same approximate length) are accepted, only totalRequirements changes. If totalRequirements does not change drastically, the new stream may be acceptable. This implies that the new stream must be of a short duration or low bandwidth (i.e., anything that creates a smaller object). In the steady state, each stream has approximately b / S_av slots' worth of data in server buffers. The value of b decreases as n grows (b = B / n). Thus, less of each stream is in server buffers and more remains to be read at the time a new request arrives, compared with a smaller value of n. Consider the following example. Let t be the average length in slots for each stream. If each stream in a particular scenario has an average bandwidth of 4 blocks per slot (S_av), and there is 128 MBytes of server buffer space (B = 2000), approximately 500 slots of data can be stored in server buffers. If the length of the stream averages 5 minutes (t = 600), then each stream is composed of 2400 blocks.
With current disk speeds and reasonable stagger (less than 10 seconds), Section 4.4 shows that the system can accept between 6 and 7 streams of this size, providing for an average of 500/7 ≈ 71 slots of read-ahead per stream. These 71 slots allow approximately 12% of the data to be stored in server buffers. There are 14400 total blocks of data to be read for 6 streams and 16800 for 7 streams. After 6 streams have been accepted with a 10 second stagger between them, approximately 13000 blocks remain to be read. The number of blocks which would have been transmitted is:
\sum_{i=1}^{6} 20 \cdot 4 \cdot (i - 1) = 80 \left( \sum_{i=0}^{5} i \right) = 1200 \qquad (4.11)
If 7 streams were successfully accepted, then about 15000 blocks would remain immediately after the 7th stream was accepted. Adding an 8th stream of similar characteristics at a 10 second stagger would inject another 2400 blocks into the schedule and impose a schedule length of 600 slots. In the meantime, 480 additional buffers would have been transmitted. Thus, the new schedule contains 16920 blocks for the next 600 slots, for an average requirement of 28.2 blocks per slot. The 8th stream cannot be accepted, no matter how high the actual disk performance was in the past. So, if the disk performance is doubled, then the percentage of the disk bandwidth achievable is divided by 2. Disk utilization becomes a decreasing linear function of averageRead/minRead. For shorter streams, this situation is slightly different, since a larger percentage of the stream can be buffered at the server node. Using the same analysis as in the previous paragraph, 71 slots out of 300 are buffered for 2.5 minute streams (24% of the required data) after 7 streams have been admitted with a 10 second stagger between arrivals. 5920 blocks remain for the existing streams. An additional stream request 10 seconds later means that 5920 + 1200 − 480 = 6640 blocks remain for the next 300 slots. The average requirement is 22.1 blocks
per slot, just enough to be accepted. The effect would be greater as the average playback length of the streams decreases. If streams are 2 minutes long, 48% of the data is held in server buffers, and almost 10 streams can be accepted. This shows that a small increase in the number of streams is possible for short but high-bandwidth video streams for very large values of actualRead. Thus, the amount of bandwidth and the number of streams accepted by vbrSim can increase a small amount if the actual performance of the disk (actualRead) increases from just above minRead to a great deal more than minRead for certain types of streams. These are the smaller and shorter streams which can have most of the stream data in buffers when a new stream arrives, so that the remaining requirements with the new stream added are acceptable by vbrSim. If the streams are large with respect to the buffer size, or have a long playback duration, increasing actualRead cannot increase the acceptance by vbrSim. When actualRead increases, more scenarios become valid, and the Optimal algorithm would accept them. Thus, the percentage of the valid scenarios that vbrSim can accept decreases linearly with the increase in actualRead.
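The arithmetic behind the 8th-stream example can be restated in a few lines; the figures are the approximations used in the text, not additional measurements.

    # Blocks remaining after the 7th stream (about 15000), plus the new stream's
    # 2400 blocks, minus the roughly 480 blocks transmitted during the 10 second
    # stagger, spread over the new 600-slot schedule.
    remaining = 15000 + 2400 - 480      # = 16920 blocks
    avg_required = remaining / 600      # = 28.2 blocks per slot
    print(avg_required <= 23)           # False: above minRead, so the 8th stream is rejected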
Chapter 5
Network Admission Control and Transmission

The second major resource in limited supply in a CMFS is network bandwidth. To manage this bandwidth, a method of reserving bandwidth that prevents overflow while permitting high utilization is desired. In this chapter, the server flow-control mechanism which permits the server to push data to clients without requiring explicit feedback is described in detail. An admission algorithm is developed for the network system that ensures that the requirements of the streams never exceed the capacity of the network interface of an individual server node. In order to have such an algorithm, it is necessary to provide a measurement of the network bandwidth requirements over time. The development and integration of the network characterization and admission control algorithms is the third major contribution of this dissertation. This measurement of network bandwidth requirements is similar to, but distinct from, the measurement of disk bandwidth, for reasons which will be described later in the chapter. The options available to solve the network bandwidth allocation problem are more restricted than for the disk, as the network bandwidth estimate is a fixed upper bound (maxXmit) on performance, while the disk admission utilizes a fixed lower
bound (minRead) on performance. It is not possible to send data at a faster rate than maxXmit. The concept of send-ahead cannot be used to take advantage of excess system bandwidth to the same degree as read-ahead was used for the disk. Also in this chapter, bandwidth smoothing techniques are introduced which alter the resource reservation needs of streams while still providing a guarantee that the data will be sent from the server prior to its deadline. Finally, performance experiments are described that show the benefits of the smoothing methods and the admission algorithm in the context of the CMFS.
5.1 Server Transmission Mechanism and Flow Control

The network component of the CMFS is modeled as a single outflow per active connection from the server node. It is assumed that a fixed percentage of the nominal network interface is available for sending data. From the server's point of view, this is a maximum value (maxXmit), which cannot be exceeded in any circumstance. Another assumption regarding network transmission is that the rates established by the client and the server are guaranteed rates. Once a rate is established for a particular length of time, the server transmits at that rate and the network interface on the client receives at the specified rate. If the network is incapable of supporting that rate, the algorithms presented in this chapter do not efficiently use the server resources. The server continues to assume that all the bits transferred have value to the client. Appropriate responses under these circumstances require the client application to reduce the requested bandwidth to a level that can be supported by the network. The details of client policies to solve this problem are outside the scope of this dissertation. There may be different underlying physical network structures which implement the delivery of data to the client. Much research has been done in characterizing traffic patterns and analyzing the performance of the network itself. The
server has no control, however, over any aspect of this performance, except as to its own patterns and volume of sending. From the network point of view, the value of maxXmit must be a guaranteed minimum, as the factors which influence the throughput of the network are beyond the scope of this dissertation. When the connection is established between the server node and the client, a transmission rate is negotiated. This rate is based on the fact that, in the worst case, all the data for a specific disk slot may need to be sent during that disk slot, rather than at any previous time. The largest amount of bandwidth needed is for the largest disk slot, and this is the minimum rate used. If the bandwidth is not available, then no connection can be established. When data is actually transferred, however, there may not be enough buffer space at the client to receive this amount of data at the maximum rate. Therefore, the sender and receiver require methods to deal with overload at the sender interface, the network itself, or the receiver. This is performed in most network environments via a flow-control mechanism. This
flow control is often implemented by a protocol executed between the sender and the receiver. If the data is being transmitted too quickly, the receiver informs the sender that it cannot receive at the current rate. The sender either stops transmitting or reduces the rate appropriately until informed otherwise by the receiver. In the CMFS, a method which uses requests from the client to implement
flow-control is impractical, because of the unavoidable latency of responses to the requests. These requests are also unnecessary, since the server knows precisely how much data the client will be presenting to the user in each display period, once the presentation has begun. The server is informed of the beginning of the presentation by the start packet (see Section 3.1). Flow-control can then be implemented exclusively at the server. The basic goals of the mechanism were introduced in Section 3.4. This section provides a more detailed explanation of the server operations which implement flow control. A rate-based connection is used, but the server only sends data based on credit issued
by the network manager. Therefore, the channel is not fully utilized if the amount of credit issued is less than what could be sent at the full rate for the entire slot. Once the start packet has been received, the server begins computation of how many bytes have been displayed. A timer thread generates a timing signal once per disk slot. This causes the network manager to examine all the active streams and perform the following actions:

1. Increase the server's estimate of the client buffer space capacity by the amount of data displayed, and therefore consumed from the client's buffers, in the previous disk slot time.

2. Decrease the server's notion of available client buffer space by the amount of data required to be sent in the current disk slot.

3. Issue credit to the Stream Manager for the current disk slot if data must be sent at this time in order to maintain continuity.

4. While there is excess bandwidth at the network interface, find a stream which has unused connection bandwidth, available client buffer space, and the next disk slot of data read ahead. Decrease client buffer space by the amount of data in the next disk slot and issue credit for the stream manager. This is repeated for each stream until either the server bandwidth is exhausted or no stream has sufficient client buffer space to accept more data. This step achieves what is termed "network send-ahead".

In most cases, the network will send ahead to fill up the client buffer space and streams will have no work to do for steps 2 and 3. Send-ahead is limited by the amount of client buffer space; the minimum required amount is the sum of the largest two disk slots. The network manager determines how many presentation units have been consumed by the client application when deciding the amount of credit that should
be issued to the stream manager for each connection. It does this based on the timestamp in the start packet. By knowing the starting time, the server can determine the exact buffer occupancy at the client, assuming the client is processing the stream according to the contract established at CmfsPrepare. It is important to note that no stream will be given credit to send beyond the minimum required as long as there are other streams that are sending their required data. Step 4 in the above procedure attempts to send ahead by an equal amount of time for each stream, within the constraints of client buffer space, server bandwidth, and individual connection bandwidth. In the description of step 4, credit is issued for one disk slot at a time. Credit is only issued for a stream if there are buffers queued for transmission, because all of the credit must be used up by the stream manager in the current disk slot. If credit were left over for a stream because there was no data to send, the connection rate might be exceeded in some future slot.
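The per-slot credit computation described above can be sketched as follows. The stream fields and helper names are illustrative assumptions, not the CMFS interface, and all quantities are expressed in blocks per disk slot.

    def issue_credit_for_slot(streams, max_xmit):
        # One disk-slot tick of the network manager, following steps 1-4 above.
        # Assumed per-stream fields: displayed_last_slot, required_this_slot,
        # client_buffer_free (the server's estimate), rate_limit (per-slot
        # connection rate), and next_readahead_blocks(), which returns the size
        # of the next unsent slot already read ahead (advancing as credit is issued).
        budget = max_xmit
        for s in streams:
            s.client_buffer_free += s.displayed_last_slot   # step 1: space freed by playback
            s.client_buffer_free -= s.required_this_slot    # step 2: space the due data will fill
            s.credit = s.required_this_slot                 # step 3: credit needed for continuity
            budget -= s.credit

        progress = True
        while budget > 0 and progress:                      # step 4: spend leftover bandwidth
            progress = False
            for s in streams:
                ahead = s.next_readahead_blocks()
                if ahead and s.credit + ahead <= s.rate_limit and ahead <= s.client_buffer_free:
                    s.client_buffer_free -= ahead
                    s.credit += ahead
                    budget -= ahead
                    progress = True
                    if budget <= 0:
                        break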
5.2 Network Admission Algorithm Design

There are a small number of possible approaches to determining network admissibility of a continuous media stream. Much previous work has been done on the statistical multiplexing of variable bit-rate data streams on networks such as ATM [45]. Constant bit-rate connections have also been used to transport variable bit-rate data [60]. Statistical approaches cannot provide absolute guarantees, since the entire concept of multiplexing is based on providing low, but non-zero, probabilities of transient network switch overload. Knightly et al. [45] prove that their deterministic-guarantee admission control method significantly improves network utilization above what is available using peak-rate allocation. This approach is only able to achieve a maximum utilization of about 30% in tests of full-length video objects, which is still a fairly low network utilization. This realization has led researchers to investigate statistical admission guarantees or to utilize constant
bit-rate channels and provide smoothing or startup latency. This avoids peak-rate allocation of constant bit-rate channels or failures to establish VBR connections with acceptable qualities of service. In keeping with the philosophy of disk admission control and resource usage characterization, time can be divided into network slots, and a detailed schedule of the bandwidth needed can be constructed in terms of network slot values. A network slot is an even multiple of the disk slot time (e.g., one network slot could be defined as 20 disk slots). By using large network slots, the system can transmit data at a constant rate during a network slot in accordance with the bandwidth values in the schedule. This mechanism is known in other literature as Piecewise Constant Rate Transmission and Transport (PCRTT) [29, 60]. In network environments that provide point-to-point connections with guaranteed bandwidth, establishing a connection requires negotiation between the sender, the receiver and the network components in between. In PCRTT, the transmission rate varies during the lifetime of the connection. This requires re-negotiation to ensure that the new parameters of the connection are still acceptable to all relevant entities [37]. There are two uses for network slots in the transmission subsystem in the CMFS. The first is for establishing the rates for individual connections. PCRTT maintains a constant rate for a period of time, at the end of which a new rate can be negotiated. It is reasonable to make this period of time a constant length and utilize it for the second purpose: admission control. The amount of data required in each network slot can be characterized by some process and used as input to the admission control algorithm. The size of a network slot is significantly larger than a disk slot for two main reasons. The first is the overhead of renegotiation. A renegotiation takes a nontrivial amount of time and, therefore, should be effective for a substantial amount of time. The second is the ability to smooth out the data delivery by sending data at
an earlier time in the network slot than is absolutely required, making use of client buffer space. This client buffer space can be significant, allowing more send-ahead. Other research has experimented with network slots ranging from 10 seconds to 1 minute in length [37, 95]. Zhang and Knightly [95] suggest that renegotiations at 20 second intervals provide good performance. The initial tests in this chapter use 20 seconds as the size of the network slot. Additional experiments are conducted at the end of this chapter to determine an optimal network slot size. The approach taken by the network admission control in the CMFS is to provide a deterministic guarantee at the server and use constant-bit-rate network channels. Data is transmitted at a constant bit rate, subject to buffering constraints at the client, for the duration of a network slot. These constraints are used by the network manager to provide credit to each stream manager for actual transmission of the data. The algorithms that were considered for the disk admission control are also candidates for the network admission control. Obviously, as shown in Chapter 4, Simple Maximum is inferior to Instantaneous Maximum and thus does not deserve serious consideration. Algorithms such as Average will not perform well in terms of correctness, because the estimate of network performance is a fixed upper bound. If admission is based on the average being less than the bound, then it is highly probable that there will be data to be transmitted in slots that oversubscribe the network, causing the server to fail to meet its obligations. It is possible that all slots could be under the bandwidth boundary, but this is very unlikely with variable bit-rate streams. An intriguing possibility is to use vbrSim for the network. This has a number of advantages, including the fact that a uniform admission policy could be used for both resources. The smoothing effect enabled by sending data early could eliminate transient network bandwidth peaks. One of the necessary conditions is that largely differing amounts of data must be sent in each slot for each connection,
corresponding to the particular needs of each stream. This requires that either the network or the server polices the use of the network bandwidth. One major benefit of vbrSim for the disk system is the ability to use the server buffer space to store the data which is read ahead. This buffer space is shared by all the streams and thus, at any given time, one connection can use several Megabytes, while another may use only a small amount of buffer space. The server buffer space is typically large enough to hold the data for dozens of disk slots per stream, when considering large bandwidth video streams. For scenarios with cumulative bandwidth approaching minRead, significant server buffer space is required to enable acceptance. If the same relative amount of buffer space were available at each client, then vbrSim's send-ahead method for the network system could be effective. Unfortunately, the server model only requires two disk slots' worth of buffer space to handle the double buffering. With only the required client buffer, very little send-ahead is possible. Even this amount of buffer space is large compared with the minimum required by an MPEG decoder. According to the MPEG-2 specifications [38], only three or four frames are required as buffer space. Many Megabytes of client buffer would be needed to provide space for on the order of a dozen disk slots' worth of video data. It is not practically reasonable to assume or require client applications to have this level of memory. Another factor to consider is the disk performance guarantee. In order to accommodate the variable send-ahead, the disk must have enough data read ahead to be able to send. In this way, the client buffers can be looked upon as extensions to the server buffer space. The admission control on the network must consider the disk's ability to have the data buffered in time to send ahead. The disk admission only guarantees that the data will arrive in the slot in which it is needed. Thus, an intimate integration between the two algorithms is required. Since the bandwidth of the network is known with certainty, the vbrSim
algorithm for the network would never make a conservative admission decision. Any slot where the required amount of data is greater than the capacity would suffer failure, and would be rejected by an optimal algorithm, because some data would not be sent successfully. Based on its knowledge of the client buffer space and the guaranteed data delivery rate from the disk, the disk system knows precisely which data it can transmit during every slot. Unfortunately, this requires an extra level of bookkeeping. Upon admission, the system must keep track of the latest possible moment at which each data block can be read off the disk and still keep the commitment, so that the network admission control knows how much send-ahead it can perform. All of these enhancements could be made with a reasonable amount of effort, having the network and disk decisions both made on a disk-slot granularity. Unfortunately, this would require network renegotiation on a very frequent basis. Keshav et al. [37] indicate that renegotiation intervals should be relatively long. With larger network slots, the amount of client buffer space for send-ahead shrinks in relative importance. It is most likely that the client buffer would be much smaller than could be transmitted in a network slot. Any attempt at send-ahead would fill the client buffer during the first network slot, and the credit that could be issued would then be limited to the client buffer space freed during each slot by presentation to the user. From then on, very little additional smoothing would be possible. Finally, the other significant performance benefit that vbrSim provides for the disk is contiguous reading, which allows a much greater disk bandwidth to be achieved, especially at the beginning of streams on the outer edge of the disk. The network has no analogous variability in performance. Thus, although vbrSim could theoretically be used as the network admission control algorithm, the limited amount of smoothing between network slots and the constant bandwidth nature of the network interface restricts the performance
improvements possible. In light of this analysis, a straightforward adaptation of the Instantaneous Maximum algorithm is utilized for network admission control. When combined with the network characterization algorithm described in the following section, sufficient send-ahead is achieved even with the large network slots. The network admission control algorithm is shown in Figure 5.1. Each hardware configuration has a maximum number of bytes that it can transmit per second. This value can be easily converted to the number of blocks per disk slot, previously defined as maxXmit, which is the only configuration-specific component of the algorithm. The input is the network bandwidth characterization of each stream (the netBW array) and the length of the new stream (networkSlotCount). It is only necessary to examine the values for the slots in the new stream, using the loop counter netwSlot, since the comparison results for each slot are independent. All slots were below the threshold before the admission test, and so the cumulative network bandwidth schedule for the server beyond the end of the new stream is irrelevant to the admission decision.
    NetworkAdmissionsTest( netBW[][], networkSlotCount )
    begin
        for netwSlot = 0 to networkSlotCount do
            sum = 0
            for i = firstConn to lastConn do
                sum = sum + netBW[netwSlot][i]
                if (sum > maxXmit) then
                    return (REJECT)
            end
        end
        return (ACCEPT)
    end

Figure 5.1: Network Admissions Control Algorithm
For each network slot considered, the network bandwidth requirements for each stream are summed to provide a total network utilization for that slot. If the sum for every slot is less than maxXmit, the scenario is accepted. One particular problem in admission is that a stream request may arrive during the middle of a network slot. Streams must be admitted in the disk slot in which the request arrives. Thus, the first network slot for the new stream is made shorter, so that the network slots end at the same time for all streams.
5.3 Network Bandwidth Schedule Creation

The network admission control algorithm is susceptible to spurious peaks in the bandwidth requirements of individual streams. If such peaks occur in the same network slot for many streams, then a scenario may be rejected when it could easily be supported. These peaks cannot be easily smoothed via sending ahead, so it is important to provide a bandwidth characterization that has a small amount of variability. Both the overall level of the bandwidth reservation and the variability should be minimized if possible. Two simple measures of network bandwidth requirements are the peak and the average bandwidths. Peak allocation reserves bandwidth that is significantly higher than the average. Average allocation is very likely to fail under heavy utilization with variable bit-rate streams. While it is possible that peaks in one stream offset valleys in others, a complex probability analysis is necessary to ensure that the risk of failure is sufficiently minimal. Another reason this method is insufficient is that sending at the average rate for the entire duration of the stream does not ensure that enough data will be present in the client buffer to handle peaks in the bandwidth which occur early in the stream. In this situation, the data arrives late at the client, causing starvation and preventing continuous presentation to the user. Average allocation, combined with a method to calculate overflow probabilities, is the approach taken by Vin et al. [87]. During periods of overload, their system does not attempt to send all the required data.
Rather, the server determines how to judiciously spread the overflow among the active streams so that all users observe an equal degradation in service. Using average allocation and constant transmission rates, starvation can be prevented by prefetching enough media data to allow continuous playback. This introduces start-up latency and requires a large client buffer. Both client buffer size and start-up latency have been parameters in previous research [80]. Significant reductions in either buffer space or latency can be achieved at the expense of increasing the other component, so design trade-offs must be considered. If both of these values are to be kept to a minimum, then an approach which utilizes the VBR profile of the network bandwidth schedule and sends at differing rates is essential. The tightest upper-bound characterization of network bandwidth requirements is the empirical envelope [45], which has been used in much of the previous work in this area. It results in a somewhat conservative, piecewise-linear function, specified by a set of parameters, but requires O(n^2) time to compute, where n is the number of frames in the stream. Approximations to the empirical envelope using multiple leaky-bucket packet delivery mechanisms have been used, providing either deterministic or statistical guarantees of data delivery across the network. Results indicate that the deterministic algorithms improve utilization above peak-rate allocation, but are still overly conservative. In this section, three algorithms for constructing a network allocation schedule (called a network bandwidth schedule) are presented. The following section compares their effect on admission performance. The first algorithm (denoted Peak) uses the peak disk bandwidth value in each network slot as the network bandwidth schedule value. This value is very easy to calculate via a linear pass of the disk block schedule, selecting the maximum within each network slot. The second algorithm, hereafter called Original, considers only the number of bytes that are required to be sent in each network slot independently of every other
network slot. The server knows this amount when the disk schedule is created. The algorithm proceeds as follows. Each network slot is processed in order. Within the network slot, each disk slot is examined in order. The number of bytes in each disk slot is added to the total number of bytes for the network slot. This value is divided by the number of disk slots encountered in the current network slot so far. This provides the cumulative average bandwidth required for the network slot up to this point. This process calculates the cumulative average for every disk slot within the network slot. The maximum of the cumulative averages is chosen as the bandwidth value for the network slot (rounded up to the next highest integer number of 64 KByte blocks). This method enables some peaks to be absorbed, because the server can send at the negotiated rate for the duration of the network slot as long as client buffer space is available. Peaks which occur late in the network slot have marginally less influence on the cumulative average and are absorbed easily. This is shown in Figure 5.2, for a network slot size of 20 seconds and the minimum required client buffer space. Here, the first three large peaks in disk bandwidth are at slots 68, 94, and 136. These peaks do not increase the minimum bandwidth needed at all. Unfortunately, if a peak in disk bandwidth occurs early in a network slot, then the maximum cumulative average for the slot is near this peak, as evidenced by the peaks at disk slots 201 and 241. Overall, this method does reasonably well in reducing the average network bandwidth required, but depends on fortuitous slot boundaries. The server-based flow control policy (see Section 5.1) takes advantage of the client buffer by sending data to the client as early as possible. Since the value used in the network bandwidth schedule is the maximum cumulative average, more bytes can be sent in a network slot than will be displayed at the client. Therefore, at the beginning of every network slot except the first, it is likely that some data will be present in the client buffer. The leftover bandwidth may be larger for streams with highly variable disk block schedules, filling the client buffer to a larger extent.
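The Original characterization can be rendered compactly; the sketch below is an illustrative helper (not the server's implementation), working in blocks per disk slot and producing one reserved rate per network slot.

    import math

    def original_schedule(slot_blocks, slots_per_netslot):
        # 'Original' characterization: for each network slot, the reserved rate is
        # the maximum of the cumulative averages of the disk-slot requirements
        # within it, rounded up to a whole 64 KByte block.
        rates = []
        for start in range(0, len(slot_blocks), slots_per_netslot):
            window = slot_blocks[start:start + slots_per_netslot]
            total, worst = 0, 0.0
            for i, blocks in enumerate(window, 1):
                total += blocks
                worst = max(worst, total / i)   # cumulative average up to this disk slot
            rates.append(math.ceil(worst))
        return rates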
[Figure 5.2 appears here: a plot of blocks required versus disk slot number for the stream TWINS with 40-disk-slot network slots, overlaying the Original network bandwidth schedule on the disk block count.]
Figure 5.2: Network Bandwidth Schedule - Original (Minimum Client Buffer Space)

The final network bandwidth characterization algorithm improves on the second by explicitly accounting for the fact that sending excess data to the client reduces the amount of data that must be sent during the next network slot. Each slot "carries forward" a credit of the number of bytes already in the client buffer at the beginning of a slot. In nearly all of these scenarios with 20 second network slots, there is sufficient excess bandwidth to fill the client buffer in the first network slot. If the precise average bandwidth and the cumulative average (rounded to the next highest block boundary) are very close, however, the amount of send-ahead achieved may be minimal. This improvement reduces the amount of bandwidth that must be reserved for each subsequent slot by the amount of data already in the
client buffer, smoothing the network bandwidth schedule. Thus, this is called the Smoothed algorithm. Variations of this strategy have been presented in other work [96].

The details of the Smoothed algorithm are as follows. Since the client buffer is filled (either partially or completely) at the end of a network slot, the number of bytes in the client buffer at the end of a network slot is counted as a negative bandwidth requirement in the next network slot. The bytes required in each disk slot are then added to the total requirements as in the Original algorithm. Due to the "negative bandwidth" from the send-ahead, the cumulative average is often also negative for the first few disk slots of the network slot. A peak in disk bandwidth that occurs very early in a network slot can thus be merged with the previous network slot, reducing the cumulative average in the current slot. Figure 5.3 shows the smoothed network bandwidth schedule for the stream in Figure 5.2. In some cases, the reduction of the bandwidth schedule value in this manner means that after a particular slot, the client buffer is less full than it was at the beginning of that slot.

A larger client buffer space enables smoothing to be more effective at reducing both the peaks and the overall bandwidth reservation necessary [29]. With more client buffers, it is less likely that a full client buffer will restrict the server's ability to send at the negotiated rate. As well, the fact that network bandwidth peaks can be smoothed by send-ahead means that subsequent network slots will have reduced resource needs. This frees up bandwidth reservation on the network for subsequent requests, but has the potential disadvantage of not fully utilizing the client buffer at the end of every slot. For the purposes of this dissertation, making conservative use of network bandwidth takes precedence over completely utilizing the client buffer, although the more aggressive use of the client buffer is an intriguing area for future study. This would be equivalent to vbrSim for the network, where every byte of the client buffer is used for send-ahead.
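A corresponding sketch of the Smoothed variant follows. It is a simplification under stated assumptions: the credit carried into the next slot is modelled as the reserved rate times the slot length minus the data consumed, capped by the client buffer, and the disk is assumed able to supply the send-ahead (the read-ahead caveat discussed below). The names, including client_buf_blocks, are again illustrative only.

    from math import ceil

    def smoothed_schedule(disk_sched, slots_per_net, client_buf_blocks):
        # Smoothed: like Original, but data already sent ahead into the client
        # buffer at the start of a network slot counts as negative demand.
        sched, credit = [], 0
        for i in range(0, len(disk_sched), slots_per_net):
            chunk = disk_sched[i:i + slots_per_net]
            total, max_cum_avg = -credit, 0.0
            for k, blocks in enumerate(chunk, start=1):
                total += blocks
                max_cum_avg = max(max_cum_avg, total / k)
            reserve = ceil(max(max_cum_avg, 0.0))
            sched.append(reserve)
            # Send-ahead carried into the next slot: what the reserved rate
            # allows to be sent, less what is consumed, bounded by the buffer.
            credit = min(client_buf_blocks,
                         max(0, credit + reserve * len(chunk) - sum(chunk)))
        return sched

The first network slot starts with no credit, so its value is the same as in the Original algorithm; every later slot benefits from whatever send-ahead the previous reservation allowed.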
[Figure 5.3 appears here: the same blocks-required versus disk-slot-number plot for the stream TWINS, overlaying the Smoothed network bandwidth schedule on the disk block count.]
Figure 5.3: Network Bandwidth Schedule - Smoothed (Minimum Client Buffer Space)

One major assumption that makes send-ahead smoothing possible is that the disk system has achieved sufficient read-ahead such that the buffers are available at the server for sending. The vbrSim disk admission control algorithm only guarantees that disk blocks will be available for sending in the slot in which they are required to be sent. In other words, all the peaks in disk bandwidth are properly accounted for at the end of each disk slot. Many of the streams included in the scenarios used in these experiments have several disk bandwidth peak values in each network slot greater than the network bandwidth schedule value. Only a stream that has strictly increasing bandwidth
during a network slot would not exhibit this characteristic. For a disk which is under heavy load, it is possible that the disk peaks which are smoothed by the network bandwidth schedule creation algorithm will not be read off the disk in time to send early. In other words, all of the bytes which are required for a particular disk slot must be read and transmitted in the next disk slot. If this is the case, the network bandwidth schedule value must be increased in order to transmit this peak amount in a single disk slot. For example, if the value in the disk block schedule for a disk slot was D and the value for the network slot was N, such that D > N, then the server would need to have sent D − N of those blocks early in order to keep up with the delivery guarantee.

There are two cases to consider: either the disk has a current read-ahead of less than minRead − D buffers, or the disk has a current read-ahead of greater than or equal to minRead − D buffers. In the first case, if the disk was under heavy load and had not read minRead − D blocks before the current disk slot, it is possible that all D blocks would be read in the current slot and only N of them could be sent in the following slot. This is unacceptable and could lead to starvation at the client, as the server does not know exactly which of the minRead blocks it will read during the current slot. Therefore, the network bandwidth schedule value for the current slot must be increased to D to accommodate the fact that all D of the blocks must be sent in the current disk slot. In the second case, fewer than D blocks of the current disk slot must be read in the current disk slot time, so the D − N blocks in question must have been available in the previous slot time to be sent early. No adjustment is necessary.

To summarize, if the disk has read ahead by at least minRead − D buffers, then the disk is far enough ahead and data can be sent early. Thus, any individual stream which has a disk bandwidth peak which is higher than the current network bandwidth will not have to deliver any blocks which may be read in the current
disk slot. It is necessary that both the bandwidth and buffer space are sufficient to accommodate this read-ahead in every case.

The probability of insufficient read-ahead is very slim, because of the manner in which read-ahead is achieved by the disk subsystem. The disk admission algorithm guarantees that in steady state, the guaranteed bandwidth from the disk is always sufficient to service the accepted streams. In fact, the achieved disk bandwidth is greater than this value, because disk performance is variable and the average performance is somewhat above minRead, as long as there are empty buffers. If a new request fails, the accepted scenario will always have a somewhat lower bandwidth request than the capacity of the disk, due to the large granularity of video objects. The disk system will read as fast as permitted by the buffer space and, in steady state, all buffers are filled. Steady state occurs shortly after a new stream is accepted, as shown in Chapter 4. While this appears to be a tight integration between the disk admission and network admission algorithms, this process merely adjusts the connection rate in the rare case where the disk system emulation has not been successful in achieving guaranteed read-ahead (a small sketch of this adjustment appears at the end of this section).

A further consideration in smoothing is the time granularity of the network slot. If very short network slots are used, then only a small amount of averaging within a slot is possible. The network bandwidth schedule would still contain a great deal of variability, which would result in a greater chance of peak values exceeding maxXmit. If the network slot is too long, then the bandwidth required approaches the average of the entire stream. This prevents the network system from sending at reduced bandwidths during periods of low bandwidth requirements.
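The read-ahead adjustment described above can be expressed compactly. This is one plausible reading of the rule, with readahead[i] taken to be the emulated read-ahead (in buffers) available at the start of disk slot i, and all names hypothetical:

    def adjust_for_readahead(net_sched, disk_sched, readahead, min_read, slots_per_net):
        # Where a disk-slot demand D exceeds the network value for its network
        # slot and the emulated read-ahead is below minRead - D, the smoothed-away
        # blocks cannot have been sent early, so the network value is raised to D.
        adjusted = list(net_sched)
        for i, D in enumerate(disk_sched):
            n = i // slots_per_net
            if D > adjusted[n] and readahead[i] < min_read - D:
                adjusted[n] = D
        return adjusted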
5.4 Network Bandwidth Schedule Creation Performance

In this section, the admission results of the Peak, Original, and Smoothed network bandwidth schedule creation algorithms are compared. The initial performance experiments were implemented with single-disk servers and the results were combined
as though they were executed on a multi-disk server. This was due to limitations in the hardware environment available for experimentation.

The first observation that can be made is that the average bandwidth reservation is significantly greater than the average bandwidth utilization for all of the bandwidth schedule creation algorithms. Table 5.1 shows that this difference is smaller for the smoothed allocation (17% rather than 27%). Thus, it is reasonable to expect that utilizing the Smoothed algorithm will result in the acceptance of many network scenarios that the Original algorithm rejected. The Peak algorithm reserves almost 40% more bandwidth than required. It is also reasonable to expect that very few scenarios will be accepted by the Peak algorithm.
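As a rough check of the percentages quoted above (values taken from Table 5.1, rounding mine):

    136.5 / 96.5 ≈ 1.41, so Peak reserves about 41% more than the required bandwidth;
    122.8 / 96.5 ≈ 1.27, or about 27% more for Original;
    113.3 / 96.5 ≈ 1.17, or about 17% more for Smoothed.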
                           Required B/W   Requested B/W   Requested B/W   Requested B/W
                                          (Peak)          (Original)      (Smoothed)
Average of all Scenarios   96.5 Mbps      136.5 Mbps      122.8 Mbps      113.3 Mbps
Table 5.1: Network Bandwidth Characterization Summary

A more detailed evaluation was obtained by examining these scenarios with respect to the relative amount of disk bandwidth they requested and the corresponding admission decisions. The scenarios were grouped according to the sum of the average bit-rates of their streams. 193 scenarios were presented to a CMFS as if configured with 4 disks and attached to a 100 Mbps network interface. In the first test, each disk had a similar request pattern that issued requests for delivery of all the streams simultaneously. The system with four disks was able to achieve between 110 and 120 Mbps in cumulative bandwidth. The scenario with the largest cumulative bandwidth that the Smoothed algorithm could accept was 93 Mbps, as compared with 87.4 Mbps for the Original algorithm. One major advantage of the vbrSim algorithm is the ability to take advantage of read-ahead achieved when the disk bandwidth exceeds the minimum guarantee.
This is greater when only some of the streams are actively reading off the disk, reducing the number of seeks needed per slot. Thus, more simultaneous video clients can be supported by each disk. When the scenarios are submitted to the CMFS with stagger between arrivals, a greater cumulative load is presented to the network, as almost all of the scenarios can be supported by the disk system. The achieved bandwidth of the disk increases by approximately 10%, resulting in cumulative performance between 125 Mbps and 133 Mbps. Due to the achieved read-ahead, only 9 of the 193 scenarios are rejected by the respective disk systems.

The results of admission for simultaneous arrivals and staggered arrivals are shown in Tables 5.2 and 5.3. Only those scenarios that were valid from both the network and the disk point of view were considered. Even though 184 scenarios with staggered arrivals were accepted by the disk, only 130 of them requested less than 100 Mbps. Figures 5.4 and 5.5 show the same information in a visual form. It is quite clear that smoothing did change the number of streams and bandwidth that could be accepted by the network admission algorithm.

The results show that smoothing is an effective way to enhance the admission performance of the network admission algorithm. A maximum of 80% of the network bandwidth can be accepted by the Original algorithm on simultaneous arrivals, although most of the scenarios in that range are accepted. The smoothing operation allows almost all scenarios below 85% request to be accepted, along with a small number with slightly greater bandwidth requests. As expected, the Peak algorithm accepts very few scenarios, none of which required over 70% of maxXmit. In Table 5.3, we see that combining smoothing with staggered arrivals has a compounding effect in increasing the bandwidth supportable by the server. None of the highest bandwidth scenarios are accepted by any of the network admission algorithms. A few scenarios with a request range of between 80% and 90% can be accepted with the Original algorithm, which is a slight improvement over the simultaneous arrivals case. The Smoothed algorithm accepts nearly all requests
[Figure 5.4 appears here: for each bin of disk bandwidth requested (60-64% through 95-99%), the percentage of scenarios accepted under the Peak, Original, and Smoothed algorithms.]
Figure 5.4: Simultaneous Arrivals: Network Admission

below 90% of the network bandwidth. The reason for this increase is that only a few streams are reading and transmitting their first network slot at the same time. The first network slot is the only one that cannot benefit from pre-sending data and cannot be smoothed. Thus, it is more likely that the peaks in bandwidth for the entire scenario with simultaneous arrivals occur in the first network slot. With stagger, the existing streams are sending at smoothed rates when a new arrival enters the system, meaning lower peaks for the entire scenario. The Peak algorithm shows no improvement whatsoever, which is as expected.

The results of this section show that the simplest smoothing technique (the Original algorithm) reduces the peaks and provides a substantial improvement over the
[Figure 5.5 appears here: the corresponding acceptance percentages for staggered arrivals, plotted by the percentage of disk bandwidth requested for the Peak, Original, and Smoothed algorithms.]
Figure 5.5: Staggered Arrivals: Network Admission

Peak algorithm. The network bandwidth schedule that it produces still has substantial variability. The allocation required is significantly above the average bandwidth of the streams in the scenarios. The Smoothed algorithm provides an even better network characterization that increases the maximum network utilization up to 90%.
5.5 Stream Variability Effects

In this section, the influence of stream variability on the network admission control algorithm and the bandwidth schedule creation algorithms is examined. To evaluate this factor, three configurations of streams were utilized: mixed variability streams,
Pct Band   # of Valid   # Accepted   # Accepted   # Accepted
           Scenarios    Peak         Original     Smoothed
95-100          0            0            0            0
90-94           5            0            0            0
85-89           4            0            0            2
80-84          18            0            1           17
75-79          32            0           19           32
70-74          19            0           18           19
65-69          11            6           11           11
60-64           2            2            2            2
Total          91            8           51           85

Table 5.2: Network Admission Performance: Simultaneous Arrivals (% of Network)
Pct Band   # of         # Accepted   # Accepted   # Accepted
           Scenarios    Peak         Original     Smoothed
95-100          5            0            0            0
90-94          19            0            0            2
85-89          15            0            2           14
80-84          27            0            3           27
75-79          29            0           18           29
70-74          22            0           22           22
65-69          11            5           11           11
60-64           2            2            2            2
Total         131            7           59          106
Table 5.3: Network Admission Performance: Staggered Arrivals (% of Network)

low variability streams, and high variability streams. Each configuration simulated the storing of 44 moderate bandwidth video streams on a number of disks. The video streams were chosen from the list of streams given in Table 2.1. In the high and low variability cases, some duplication of streams was performed in order to have a sufficient number of streams in each configuration. The first configuration consisted of 44 unique streams that had a mix of variability in the scenarios (denoted MIX). The second configuration contained streams with low variability (denoted LOW).
The 25 lowest variability streams were used with 19 replications to complete the 44 streams. The same process was used to obtain the third configuration, with the 25 highest variability streams. As seen in Section 4.4.1, the streams would likely occupy between 4 and 5 disks.

Since the purpose of these tests was to examine only the network admission algorithm, the evaluation was done by simulation only. Hardware limitations prevented each configuration from being stored in a CMFS, so simulation allowed the tests to be done without significant system enhancements and reconfigurations. A five-disk configuration was not readily available and was not necessary to evaluate this aspect of network admission, so the placement on disk of each stream was ignored. The number of disks could range from 4 to 44. In the latter case, each disk would contain only one stream.

Again, 193 scenarios were submitted to the simulation of a multi-disk, single-node CMFS. These scenarios generated between 58 and 135 Mbps of bandwidth (as measured by the sum of the average bit-rate of each stream). The admission results for simultaneous arrivals are shown in Table 5.4. The results for the Peak algorithm are not shown, since it was previously shown to be incapable of accepting stream scenarios above 70% of maxXmit for the initial set of streams.

For the low variability streams, the acceptance rate of complete scenarios in the 80-84% range increased from 52% acceptance to 72% acceptance by using the Smoothed algorithm. The high variability streams did not have any scenarios accepted in the 80-84% range with the Original algorithm, but this increased to 29% with the Smoothed algorithm. In the 75-79% range, the acceptance rate increased from 19% to 93% for high variability streams. The network could admit scenarios of low variability streams at a higher network utilization overall, but smoothing had a more drastic effect on the acceptance rate with high variability streams. Higher variability streams have more peaks to smooth, so this test shows that the Smoothed
algorithm is effective in achieving this smoothing.

                 MIXED                  LOW                   HIGH
Pct        Original  Smoothed    Original  Smoothed    Original  Smoothed
95-100       0/5       0/5         0/22      0/22        0/7       0/7
90-94        0/19      0/19        0/10      0/10        0/10      0/10
85-89        0/15      4/15        0/3       0/3         0/17      0/17
80-84        3/27     27/27       13/25     18/25        0/21      6/21
75-79       18/29     29/29       24/25     24/25        5/27     25/27
70-74       21/22     22/22       21/21     21/21       16/22     22/22
65-69       11/11     11/11       22/22     22/22       25/27     27/27
60-64        2/2       2/2        30/30     30/30        5/5       5/5
Table 5.4: Network Admission Performance: Simultaneous Arrivals (% of Network)

Table 5.5 shows the admission performance for arrivals which are staggered by 10 seconds. When arrivals are staggered and the Smoothed algorithm is used, a majority of requests below 90% are granted admission. Some scenarios of mixed-variability and high-variability streams are accepted in the 90-94% request range. Again, the effect of smoothing is confirmed to be greater for the high variability streams than for the low variability streams (from 5% to 95%, rather than from 56% to 100%, in the 80-84% range). The acceptance rate with low variability streams is maintained at a higher level than with the mixed variability streams for the 85-89% request range. Correspondingly, scenarios of mixed-variability streams have higher acceptance rates than scenarios of high-variability streams. An interesting observation is that the 90-94% request range has the best performance with high-variability streams.
5.6 Network Slot Granularity

While the disk admission control is based on relatively short disk-reading slots, the network admission is based on slots which are significantly longer. Selecting an appropriate slot length may substantially affect the network bandwidth schedule and
                 MIXED                  LOW                   HIGH
Pct        Original  Smoothed    Original  Smoothed    Original  Smoothed
95-100       0/5       0/5         0/22      0/22        0/7       0/7
90-94        0/19      4/19        0/11      0/11        0/10      5/10
85-89        3/15     13/15        0/3       2/3         0/17      9/17
80-84        9/27     27/27       14/25     25/25        1/20     20/21
75-79       20/29     29/29       25/25     25/25       10/27     27/27
70-74       22/22     22/22       21/21     21/21       16/22     22/22
65-69       10/10     10/10       22/22     22/22       26/27     27/27
60-64        2/2       2/2        28/28     28/28        5/5       5/5
Table 5.5: Network Admission Performance: Staggered Arrivals (% of Network)

subsequently the admission performance of the network admission control algorithm. Since Section 5.5 showed that the Smoothed algorithm outperforms the Original algorithm, results are only shown for the Smoothed algorithm.

The network admission experiments in Section 5.4 used 40 disk slots (20 seconds) as the length of the network slot. This value was chosen as a result of other work which used network slots of 20 seconds and 1 minute in duration [95] and suggested 20 seconds was a reasonable trade-off between smoothing and network renegotiation frequency. To establish the validity of the selection of 20 seconds as an optimal (or at least reasonable) choice for the network slot size, the admission performance is compared for network slot sizes of 1/2 second, 10 seconds, 30 seconds, and 600 seconds in addition to the original choice of 20 seconds. The first and last cases are the extreme boundary conditions. The 1/2 second slot is identical to the Instantaneous Maximum disk admission control algorithm. The 600 second slot case is very similar to allocating bandwidth on an average bit-rate basis. This average is not, however, the average of the stream but the maximum of the cumulative average bit-rates from the initial frame (aggregated into disk slot groups). It is influenced greatly by the first significant disk bandwidth peak, as described in Section 5.3. This
is shown in more detail by Figure 5.6, where the first peak is at disk slot 6 and a later peak is at slot 248. Only the first peak significantly affects the cumulative average.
[Figure 5.6 appears here: a plot of blocks required versus slot number for the series labelled vbrnet-1, Cum Ave, and vbr-net20.]
Figure 5.6: Network Slot Characterization

Summary results for the network bandwidth schedule creation algorithms for these slot sizes are given in Table 5.6. As the network slot size increases, each schedule becomes smoother, but the average requirement increases. Note that the averages for the 20 second and 30 second slots are extremely close. The effect of this difference in characterization on the admission results for simultaneous arrivals is shown in Table 5.7. The best performance is given by the 10 second network slot, but there is a very small difference in performance, especially for the high variability streams.
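Using the illustrative smoothed_schedule sketch from Section 5.3, a comparison in the spirit of Table 5.6 could be generated by sweeping the slot length. Here disk_sched and client_buf_blocks are assumed to be defined as before, each network slot is weighted equally, and the exact statistics gathered in the experiments may differ:

    from statistics import mean, pstdev

    # Slot lengths in disk slots (0.5 s each): 1/2 s, 10 s, 20 s, 30 s, 600 s.
    for slots_per_net in (1, 20, 40, 60, 1200):
        sched = smoothed_schedule(disk_sched, slots_per_net, client_buf_blocks)
        print(f"{slots_per_net / 2:6.1f} s: "
              f"avg {mean(sched):.3f}, std dev {pstdev(sched):.3f}")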
Network Slot Length (in Seconds)     1/2      10      20      30      600
Bandwidth Schedule Average          4.065   4.249   4.555   4.551   5.176
Bandwidth Schedule Std. Dev         1.020    .813    .659     .61       0
Table 5.6: Network Bandwidth Schedule Summary for Different Slot Lengths

For this set of streams, no scenario requesting greater than 90% of the network bandwidth can be accepted. It is clear that the worst admission performance occurs for the 600 second slot. The network bandwidth schedules are greatly affected by peaks early in the stream. The most interesting range of requests is the 85-89% range. For the mixed variability and high variability streams, there are several scenarios in that range. It appears that the worst performance, rather than the best, occurs for a network slot of 20 seconds (excluding the 600 second boundary case). The requests in this range are very sparse for the low variability streams. This is unfortunate, since all requests in the next higher range are rejected and most of the requests in the next lower range are accepted. For the 80-84% range, nearly all of the mixed variability scenarios are accepted, excluding the 600 second slot column. For low variability streams, fewer scenarios get accepted than in the mixed variability case.

The table shows that the best admission performance is for 1/2 second slots and that acceptance gets steadily worse as the network slot size is lengthened. The major reason for this behaviour is the simultaneous arrival of the streams. With 1/2 second slots, there is a possibility that the peaks in some of the first few disk slots will be offset by valleys in others, which will reduce the peaks in the scenario. When a longer network slot is used, the value for each stream is very close to that of the first peak, but that value is in effect for the entire duration of the network slot. Since valleys at the beginning of the schedule are not created in the network schedule, there is no possibility for a valley to offset that peak.
Low Variability
Pct        1/2 sec   10 sec   20 sec   30 sec   600 sec
95-100       0/22      0/22     0/22     0/22      0/22
90-94        0/10      0/10     0/10     0/10      0/10
85-89        2/3       0/3      0/3      0/3       0/3
80-84       25/25     20/25    18/25    16/25     15/25
75-79       21/21     21/21    21/21    21/21     21/21

Mixed Variability
Pct        1/2 sec   10 sec   20 sec   30 sec   600 sec
95-100       0/6       0/6      0/6      0/6       0/6
90-94        6/19      1/19     0/19     0/19      0/19
85-89       13/15      9/15     4/15     5/15      0/15
80-84       27/27     26/27    27/27    25/27     16/27
75-79       29/29     29/29    29/29    29/29     29/29

High Variability
Pct        1/2 sec   10 sec   20 sec   30 sec   600 sec
95-100       0/7       0/7      0/7      0/7       0/7
90-94        0/10      0/10     0/10     0/10      0/10
85-89        5/17      3/17     0/17     0/17      0/17
80-84       11/21     14/21     5/21     4/21      1/21
75-79       27/27     27/27    25/27    26/27     17/27

Table 5.7: Network Admission Granularity: Simultaneous Arrivals (% of Network)

For many streams, the largest network slot value occurs in the first slot, as it is not possible to smooth the bandwidth from prior network slots. When all stream requests are submitted simultaneously, this nearly always results in the highest bandwidth request for the entire scenario occurring in the first network slot.

A clearer benefit of a longer network slot is seen with staggered arrivals. With 1/2 second slots, approximately the same occurrence of peaks and valleys would occur regardless of any particular staggering effect. In the longer network slot cases, the Smoothed algorithm generates smaller values for all but the first slot. This should produce a reduction in peaks when some of the streams are sending at smoothed data rates. The results for a stagger of 10 seconds are shown in Table 5.8. The only request bracket where the admission performance shows significant
variation in all types of streams is the 90-94% range. In this range, the 10 second network slot performs significantly better than the 20 second and 30 second network slots, except for the high-variability streams, for which the 10 second and 20 second slots perform identically. For high variability streams, the 20 second slot is better than the 30 second slot, but for the mixed variability streams, the 30 second slot outperforms the 20 second slot. These results are quite inconclusive, partially due to the small number of scenarios in that request band. A more exhaustive analysis should show some clearer trends.
Low Variability
Pct        1/2 sec   10 sec   20 sec   30 sec   600 sec
95-100       0/22      0/22     0/22     0/22      0/22
90-94        0/10      7/10     0/10     0/10      0/10
85-89        2/3       3/3      3/3      2/3       0/3
80-84       25/25     25/25    25/25    25/25     15/25
75-79       21/21     21/21    21/21    21/21     21/21

Mixed Variability
Pct        1/2 sec   10 sec   20 sec   30 sec   600 sec
95-100       0/6       0/6      0/6      0/6       0/6
90-94        6/19     13/19     4/19     6/19      0/19
85-89       13/15     15/15    13/15    13/15      0/15
80-84       27/27     26/27    27/27    26/27     16/27
75-79       29/29     29/29    29/29    29/29     29/29

High Variability
Pct        1/2 sec   10 sec   20 sec   30 sec   600 sec
95-100       0/7       0/7      0/7      0/7       0/7
90-94        0/10      5/10     5/10     1/10      0/10
85-89        5/17     14/17     9/17     9/17      0/17
80-84       11/21     20/21    20/21    21/21      1/21
75-79       27/27     27/27    27/27    26/27     17/27

Table 5.8: Network Admission Granularity: Staggered Arrivals (% of Network)

It should be noted, however, that the results for the 1/2 second slot case show no improvement with staggered arrivals. Since there is no smoothing performed in this case, the individual peaks in bandwidth of each stream have the same probability
of occurring in any slot. The shape and peak values of the server network bandwidth schedule should not be very different between the staggered and the non-staggered case. To verify this conjecture, the network bandwidth schedules for a small number of scenarios were compared. Twenty-six scenarios from different configurations were analyzed. The difference in the total cumulative network block requests was very small. A total of 13 more blocks were required in the staggered case, a negligible increase in the bandwidth required. When this was averaged over the 208 streams which comprise the 26 scenarios, it amounted to an addition of .06 blocks per network slot. Although this did not indicate exactly which scenarios were accepted, it did indicate that the number of accepted scenarios in each band will not change significantly if stagger is introduced in the 1/2 second slot case.

From these results, it appears that a 10 second network slot provides a good balance between the acceptance rate, with both simultaneous arrivals and staggered arrivals, and the need to minimize the overhead of re-negotiating network bandwidth reservations. Further work to determine an optimal network slot size for each stream type could be a promising area of refinement. The initial results for these stream types indicate, however, that the admission performance will be better with smaller network slot sizes, rather than larger ones. This is somewhat surprising, given the initial intuition that longer network slots could smooth the schedule more effectively. A reasonable conclusion is that smoothing is indeed effective, and that the sooner in the schedule the smoothing takes effect, the better the performance results.
5.7 Network Admission and Scalability

The results of Section 5.4 enable an additional aspect of the CMFS design to be evaluated: scalability. Although this was not a goal of that particular test, it was observed that the initial configuration of the server with four disks could not saturate the network interface. One aspect of scalability is the manner in which components
can be added to achieve scalable performance. It is desirable that the disk and network bandwidth scale together. In the configuration tested, 4 disks with minRead = 23 provided 96 Mbps of bandwidth with a network interface of 100 Mbps. At this level of analysis, it would seem a perfect match, but the tests with simultaneous arrivals did not support this conjecture.

The tests showed that with simultaneous arrivals, a system configured with guaranteed cumulative disk bandwidth approximately equal to nominal network bandwidth was unable to accept enough streams at the disk in order to use the network resource fully. There were no scenarios accepted by the disk that requested more than 94% of the network bandwidth. In Table 5.2, there are only 4 scenarios in the 85-89% request range that were accepted by the disk system. In Table 5.3, there were 15 such scenarios. This increase is only due to the staggered arrivals, as the same streams were requested in the same order. When staggered arrivals were simulated, the network admission control became the performance limitation, as more of the scenarios were accepted by the disk. There were no scenarios that requested less than 100 Mbps that were rejected by the disk. This arrival pattern would be the common case in the operation of a CMFS. Thus, equating disk bandwidth with network bandwidth is an appropriate design point which maximizes resource usage for moderate bandwidth video streams of short duration if the requests arrive staggered in time.
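The 96 Mbps figure follows directly from the disk guarantee; as a back-of-the-envelope check, assuming 64 KByte blocks and 0.5-second disk slots as used elsewhere in this chapter:

    4 disks × 23 blocks/slot × 2 slots/s × 64 KByte/block × 8 bits/byte ≈ 96.5 Mbps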
Chapter 6
Related Work

Recent work in multimedia computing systems has been very extensive. Two large surveys by Adie [2, 3] indicate the wide-spread interest in distributed multimedia, both in the research and commercial communities, and specify the scope and focus of many projects. Several of these projects encompass a very wide focus and transcend the issues involved in continuous media server design. The existing research can be categorized as follows: server implementations, operating system level support for multimedia, data storage layout optimizations, and simulation and/or analytical evaluation of resource reservation mechanisms for the disk and network systems.

There are several server design and implementation issues raised in previous research efforts. A key question to consider when evaluating the existing work and its appropriateness to the design of a general, heterogeneous, continuous media file system is the level of abstraction of certain components of the system. In some work, the details of the user interface are completely ignored, while in others the disk block layout is not discussed in any detail and the focus may rather be on network issues. Each decision is appropriate for the specific environment considered, but at the expense of an accurate and realistic model that incorporates the heterogeneity and scalability that is desired.
The remainder of this chapter discusses the approaches and contributions of other research to the issues involved in the support and development of multimedia systems, and in particular, servers designed for continuous media.
6.1 Complete System Models

One deficiency in most of the previous work is the lack of integration of the specific algorithms or hardware features into a complete system model. Complete system models have been derived in only a small portion of the previous work. In particular, Anderson et al. [5], Kumar et al. [47], Hsieh et al. [56], Heybey et al. [42] and Little and Venkatesh [55] have provided model descriptions which are general enough to accommodate a scalable, heterogeneous server. There have also been several complete server systems implemented, most notably IBM's Tiger Shark File System [40] and Microsoft's Net Show Theater Server [64], which is based on the Tiger Video FileServer [9]. Even at that level, some important aspects of a complete model, such as scalability or variable bit-rate awareness, are left out.

Complete models for the lower-level support of multimedia are also provided. These models do not correspond to particular server implementations, but do discuss relevant design issues. Little and Venkatesh focus primarily on the user query interaction and do not consider the real-time interface component. Anderson et al. base their work on a model that does not consider scalability, although they do recognize the need for a powerful stream control interface, which has similar expressibility to that provided by the CMFS described in this dissertation. Hillyer and Robinson have a system model that matches Anderson in many ways, but their focus is on more general system issues, including, but not limited to, continuous media.

A server which has a system model that is similar to the CMFS is Calliope [42]. It contains facilities for extensibility and uses off-the-shelf hardware to achieve a scalable system. Their server design consists of a co-ordinator and multiple media storage units (MSUs). This corresponds well to the design of an administrator and
server nodes as in this dissertation. The system developed by Kumar and Kouloheris [47] has many architectural features similar to those of the CMFS. They focus on storing network packets at the disks and bypassing the CPU entirely once transmission has begun on a stream. They do not focus on admission control specifically and include only a very brief model for the user's interaction with the system. Largely differing sizes of streams and fast-motion delivery complicate their scheduling process significantly. Message passing via a reliable connection is used to establish rates for bandwidth consumption by the client application for delivery of presentation objects.

An important component in Calliope is the interleaving of the delivery schedule and the media content in a single file. Thus, this model is incapable of handling fast-forward and reverse on-line because the delivery schedule is not available separately. Off-line filters are provided which create a fast-forward version of the stream. The model does not deal with the user's interface or the details of admission control beyond the specified bandwidth consumption rate, which is constant for a stream, even if it contains variable bit-rate data.

Hsieh et al. [56] provide considerable detail on the specific implementation of their model, but do not clearly distinguish between the principles behind the model and the instance provided by their particular hardware environment. They perform extensive experiments on the number of supportable clients, but do not describe any mechanism by which bandwidth can be reserved and admission to the system controlled.

In the Tiger Shark File System [40], support for continuous-time data is provided by admission control and disk scheduling. They give no detail on the admission control and use deadline approaches to retrieving disk blocks. Striping for increased bandwidth is essential and retrieval across the network is performed by the client "pulling" data using traditional filesystem calls. Each file can be striped across a very large number of disks (possibly all the disks in the system).
Bandwidth is reserved for clients reading at a fixed rate, implying only constant bit-rate clients can be supported. In this system, an end-to-end model is provided, including mechanisms for replication and fault tolerance, but it lacks the flexibility needed for efficiently dealing with complex user interactions with VBR streams.

In the Tiger Video Fileserver [9] by Microsoft, a similar end-to-end model is provided. The components which make up the server entities (tiger controllers and cubs) are similar to the notion of an administrator and server nodes. They claim that the server scales arbitrarily by striping each file across each disk in the entire system. This provides the ability for extremely high bandwidth on popular objects, if the access is staggered in time. For example, over 7000 users can view the same two-hour movie on a system with over 7000 disks, if they are spaced equally distant from each other in time (i.e. request delivery with one second stagger). Their distributed scheduling mechanism [10] ensures that admission of requests for constant bit-rate streams will not overload the system. Unfortunately, they are extremely conservative in requiring that no two users ever request data from the same disk during the same slot, eliminating seeks during slots. The major focus of this system is high availability and low-cost scalability, but it is, in fact, quite over-engineered. Although the system uses off-the-shelf PC hardware, it requires more resources than necessary because of the manner in which it performs allocation and scheduling. The fault-tolerant capabilities also increase the amount of hardware required. The CMFS attempts to maximize the number of users that can retrieve objects from a particular disk, by examining the detailed requirements of each stream, and by intelligently allocating the disk bandwidth resource in a time-varying manner. It is unclear what kind of fragmentation problems and differing utilizations occur on either Tiger Shark or Netshow when highly variable streams are utilized. Netshow is heterogeneous in that it does not matter what kind of encoding is used in the streams, but the clients that have been written only provide MPEG encoding.
Other formats have been used in testing, but no performance figures are given for the other formats.

System models for server support are described in reasonable detail. While these systems do not describe servers, they give detailed analysis of the relevant performance and data delivery mechanism issues. The goal in these systems is to provide an infrastructure at the operating system level that permits a number of differing multi-media applications to be implemented. Tierney et al. have developed a system [84, 85] that has its major use in the storage and display of large images. It is a lower-level approach that is also claimed to be capable of supporting continuous media. The storage system is capable of transferring data at an aggregate throughput rate of hundreds of Mbps, and is designed for the support of visualization environments in conjunction with the MAGIC gigabit testbed. This configuration is used as the basis for exploring system design issues. This design model is similar to the Zebra striped file system [41], because of the distributed nature of the organization and striping of the data, but contains more special-purpose design. The main goal is to enhance throughput, but not to provide redundancy. An image server that distributes tiles of an image across video data servers on a network is the most significant application considered. This application is somewhat similar to the CMFS, so the general system design issues are relevant. Some applications considered require reliable transmission of the images they request, so images which are incorrect or misordered are continually requested until their real-time deadline has passed. This requirement is quite different from that of the CMFS.

A general model for distributed multi-media support is described in Mullender et al. [66, 52]. Mullender et al. [66] provide a holistic approach to scheduler support for processing multimedia data in real-time by using a special micro-kernel called Nemesis. Resources can be allocated to applications within small time windows, but generally, the application must adapt to the resources it is given. Entities similar to user-processes (called domains) are activated based on a weighted
scheduling discipline. If resources remain at the end of a particular scheduling interval (analogous to a slot), they are shared among the domains. Earliest-deadline-first is the scheduling algorithm chosen for the remaining resources. This model considers other workloads in a system besides continuous media, so it does not provide the strict guarantees desired by a CMFS, but supports the operational models that would be used in most continuous media applications.

The UBC CMFS provides a total system model, but at a higher level of abstraction, and stays away from low-level operating system details. The real-time scheduling algorithm used in the system is earliest-deadline-first, which has been shown to be optimal if the requirements of the tasks are less than the capacity of the system.
6.2 Synchronization of Media Streams

A system that provides flexible user access to continuous media data must permit synchronization of streams at either the client or the server. Synchronization of multiple media streams has been a large topic of research which is addressed at various levels. Models of synchronization which deal with multi-media hyper-documents involve complex temporal relationships between continuous and non-continuous objects and are outside the scope of this dissertation. Detailed discussions can be found in Li and Georganas [53] and Bulterman and van Liere [11]. The level of client synchronization addressed by the CMFS is that which is needed for synchronization of a single video stream (or multiple resolutions of a single video clip stored as separate streams in the case of scalable video encoding) and a single audio stream, with optional synchronized text streams.

In Anderson and Homsy [4], the synchronization mechanism is a logical time system (LTS), which provides time-stamps for the data units. Software processes or hardware devices at client workstations deal with time skew by skipping presentation units to speed up a stream which is slow in decoding/displaying and/or pausing the
presentation of one or more of the streams while waiting for data for a tardy stream. This is to enable peripheral presentation to be kept in synchronization. The server provides data in an earliest-deadline-first manner to support the synchronization effort at the client station. Since data is time stamped, the client knows what the display time of each presentation unit should be and attempts to keep all media streams as close to the presentation rate as possible. The client application is capable of specifying an allowable skew so that the server's delivery requirements may be somewhat relaxed when the skew tolerance is larger.

Rangan et al. [75] address the problem of synchronization by proposing a feedback mechanism to inform the server of inter-media asynchrony at the client machine so the server can adjust the delivery rate. The temporal relationships are stored as relative timestamps and one stream performs as a master of the other streams (slaves) which are concurrently presented. This causes the slave streams to pause presentation or skip presentation units to remain synchronized. Significant detail is provided regarding the server's interpretation of this feedback. The concept of master-slave streams can be used by client applications of the CMFS, but the lack of a feedback channel eliminates the server's direct involvement in this part of the synchronization.

In Chiueh and Katz [18], multiple resolution files are stored and retrieved separately so that the desired bandwidth can be achieved. This introduces the need for synchronization of the components of a video stream, but the mechanism for providing the synchronization is not discussed in detail. It is acknowledged that the retrieval of the data for the same display period must be done in a parallel fashion for reassembly and decoding at the client. The Continuous Media Player [78] uses a time-ordered queue to synchronize audio/video packets at the client and utilizes adaptive feedback to match the data delivery with the client station's capability. This method calculates penalty points for the clients and has the server adjust the frame rate according to accumulated
penalty points.

Synchronization is a particularly difficult problem in systems that delay stream acceptance in order to reduce bandwidth needs (such as [77, 61, 33]). These systems also attempt to limit the startup latency. If multiple streams must be coordinated at a client, then small time bounds on latency are necessary so that the detailed synchronization can be achieved. For example, a 30 second start-up latency for a video stream may make it impractical to retrieve a corresponding audio stream at the same time. Extra scheduling mechanisms are necessary to know when to request that audio stream so that it arrives at the proper time for synchronization with the video stream.

As mentioned previously, the CMFS does not utilize feedback for synchronization. Once the presentation has started, the client must deal with asynchrony. The real-time guarantees of delivery ensure that the data will be sent in time. Once the client knows the latency of the network connection, it can adjust prepare times so that the presentation can begin at the appropriate time for synchronization.
6.3 User Interaction Models

With the underlying support for synchronization, the user interaction models can be further developed. A subset of VCR functions is typically provided, which allows the user to request continuous media streams. The simplest systems provide only playback and stop. A more sophisticated model that includes pause, fast-motion and slow-motion (in both forward and reverse directions) and random start positions is more useful for interacting with continuous media.

Some system descriptions provide playback only [89, 86, 56, 14, 74, 70, 57]. This playback may be at a different rate than the recorded rate, so some amount of variable speed is considered, but not in any detail. In much of the work that focuses on disk layout, the ability to fit a maximum number of requests on a certain number of disks assumes a certain speed of playback (i.e. full motion). One user
cannot alter the playback rate in a scenario without affecting the data retrieval for all other users. In systems that provide a little flexibility in terms of playback rates, the number of frames per second desired/achievable is used to inform the server of the data rate required by the stream. Yavatkar and Lakshman [94] use a rate-adjustable priority scheduling mechanism to provide an average frame delivery rate to a client application. Thus, variable speed playback directly affects the data rate required from both the disk and the network. Little and Venkatesh [55] provide fast forward and rewind temporal access control in the user interface, but do not describe how the server implements these functions.

Dey-Sircar et al. [24] consider the implementation of fast motion by sending data at a higher rate if possible, or by adjusting the rates of all fast motion users in a synchronized fashion so that the bandwidth constraints of the server are not violated. One of the options considered is to deny fast motion service to a user until the bandwidth is available, but to continue normal motion data delivery in the meantime. This is also the viewpoint taken by Lau and Lui [50], whereby the user selects the amount of time for which fast motion is desired. The resumption of normal display is then considered as a new request. Fast motion options are given, but not elaborated upon.

Providing fast motion display by skipping data segments is an increasingly common alternative. Thus, the average data rate required is not significantly altered by the fast motion request. Chen et al. [17] provide a disk layout procedure to balance the load on the disks while retrieving and transmitting some percentage of the segments stored at the server, where segments are defined to be media-type-specific amounts of continuous media data. The unique requirements associated with fast motion and its relation to batching and buffer sharing are addressed by Kamath et al. [43]. They propose skipping segments as well and consider the effect of skipping data on their
sharing schemes. Ozden et al. [71] consider sending complete independent sequences (MPEG I-frame sequences in their discussion), but require the server to be aware of when a new I-frame has been encountered in the data stream. They give a considerably detailed discussion on how to improve the performance during fast motion retrieval, including the effect on buffer space, and the possibility of storing a fast motion version of the stream.

Rangan and Vin [73] distinguish between a destructive pause and a non-destructive pause operation. The non-destructive pause stops delivery and reading of continuous media, but still reserves buffers and bandwidth at the server for the anticipated resumption of playback. The CMFS implements a destructive pause because of the uncertainty regarding the amount of time that the display may be paused. The CMFS allows fast motion both by an increased presentation rate, which increases the bandwidth used at the server, and by skipping data segments (called sequences in this dissertation). The option of providing a non-destructive pause has not been considered, because the pause length would have to be very short to avoid causing buffering problems at the server. In particular, a non-destructive pause of indeterminate length cannot work in the model of the CMFS, because it changes the timing of when buffers are made available for read-ahead at the server. The availability of these buffers is relied upon by other streams. It would also affect the server's send-ahead flow control mechanism adversely. In the steady state, this would simply reduce the level of read-ahead, but it may invalidate a previous admission decision if that buffer space was required for any existing stream.
6.4 Scalability

Since the performance of individual hardware components makes a centralized file server impractical for meeting the needs of a large and diverse user population, methods of achieving scalability have been considered by many research projects. The major
prototype systems that have been developed are of a specific scale, and cannot be incrementally expanded. In particular, the server by Starlite Networks [86] is built up of a collection of disks in a personal computer environment and is capable of serving twenty 1.2 Mbps video users. None address the ability to combine server components into a server that scales arbitrarily. Commercial video server products by manufacturers such as Oracle [51], SGI [67], and others have provided high-speed super-computer hardware technology and parallel disk drives for the purpose of delivering high bandwidth video streams, but have not provided evidence that they have addressed the fundamental issues of server design from the point of view of the variable bit-rate requirements of the streams.

In the existing literature, several simulation experiments have considered the issues of performance and admission guarantees in large systems ([50] uses 200 disks, [75] uses 120 disks). These show the levels of bandwidth required to support a large number of users with a large selection of streams, but do not address the difficulties in building a system of that size. Some systems provide a design based on a server array. In Bernhardt and Biersack [6], a single video is striped over a subset of the server nodes. They claim that a server should be capable of storing several thousand objects and attempt to deal with the load balancing issue by evenly striping the data across a large percentage of the disks. It appears that load balancing operations may dominate the activity in such a system because the reorganization task is reasonably complex.

Another method of providing scalability is to have tertiary archival storage that increases the content available in a system. Systems that incorporate this storage are known as video-jukeboxes. Systems such as these attempt to limit startup latency by retrieving part of the stream directly to server buffers for transmission and the remainder to disk. Archival storage is not an issue which is directly examined in this dissertation, but the model does incorporate the ability to perform migration of continuous media streams. Migration could be initiated from an archive server
to keep the contents of server nodes up-to-date with recent request patterns.

The work that deals with operating system enhancements has shown that file system facilities are scalable (a goal of terabyte file capacity in [66]). The simulations in [24] indicate a large system (that would require scaling of smaller systems), but offer no mechanisms by which these smaller systems can be combined. Anderson's CMFS [5] performs some simulation experiments which indicate that the limits of scale they are willing to consider are in the range of several dozen simultaneous users. When Crimmins studied video conference performance over Token Ring [19], the limits of the system are quickly reached by using all of the physical network medium, and no concept for scaling beyond that is considered. Little and Venkatesh [55] mention the issue of scale, but their work does not specifically provide any solutions.

Linear scalability is achieved in Tiger Shark [40] and Netshow [64] by simply adding disks or processing nodes. The concept of scalable server nodes, called MSUs, is also provided in Calliope [42] and by Kumar et al. [47]. The CMFS uses both of these methods to increase storage capacity and bandwidth. The number of disks on a server node is limited by the network interface, but then server nodes can be added until the administrator database is saturated. Then servers can be confederated with a location service [46].
6.5 Real-Time Scheduling/Guarantees

By definition, continuous media requires real-time constraints on the retrieval and delivery of the media data. The approaches to providing the real-time semantics vary from providing statistical guarantees that a certain percentage of data will arrive correct and on-time, to hard real-time guarantees for some classes of data and soft real-time guarantees for others.

Real-time scheduling methods are desired by many researchers but only implemented by some. Tierney et al. [85] recognize the need for OS support for
deadline-driven data delivery at higher rates of success than UNIX-based systems, which cannot provide this facility. Their initial system does not provide such guarantees. The simulations and prototypes by Rangan and Vin [74, 75] provide statistical real-time guarantees of delivery of data, as does the admission control algorithm described by Vin et al. [87]. In such a system, it suffices to allocate the data loss equitably among the users within a given service round, as well as over longer periods of time. If data loss is small enough, the clients will be satisfied, as this loss will be indistinguishable from network loss and not appear as a chronic inability of the server to deliver data. This requires the server to be able to distinguish data which causes great disruption at the client (e.g. an MPEG I-frame) from less important data in an effort to distribute the effective frame loss to the application. Vin et al. [87] claim a 200% increase in the number of admissible streams with this method over the conventional worst case assumptions.

Chiueh and Katz [18], as well as Lau and Lui [50] and Tobagi et al. [86], provide systems that offer strict guarantees of delivery of continuous media data, but do so by delaying the servicing of a stream until the bandwidth can be guaranteed. Start-up latency can be significant. Average waiting time is the performance metric used, but the order in which streams are accepted is unclear. If streams are treated in a First-Come-First-Served manner, then a large bandwidth and/or long stream may prevent several smaller streams from immediate admission. A bandwidth-based policy (perhaps minimum bandwidth first) could indefinitely starve those large bandwidth streams. Both of these policies distort the waiting time measure as an accurate reflection of system performance. The resource server of Little and Venkatesh [29] establishes real-time connections for streams "...to ensure they can support movie playout for the entire duration..." and engages in QOS negotiations with a client application. This implies a hard real-time guarantee, but explicit consideration of graceful degradation of service is given, so it is unclear what type of guarantees are provided.
Real-time scheduling of disk and network activity is done in many systems [40, 45, 52, 66]. The most common method is Earliest-Deadline-First, which is shown to be optimal if the resource requirements can be met. The simulations in Reddy and Wyllie [76] compare the performance of hybrid techniques, including SCAN-EDF. The disk scheduling algorithm in the CMFS is very similar to SCAN-EDF. The group of requests with the earliest deadline is sent as a group to the disk controller, and the controller uses a method beyond the control of the application to perform this group of requests asynchronously and as efficiently as possible.

Systems that provide support for both real-time and non-real-time data traffic clearly distinguish between the types of guarantees they provide. The general kernel-level support system by Lougher/Shepherd [57] provides hard real-time guarantees of delivery of data, as well as supporting non-real-time access. The hard real-time scheduler ensures all continuous media requests meet their deadlines using a round-robin scheme. This task is simplified because the system assumes very similar data rates for streams and constant rates within a stream. The scheduler uses worst-case estimates of processor execution time and data delivery, and is therefore conservative in its utilization of the hardware resources. A soft real-time scheduler is used during the slack time to provide additional streams of reduced quality, and does not provide deterministic guarantees for these streams. In Crimmins [19], a statistical guarantee is used to classify the transmission of a continuous media stream as successful. A specific threshold of 98% packet delivery for continuous media is used in his experiments, which are designed to model both synchronous and asynchronous traffic.

The CMFS utilizes a real-time scheduler which can be implemented on top of a real-time or non-real-time operating system. The environment for each server component is intended to be a dedicated machine for the sole purpose of running the CMFS. On a non-real-time OS, such as UNIX, there is no mechanism to support hard real-time deadlines, but performance tests by Mechler [62] indicate that missed
deadlines rarely occur if only one user application is active. If other programs are run on the server simultaneously, the real-time guarantees cannot be enforced. These real-time guarantees are primarily for the software tasks that run on the processor.
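The SCAN-EDF-style grouping described above, in which the requests sharing the earliest deadline are dispatched as a group and the controller is left to order them, can be sketched roughly as follows. This is an illustrative sketch only; the request structure and function names are hypothetical, not the CMFS implementation.

#include <stdlib.h>

/* Hypothetical request descriptor: deadline slot plus starting block. */
typedef struct {
    unsigned long deadline;    /* slot by which the data must be buffered */
    unsigned long block_addr;  /* starting block address on the disk      */
} DiskRequest;

static int by_deadline_then_addr(const void *a, const void *b)
{
    const DiskRequest *x = a, *y = b;
    if (x->deadline != y->deadline)
        return (x->deadline < y->deadline) ? -1 : 1;
    if (x->block_addr != y->block_addr)
        return (x->block_addr < y->block_addr) ? -1 : 1;
    return 0;
}

/* Sort pending requests, then return the size of the leading group that
 * shares the earliest deadline; the caller would hand reqs[0..group) to
 * the disk controller, which services them in whatever order it finds
 * most efficient. */
size_t earliest_deadline_group(DiskRequest *reqs, size_t n)
{
    size_t group;
    if (n == 0)
        return 0;
    qsort(reqs, n, sizeof *reqs, by_deadline_then_addr);
    for (group = 1; group < n; group++)
        if (reqs[group].deadline != reqs[0].deadline)
            break;
    return group;
}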
6.6 Encoding Format

Much existing work on continuous media servers has considered particular details of the encoding format for disk layout and transmission mechanisms. The early work focussed on video-on-demand playback and providing methods that could increase the overall bandwidth of the server ([70, 72, 18]). Vin et al. [87] utilize knowledge of the encoding format in dealing with transient network overload. Servers have been designed that take specific characteristics of MPEG video streams into consideration in both data transmission and storage layout policies.

The majority of the related literature does not consider any aspect of the encoding method. Since these papers analyze performance via simulation, the details of the encoding method are not significant except as they distinguish between CBR and VBR streams. The data format details are ignored by the CMFS as well. It is possible within the model of the CMFS to distinguish between more important (or essential) presentation units and less important presentation units as an enhancement to the data delivery process during periods of time when the network bandwidth is not being fully used.

For the systems that do describe specific syntax, the main focus has been on MPEG video encoding [17, 18, 48]. Although the authors explicitly state that their methods may be extended to other encoding formats, it is not made clear that any instance of a system would be capable of efficiently supporting more than one format simultaneously. Chen et al. [17] and Chiueh and Katz [18] specify that their techniques can only be used on MPEG-conforming models of encoding. MPEG is a good instance of an encoding method to use as an example, since it incorporates both intra-frame and inter-frame dependencies. Unfortunately, no consideration is given
to the combination of encoding formats on the same server, or on the same storage devices. If a system uses the unique characteristics of the encoding in optimizing disk layout (as in [18]), this would conflict with other encoding formats.

Several of the simulations have restricted their focus to only one media syntax. Most typically, this has been MPEG-1, with increasing emphasis on MPEG-2. The spatial resolution and constant bit-rate of MPEG-1 make it straightforward to study, but not useful for systems that must grow with developments in compression technology or be commercially viable. Although hardware extensions exist to provide hardware MPEG decoding, present and future generation continuous media systems will need to deal with more efficient and higher quality compression schemes, most likely in an incremental fashion. Thus, the ability to support multiple formats, as in the CMFS, is essential.
6.7 Data Layout Issues

The detailed allocation of continuous media fragments to specific locations on disk storage devices occupies a great proportion of the attention of previous system designers. The primary motivation is to increase the bandwidth in general, or to reduce the potential interference that occurs when multiple requests are serviced from the same disk. As well, striping tends to focus on the details of a particular encoding algorithm, making it unsuitable for a heterogeneous system like the CMFS. On the other hand, some systems do not provide any details of disk block layout. The server of Anderson et al. [5] makes use of the raw disk interface of the operating system (in this particular case, UNIX) without consideration for striping. No detail is given on a mapping between objects and their locations on disk.

A general file server for continuous media must abstract the object-to-disk-block layout policy to the level of the object itself, due to the varying sizes of presentation units and the need to map them to specific disk block locations for efficient retrieval. Additionally, if significant computational energy is used to determine the
optimal locations of the various components of an object, this may compete with system resources available for playback. For an environment which encourages reading and writing concurrently, the optimal allocation of existing streams may prevent new streams from having the same kind of optimal layout without reorganization of the entire disk. Complete on-line reorganization is totally unacceptable from a performance point of view. Taking the system off-line for reorganization is equally undesirable.

In all these methods, the existence of multiple, independent concurrent requests affects the usefulness of the stream-specific disk layouts for individual files. As well, if the effective bandwidth is increased because careful placement and subsequent careful retrieval patterns reduce seek activity for an expected retrieval pattern, then a user request pattern that makes regular use of varying speed/skip modes will negate this enhancement. The effort involved in the placement and striping does not bring a performance benefit, and thus is not worth the complexity.

In Anderson [5], contiguous disk blocks were allocated to a stream to reduce seek times. This provided the ability to achieve higher bandwidth when relatively few streams are active, but is not a necessary condition for the CMFS, as this level of contiguity is not assumed. A method of positioning data on the disk known as multimedia "strands" and "ropes" was developed by Vin and Rangan [89]. The goal was to ensure the relative spacing between successive segments of a stream and the careful selection of appropriate data to place between those segments to guarantee continuous playback. This is only applicable to a specific set of streams, each with a constant bit-rate, retrieved in a specific order. The careful allocation does not perform well with arbitrary skipping of segments and the retrieval of data streams which are out of phase with each other. In particular, the retrieval patterns associated with retrieving only one or two of the accepted video streams in reverse or slow-motion would be very resource intensive, greatly reducing the bandwidth achieved due to extra seek
activity. The CMFS handles this transparently, as it makes no assumptions regarding stream request patterns.

Tierney et al. [84] propose a scheme which clusters data in 2-dimensional "tiles" for each particular image. This is a form of striping which can greatly increase the bandwidth for retrieving a single stream. The general concept of RAID striping is used by Oyang et al. [70] to design a system from a hierarchical-disk point of view. Chen, Kandlur, and Yu [17] propose striping on a segment level for the purposes of load balancing during variable speed retrieval of streams. In the case of fast-motion, entire segments of a stream are skipped to provide the fast motion. If the segments retrieved are not evenly spaced across disk devices, this could alter the relative utilization of the disks in a significant manner, causing hot spots on some disks. This pattern of use would not allow the system to support additional users with the bandwidth that is freed up by skipping data on some of the disks. The actual effect on the number of users would be quite complicated and could provide a further area of research.

The preliminary study by Sincoskie [81] looks at striping data across devices to achieve the ability to retrieve the same stream in parallel for users who are offset in time. A similar approach is taken by Ozden et al. [71], where the buffer space and disk bandwidth necessary to retrieve data for different phases of the same movie are analyzed. Lau and Lui [50] allocate files into equally-sized fragments and place the fragments on disks in a round-robin manner. This distributes the load across the disks, but since VBR compression results in differing frame sizes, the number of fragments required per second can vary as well. This system does not attempt to split frames across fragment boundaries. The method of disk block allocation is reasonably generic, as it does not depend on the encoding format. It is unclear what performance benefit is realized by the striping when a group of disks is scheduled to service a number of requests for different media objects.
A detailed examination of storage allocation for multi-resolution video is provided by Chiueh and Katz [18]. A compression method which utilizes Laplacian pyramids intelligently divides the data in two ways: reference data, which is needed to display the video stream at the lowest level of resolution, and motion data, which has separate components for each resolution level. Thus, one reference file and n - 1 motion files are created for a video with n levels of resolution, and they are located on disk so that a reference file and the motion files do not occupy the same portion of the disk array. A method which spreads the reference files evenly across the disks could increase the number of low-resolution viewers that could be supported in this kind of a system. It could be adapted into the CMFS model, but then the block schedule for an object would need entries for every disk on which the data for the object was stored.

The details of block allocation are not considered in the CMFS. If striping can increase the guaranteed minimum number of blocks which can be read, then it could be an effective lower-level optimization. The admission mechanism does not use this information. Whenever possible, presentation objects are stored contiguously on an individual disk so that seeking is not necessary when reading only one stream. This resulted in a great performance benefit in the execution of the server with large bandwidth video streams. The massively parallel striping methods of the Tiger system [9] can achieve enormous bandwidth for a particular stream by striping across all disks in the system. This is only effective if an appropriate stagger between arrivals occurs.
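As a small illustration of round-robin fragment placement of the kind used in Lau and Lui's scheme above, the following sketch assigns equally-sized fragments of a file to the disks of an array in rotation. It is a simplified model with hypothetical names; real placement code would also track free space and per-disk load.

/* Place fragment `frag_index` of a file on disk (start_disk + frag_index)
 * mod ndisks, spreading consecutive fragments across the array. */
typedef struct {
    int disk;                  /* disk that holds this fragment        */
    unsigned long frag_index;  /* position of the fragment in the file */
} FragmentLocation;

FragmentLocation place_fragment(unsigned long frag_index,
                                int start_disk, int ndisks)
{
    FragmentLocation loc;
    loc.disk = (int)(((unsigned long)start_disk + frag_index)
                     % (unsigned long)ndisks);
    loc.frag_index = frag_index;
    return loc;
}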
6.8 Disk Admission Algorithms

The admission control question for both CBR and VBR streams has been extensively examined, from both the network and the disk points of view. Kamath et al. [43] perform disk admission control by examining the second-by-second variable bit-rate bandwidth needs of a set of streams, but they do not take advantage of slack-time
read-ahead at the server to smooth out data rate peaks. This is the same as the Instantaneous Maximum algorithm from Chapter 4. A variant of this algorithm is also described in Chang and Zakhor [15]. Vin et al. [87] grant admission based on average bit-rate and deal with peaks in resource usage by equitably distributing the loss among the active streams. A model for CBR streams based on data-placement details and disk retrieval deadlines (QPMS) is given in Vin and Rangan [88].

The various algorithms for disk admission considered in Chapter 4 have been simulated or implemented in several systems. The early server implementations are capable of dealing adequately only with admission mechanisms for constant bit-rate continuous media streams. In particular, a number of systems/simulations assume there is a specific rate of consumption for each stream ([18, 31, 73, 70, 76, 86, 94]). Vin and Rangan [89] and Gemmell [31] consider systems in which there may be differences between streams, but not within a stream. This assumption drives the remainder of the design of their systems, making them unsuitable for efficient use with variable bit-rate streams unless the peak rate is used for resource reservation.

One of the first explicit considerations of variability was made by Dey-Sircar et al. [24], with planned bandwidth allocation when supporting multiple streams, but they do not provide a mechanism for allocating the bandwidth. Lau and Lui [50] also consider variable bit-rate retrieval to provide data for the client application. Their algorithms utilize a client-provided time bound on start-up latency (i.e., a deadline) to determine if a stream can be scheduled before the requested time, given a limited set of resources. The admission test considers the peak rate needed to determine if a stream is admissible, delays the starting time, and readjusts the disk tasks to minimize other measures of resource usage. This approach explicitly considers the anticipated length of time required for disk-reading tasks, which may vary over the lifetime of a stream. Performance analysis via simulation, based on statistical models of arrival rates, was done with the conservative, deterministic algorithm. The analysis of Chapter 4 shows that for streams with a reasonable difference between
their average and their peak rates, this method provides an unacceptably low level of utilization.

The admission control and data delivery mechanisms in Tobagi et al. [86] utilize the mechanical characteristics of particular disk drives to derive their model. This is then used to predict the number of CBR users that can be guaranteed to be supported, given a bound on start-up latency. Dan et al. [20] follow a similar model of guaranteeing block delivery while batching requests for the same stream, thereby delaying the acceptance of streams until suitable points in time, known as batching intervals. No explicit consideration is given to VBR within the streams.

Recent work by Dengler et al. [23] and Biersack and Thiesse [8, 7] builds on the work of Knightly et al. [45] and Chang and Zakhor [16], describing admission control methods which provide statistical and deterministic guarantees of disk service for VBR streams. The major focus is data placement strategies, and the use of traffic constraint functions is prominent. Constant Time Length (CTL) placement with deterministic guarantees is investigated in [23], while statistical admission control with Constant Data Length placement is examined in [7].

In Vin et al. [87], a statistical admission control algorithm is presented which considers not only average bit rates but also the distributions of frame sizes and the probability distributions of the number of disk blocks needed during any particular service round. They acknowledge that the algorithm fails (i.e., over-subscribes the disk) in certain circumstances referred to as overflow rounds. In overflow rounds, the system has the complex task of dealing with the inability to read enough data. A greedy disk algorithm attempts to reduce the actual occurrence of overflow rounds, and the system attempts to judiciously distribute the effective frame loss among the subscribed clients. This requires some knowledge of the syntax of the data stream, at least to the point of knowing where display unit (i.e., video frame) boundaries exist and which presentation units are more important than others (i.e., MPEG I-frames vs. MPEG B-frames). Their system is able to give priority to these more
important presentation units.

Chang and Zakhor are also among those who have experimented more directly with more complicated versions of algorithms based on average bit-rates and the distribution of frame sizes. A more complex version of the Average algorithm is given in [13] for Constant Time Length (CTL) video data retrieval. They also investigate Constant Data Length retrieval methods which introduce buffering for the purposes of prefetching portions of the stream and incorporate a start-up latency period. In further work [16], they show via simulation that a variation of deterministic admission control admits 20% more users than their statistical method for a small probability of overload.

In the CMFS, the placement of data on the disk has no effect on the admission control algorithm. It is not always possible to allocate blocks to the streams in such a way as to get higher bandwidth from individual disks. For example, when large bandwidth video objects are stored on a disk, the estimate of performance can be increased if a certain access pattern with fewer seeks is assumed. Unfortunately, the ability to deliver streams at varying values of speed and skip means these kinds of assumptions are not valid in every case.
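The difference between a per-slot test such as Instantaneous Maximum and a test that simulates read-ahead, as the vbrSim algorithm of Chapter 4 does, can be illustrated with the following sketch. It is a deliberately simplified model under stated assumptions — a fixed guaranteed per-slot read rate (minRead), a per-slot block schedule already summed over all streams, and unbounded server buffer space — and it is not the server's actual admission code.

#include <stddef.h>

/* need[t]: total blocks required by all streams in slot t.
 * minread: blocks the disk is guaranteed to read in any slot. */

/* Per-slot test: reject if any single slot exceeds the guaranteed rate. */
int admit_per_slot_peak(const unsigned long *need, size_t nslots,
                        unsigned long minread)
{
    for (size_t t = 0; t < nslots; t++)
        if (need[t] > minread)
            return 0;               /* one peak alone forces rejection */
    return 1;
}

/* Read-ahead simulation: unused capacity in early slots is carried
 * forward as buffered blocks, so later peaks may be covered by data
 * that was read early. */
int admit_with_readahead(const unsigned long *need, size_t nslots,
                         unsigned long minread)
{
    unsigned long ahead = 0;        /* blocks already read ahead */
    for (size_t t = 0; t < nslots; t++) {
        if (need[t] > minread + ahead)
            return 0;               /* slot t fails even with read-ahead */
        ahead = minread + ahead - need[t];  /* surplus becomes read-ahead */
    }
    return 1;
}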
6.9 Network Admission Control Algorithms

The problem of allocating network resources for Variable Bit Rate audio/video transmission has been studied extensively. Zhang and Knightly [95] provide a brief taxonomy of the approaches, from conservative peak-rate allocation to probabilistic allocation using the VBR channels of networks such as ATM. The levels of network utilization with VBR channels are still below what many system designers would consider acceptable. Thus, the use of CBR channels with smoothed network transmission schedules has also been examined. This affects the nature of the admission control algorithms used in these systems [60, 95]. Also, since the resource requirements vary over time, renegotiation of the bandwidth [37] is needed in most
cases to police the network behaviour.

Knightly et al. [45] perform a comparison of different admission control tests in order to determine trade-offs between deterministic and statistical guarantees. The streams used in their tests are parameterized by a traffic constraint function, known as the "empirical envelope". It describes the bandwidth needed at various points during stream transmission, so it is somewhat similar in form and function to the block schedule as presented in this dissertation, although much less detailed. This characterization is used in a system-wide admission control. This is then combined with different packet transfer schemes, and the results do not particularly isolate each subsystem. The results are applied primarily to the network transmission subsystem. The CMFS applies the admission control to the disk and network in series, so that admission performance bottlenecks can be isolated.

The empirical envelope is the tightest upper bound on the network utilization for VBR streams, as proven in Knightly et al. [45], but it is computationally expensive. This characterization has inspired other approximations [91, 35] which are less accurate and less expensive to compute, but which still provide useful predictions of network traffic. In particular, Wrege and Liebeherr [91] utilize a prefix of the video trace as an aid in characterizing the traffic pattern. When combined with statistical multiplexing in the network, high levels of network utilization can be achieved [45].

Zhang et al. [96] have worked on network call admission control methods with smoothed video data, which can take advantage of client buffering capabilities. This reduces the amount of buffering needed at the server and increases potential network utilization. Four different methods of smoothing bandwidth which can be used for constructing bandwidth schedules for admission control are compared by Feng and Rexford [29]. The four main methods are: critical bandwidth allocation, minimum changes bandwidth allocation, minimum variability bandwidth allocation, and
piecewise constant rate transmission and transport (PCRTT). The critical bandwidth allocation is simpler than the next two because it attempts to keep the same rate for as long as possible. This may cause more changes later in the stream, but the process is more efficient. The other algorithms attempt to minimize the number of bandwidth changes and the variability in network bandwidth. The simplest computational technique is PCRTT, which is very similar to the Original network bandwidth characterization scheme. They do not integrate this with particular admission strategies other than peak-rate allocation. The smoothing methods and their performance implications are discussed in more detail in Feng [28].

Bandwidth renegotiation over CBR channels and smoothing are used by Kamiyama and Li [44] in a Video-On-Demand system. McManus and Ross [60] analyze a delivery system that pre-fetches enough of the data stream to allow end-to-end constant bit-rate transmission of the remainder without starvation or overflow at the client, but at the expense of substantial start-up latency. Sen et al. [80] indicate that minimum buffer space can be realized with a latency of between 30 seconds and 1 minute. The model used in the CMFS has a fixed bound on start-up latency of a very small number of seconds (with 500 msec slots, this bound is less than 2 seconds), but requires that the data be sent at varying bit-rates to efficiently avoid starvation or buffer overflow at the client.
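As a concrete illustration of piecewise constant rate smoothing (the PCRTT family mentioned above, and the closest relative of the network bandwidth characterizations used in this dissertation), the sketch below computes one constant rate per fixed-length interval by averaging the per-slot requirements. This is a simplified model: it assumes the client can buffer the within-interval variation, it ignores renegotiation cost, and the names are illustrative.

#include <stddef.h>

/* Collapse a per-slot bandwidth schedule into one constant rate per
 * interval of `slots_per_interval` slots; the last (possibly shorter)
 * interval is averaged over its actual length.  Returns the number of
 * interval rates written to `interval_rate`. */
size_t smooth_piecewise_constant(const double *per_slot_rate, size_t nslots,
                                 size_t slots_per_interval,
                                 double *interval_rate)
{
    size_t nintervals = 0;
    for (size_t start = 0; start < nslots; start += slots_per_interval) {
        size_t end = start + slots_per_interval;
        if (end > nslots)
            end = nslots;
        double sum = 0.0;
        for (size_t t = start; t < end; t++)
            sum += per_slot_rate[t];
        interval_rate[nintervals++] = sum / (double)(end - start);
    }
    return nintervals;
}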
6.10 Network Transmission

The CMFS and nearly all of the research efforts in continuous media servers have taken place in the context of high-speed networks, where bandwidth requirements can be specified to the network and the network can support some type of reservation/guarantee of the resource. The first network environment where this was investigated was ATM networks with VBR channels. As mentioned in the previous section, recent effort has been concerned with CBR connections. One reason for this is that the burstiness of a video stream can be predicted, allowing CBR channels to
be used effectively. Another reason is that specifying the VBR connection parameters for the minimum acceptable requirements results in consistent data loss during overload times. This loss can be prevented by using CBR channels. In either VBR or CBR channels, smoothing can be used to handle the burstiness problems associated with transmission of VBR data.

Many smoothing techniques have been provided which absolve the network of the responsibility of dealing with congestion caused by variable bit-rate streams. Lam et al. [48] smooth the data rate required, explicitly examining the data by looking ahead in the stream. Statistical properties of live video can be used to predict the bit-rates of the short-term future of the stream, enabling smoothing of live video as well. Encoding-specific techniques are required at the server to act as filters that enable smooth network transmission rates. As well, the problems associated with variable bit-rate disk access are not addressed. Yavatkar and Manoj [94] require variations in the amount of data sent across the network connection to increase the Quality of Service provided to client applications. The variation in the bit-rate of the encoding algorithm is not considered, but selective transmission, rate-based flow control, and selective feedback are used in the simulation of the Quasi-Reliable Multicast Transmission Protocol (QMTP). Use of the network thus varies with the availability of bandwidth, but not in a planned way that corresponds to the data rate of the stream itself. In the Heidelberg Transport Protocol (HeiTP) [22], congestion at a network interface due to variable bit-rate traffic is handled by dynamically reducing the amount of data transmitted/read off disk, which in turn leads to degradation in service to the client. Crimmins [19] uses a variable frame size in his simulation study but has a limited scope of analysis, considering only the effect on data transmission over a token ring network.

Network protocol considerations have a considerable effect on performance.
Previous work has been done using TCP/IP ([5] and [85, 84]). This has the advantage of being a standard, well-understood protocol with reliable transmission characteristics. In an environment that is more concerned with timeliness than accuracy, a protocol that only performs retransmissions on control requests would be very desirable. It has been pointed out that protocols which require retransmission of video data for reliability are not useful in continuous media environments [49]. Thus the exploration of other protocols is necessary. In particular, the idea of selective retransmission with extra bandwidth [82] can increase the reliability of the data stream. This is useful as long as no additional latency is introduced. Most systems use some variation of unreliable data transfer in which missing data can be identified.
6.11 Summary

In summary, the related work has been very extensive in the design and simulation analysis of continuous media servers. Constraining the problem has enabled significant results to be achieved. The bandwidth available from magnetic disks has reached the point where multiple video streams can be supported from a single disk. Techniques for careful block layout tend to restrict the flexibility and heterogeneity of the system unnecessarily when varying request patterns (of speed and skip) are used.

Scalability of a system has been explored, and several designs accommodate the ability of a system to grow. None have obtained actual performance results from a large-scale server. The early implementations were not even capable of such expansion. The complete models of servers are quite similar in general design to the CMFS described in this dissertation, but do not describe admission control in sufficient detail. The discussions of admission control algorithms range from specific dependence on detailed characteristics of the disk device to abstractions that consider only peak-rate allocations. The CMFS design has used a complete model and considered the variability of the data stream itself. This is the most
important aspect of the design, since delivering variable bit-rate media streams to client applications is the aim of the server.

The systems and their associated issues are given in Table 6.1. Much other work is not included in this matrix because it deals with only a single issue. Thus, most of the work listed has a fairly complete system model to back up the subset of issues that are discussed. There has been very little discussion of synchronization of multiple streams in the more comprehensive systems. Some of the issues discussed in the existing literature have not been specifically considered in this dissertation, but have been incorporated into the model. For example, fault tolerance could be built in, and replication issues are discussed in Kraemer [46].
Table 6.1: Research Summary. The table compares the surveyed systems — Anderson's ACME/CMFS, Du (Minnesota), Calliope (AT&T), Little (Boston), Kamath (U. Mass), Knightly (UCB/Virginia), Chang/Zakhor (Berkeley), Pegasus (Cambridge), Shepherd (Lancaster), Vin (Texas), Rangan/Vin (UCSD), TigerShark (IBM), Feng (Michigan), McManus/Ross (Penn), Biersack (ESPRIT), Kumar (IBM), Oyang (Taiwan), Tiger/Netshow (Microsoft), Tobagi (Starlite Networks), Paek (Columbia), Tierney (LBL), Freedman/DeWitt (SPIFFI), Ozden (AT&T), and the UBC CMFS — along the issues discussed in this chapter: system model, synchronization of multiple streams, scalability, user interface, data layout, data format, buffer management, disk admission, network admission, transmission protocol, real-time scheduling, and replication/fault tolerance.
Chapter 7
Conclusions and Future Work

7.1 Conclusions

In this dissertation, it has been demonstrated that a Continuous Media File Server (CMFS) for Variable Bit-Rate Data, based on the principles of scalability, heterogeneity, and an abstract model of storage resources and performance, can be implemented to provide a human user with flexible access primitives for synchronization, while making near-optimal use of the disk and network resources available. The most significant aspect of the server is that it incorporates the variable bit-rate nature of compressed continuous media (particularly video) into the entire system design, from the user interface to the resource reservation mechanisms at the server and the data delivery method.

This server has been implemented and tested on a variety of hardware platforms. The performance testing has utilized a wide range of video streams that are typical of a news-on-demand environment. The range of video frame rates provides near full-motion to full-motion (from 20 to 30 fps) at high resolution (640 x 480 pixels). Multiple encoding formats were used, but the primary encoding format was Motion JPEG, due to limitations in the encoding hardware available. The playback duration ranged from 40 seconds to 10 minutes per clip, which is in the appropriate range for news stories, sports highlights, or music videos.
Three major contributions have been identified: 1) a complete system model and a working, efficient server implementation; 2) the disk admission algorithm that incorporates variable bit-rates and send-ahead into the admission decision; and 3) the network smoothing characterization and admission algorithm. Both of these admission algorithms have been integrated into the server and analyzed with respect to efficiency and the amount of the resource that can be allocated to VBR streams while still providing deterministic data delivery guarantees.

The system model is more comprehensive than most existing models in that it can accommodate more specialization in terms of heterogeneity, scalability, reliability and flexibility, while not focusing on any one of these aspects to the detriment of the others. While replication and migration are extensions to the research of this dissertation which have already been implemented, the mechanisms to introduce these facilities were present in the existing model. The modular design permits systems of varying scale to be implemented. As well, the design of the user interface to the server facilities enables simple client applications or complicated continuous media presentations to be developed independently of most of the server details. These applications may even make use of objects which are located on different servers. The resource allocation/reservation schemes were integrated into a complete server that allows client applications the flexibility of storing individual mono-media streams and retrieving them in almost arbitrary manners for synchronized presentation to a human user. The abstract disk model permits streams of widely differing VBR profiles and different encoding formats to be stored on the same server node without adverse effect on the guaranteed performance of the server.

The most significant contribution is the development of a disk admission control algorithm that explicitly utilizes a detailed bit-rate profile for each stream, simulating the use of server buffer space for the reading ahead of data. This algorithm is named vbrSim, and it emulates the variable bit-rate retrieval of the stream during the admission process, using a worst-case guarantee of disk bandwidth.
The vbrSim algorithm significantly outperforms the deterministic algorithms presented in Chapter 4 in terms of admission performance and is linear in execution time. The performance tests evaluated the algorithms for scenarios of large bandwidth video streams of differing variability and arrival patterns, ranging from simultaneous arrivals to a stagger of 20 seconds between requests. For most of the scenarios with staggered arrivals, the stream requests were ordered from longest to shortest to ensure that all streams were reading contiguously for some portion of the scenario.

In terms of admission performance, the Simple Maximum algorithm accepted very few simultaneous stream requests and was incapable of accepting any scenarios requesting more than 50% of the disk capacity. Even smaller scenarios were regularly rejected. The Instantaneous Maximum algorithm accepts scenarios which are approximately 20% larger than Simple Maximum, but the vbrSim algorithm regularly accepts scenarios that are at least another 20% larger. The admission performance of the first two algorithms was unaltered by introducing inter-arrival stagger into the request pattern. These algorithms could not accept scenarios which were larger, and the relative percentage of the disk decreased, due to the higher achieved bandwidth from the disk. The admission performance of the vbrSim algorithm improved with staggered requests, due to the incorporation of achieved read-ahead and guaranteed future read-ahead. With reasonably small values of stagger, scenarios that request a cumulative bandwidth which is greater than minRead and close to the actual bandwidth achieved by the disk system can be accepted. The largest scenarios accepted by the vbrSim algorithm with a stagger of 10 seconds sustained more than 100% of the long-term disk bandwidth achieved for the execution of that scenario. Assuming the guaranteed level of disk performance in the admission algorithm enabled a large degree of smoothing of the peaks of the variable bit-rate disk schedule.

The observations from the performance tests showed that the vbrSim algorithm is sensitive to the variability of the bit rate within the stream. Even the
high-variability streams achieve high utilization of resources while providing delivery guarantees. Constant bit-rate streams would enable the system to achieve perfect utilization of the network bandwidth. As well, the most complete usage of the disk occurs with a disk that achieves a constant bandwidth. With simultaneous arrivals, the vbrSim algorithm accepted the largest percentage of the disk resource when CBR streams were stored on the server. With variable bit-rate streams, the existence of bandwidth peaks did cause many scenarios to be rejected. With staggered arrivals, there was not much difference in the acceptance rate between the different types of streams.

Buffer space requirements for the vbrSim algorithm were found to be large, but not excessive. The scenarios which requested more than minRead blocks per slot required significant buffer space to guarantee delivery from the server for every stream. Most scenarios that were acceptable in terms of bandwidth required less than 200 MBytes of server buffer space on the disk. The required space appeared to grow linearly with the cumulative bandwidth of the scenario for requests above minRead. High-variability streams require more buffer space than low-variability streams while the cumulative request is below minRead; if the request is above minRead, there is no significant difference in buffer space requirements between the two types of streams. Adding more client buffer space did not affect the number of high-bandwidth video streams that could be accepted. As well, increasing the amount of time between request arrivals allowed more streams to be accepted only when those streams were of short playback duration. This is because a significant percentage of each stream could be held in the server buffer space.

The final major contribution is the development and integration of a network admission control algorithm and a network bandwidth smoothing technique. This enables the network subsystem to transmit data on each real-time data connection using renegotiated constant bit-rates for reasonably long periods of time.
While the vbrSim algorithm was shown to be technically feasible for the network bandwidth as well as the disk bandwidth, it was not used for the network because it depended too much on the disk system having significant read-ahead and very large client buffer space in order to achieve any substantial guaranteed send-ahead. A slightly modified Instantaneous Maximum algorithm with the Smoothed network bandwidth characterization can accept scenarios with over 90% of the network interface limit requested. This is 10-15% better than using the Original network characterization method and far superior to the Peak characterization. With respect to different stream types, the Smoothed algorithm showed greater performance improvements for the high-variability streams, because there are more peaks to smooth. Most of the performance testing was conducted using 20-second network slots for renegotiation and admission control purposes. More detailed study showed that, for the particular type of streams, using a 10-second network slot resulted in better admission performance. Some of the results are not conclusive, due to the small number of data points upon which to base a conclusion.
7.2 Future Work

There has been a great deal of work exploring the issues involved in storing and retrieving continuous media. Hardware limitations often restricted the more significant theoretical work to the context of simulation studies. Recent economics have made implementations more practical and thus provided a more tangible environment for evaluation. The implementation in this dissertation is one of the first VBR servers that has been examined in detail. This has opened up a number of possibilities for future work.
7.2.1 Long Streams

The results in Chapter 4 showed that the benefits of read-ahead and staggered arrivals were more significant for shorter streams. One of the reasons behind this
observation is that streams utilize an increasing amount of buffer space as they increase in total size. When staggered arrivals permitted a significant percentage of a stream's data to be read ahead into server buffer space or sent ahead to be stored in client buffer space, the utilization of the disk could be increased. More work on longer streams is necessary to find the point at which the sophisticated admission algorithms no longer improve performance and server buffer space becomes the primary limitation.
7.2.2 Disk and Network Configurations

As an extreme point on the server design continuum, a CMFS could be implemented with one video clip per disk, and one disk per server node. Thus, the number of simultaneous users per server node would be a direct function of the popularity of the video clip. Such a configuration would be very expensive and lead to poor utilization, due to the phenomenon of locality of reference. There has been work done on storage allocation of movies based on popularity, and this work could be adapted to the context of the CMFS to determine efficient configurations of servers. In particular, this extension could help determine the optimal number of objects to be stored on a particular disk, the levels of replication within the disks on a server node, and the number of disks that should be attached to a server node as a function of its network interface capability.
7.2.3 Relaxing the value of minRead

Additional empirical studies with more aggressive values of minRead would give a further indication of precisely how conservative the vbrSim algorithm is for typical sets of streams. If minRead can be set to the actual average of the recent past, what is the potential that a bad admission decision could be made?
7.2.4 Variants of the Average Algorithm

In the performance tests, the Average algorithm used minRead as its estimate of disk performance. No stream that caused the average request bandwidth to exceed minRead was accepted. This process provided admission decisions that were too conservative in some cases and too aggressive in others, because it did not factor the shape of the bandwidth requirements into the acceptance decision. One option would be to use the observed average as the estimate of disk performance. Obviously, this would tend to eliminate the possibility of being too conservative but increase the potential for making aggressive admission decisions. A more careful study of the sensitivity to these parameters may give additional insight into the overall benefit of the vbrSim algorithm.
7.2.5 Reordering Requests

One of the performance benefits of the system is that contiguous reading off the disk increases the achieved bandwidth. This often permits additional streams to be accepted. With large amounts of server buffer space, many streams have dozens of disk slots of buffer space at the server. Once the server is in steady state, buffers are released at a slower rate than the disk can read. When the disk system is approaching steady state, it is likely ahead on most of the existing streams. Thus, it could perform even better by readjusting the deadlines of the data for some streams so as to perform more contiguous reading. This would enable steady state to be reached more quickly, and may have the benefit of enabling more streams to be accepted.
Bibliography [1] ISO/IEC JTC1 CD 10918. MJPEG Digital compression and coding of continuous-tone still images. Technical report, ISO, 1993. [2] Chris Adie. A Survey of Distributed Multimedia Research. Technical Report RARE Project OBR (92) 046v2, Reseaux Assizes pour la Recherche European, January 1993. [3] Chris Adie. Network Access to Multimedia Information. Technical Report RARE Project OBR (93) 015, Reseaux Assizes pour la Recherche European, August 1993. [4] D. P. Anderson and G. Homsy. A Continuous Media I/O Server and Its Synchronization Mechanism. IEEE Computer, 24(10):51{57, October 1991. [5] D. P. Anderson, Y. Osawa, and R. Govindan. A File System for Continuous Media. ACM Transactions on Computer Systems, 10(4):311{337, November 1992. [6] C. Bernhardt and E. Biersack. The Server Array: A Scalable Video Server Architecture. In O. Spaniol, W. Eelsberg, A. Danthine, and D. Ferrari, editors, High Speed Networking for Multimedia Applications, chapter 5, pages 103{125. Kluwer Publ., March 1996. [7] E. W. Biersack and F. Thiesse. Statistical Admission Control in Video Servers with Constant Data Length Retrieval of VBR Streams. In Third International Conference on Multimedia Modeling, Toulouse, France, November 1996. [8] E. W. Biersack and F. Thiesse. Statistical Admission Control in Video Servers with Variable Bit Rate Streams and Constant Time Length Retrieval. In Euromicro '96, Prague, Czech Republic, September 1996. [9] W. J. Bolosky, J. S. Barrera III, R. P. Draves, R. P. Fitzgerald, G. A. Gibson, M. B. Jones, S. P. Levi, N. P. Myhrvold, and R. F. Rashid. The Tiger Video 210
Fileserver. In 6th International Workshop on Network and Operating Systems Support for Digital Audio and Video, pages 29{35, Zushi, Japan, April 1996. [10] William J. Bolosky, Robert. P. Fitzgerald, and John R. Douceur. Distributed Schedule Management in the Tiger Video Fileserver. In SOSP 16, pages 212{ 223, St. Malo, France, October 1997. [11] Dick C.A. Bulterman and Rovert van Liere. Multimedia Synchronization and UNIX. In 2nd International Workshop on Network and Operating Systems Support for Digital Audio and Video, Heidelberg, Germany, November 1991. [12] W.C. Chan and E. Geraniotis. Near-Optimal Bandwidth Allocation for MultiMedia Virtual Circuit Switched Networks. In INFOCOMM, pages 749{757, San Francisco, CA, October 1996. [13] E. Chang and A. Zakhor. Admissions Control and Data Placement for VBR Video Servers. In 1st IEEE International Conference on Image Processing, pages 278{282, Austin, TX, November 1994. [14] E. Chang and A. Zakhor. Disk-based Storage for Scalable Video. In Unknown. web page down, pages 278{282, Austin, TX, November 1994. [15] E. Chang and A. Zakhor. Variable Bit Rate MPEG Video Storage on Parallel Disk Arrays. In 1st International Workshop on Community Networking Integrated Multimedia Services to the Home, pages 127{137, San Francisco, CA, july 1994. [16] E. Chang and A. Zakhor. Cost Analyses for VBR Video Servers. In IST/SPIE Multimedia Computing and Networking, pages 381{397, San Jose, January 1996. [17] M. Chen, D. D. Kandlur, and P. S. Yu. Support for Fully Interactive Playout in a Disk-Array-Based Video Server. In ACM Multimedia, pages 391{398, San Francisco, CA, October 1994. [18] T. Chiueh and R. H. Katz. Multi-Resolution Video Representation for Parallel Disk Arrays. In ACM Multimedia, Anaheim, CA, June 1993. [19] S. Crimmons. Analysis of Video Conferencing on a Token Ring Local Area Network. In ACM Multimedia, Anaheim, CA, June 1993. [20] A. Dan, D. Sitaram, and P. Shahabuddin. Scheduling Policies for an OnDemand Video Server with Batching. In ACM Multimedia, pages 15{23, San Francisco, CA, October 1994. 211
[21] S. E. Deering. Multicast Routing in a Datagram Internetwork. PhD thesis, Stanford University, December 1991. [22] L. Delgrossi, C. Halstrick, D. Hehmann, R. G. Herrtwich, O. Krone, J. Sanvoss, and C. Vogt. Media Scaling for Audiovisual Communication with the Heidelberg Transport System. In ACM Multimedia, Anaheim, CA, June 1993. [23] J. Dengler, C. Bernhardt, and E. W. Biersack. Deterministic Admission Control Strategies in Video Servers with Variable Bit Rate Streams. In Interactive Distributed Multimedia Systems and Services, European Workshop IDMS'96, Heidelberg, Germany, March 1996. [24] J. K. Dey-Sircar, J.Salehi, J. Kurose, and D. Towsley. Providing VCR Capabilities in Large-Scale Video Servers. In ACM Multimedia, pages 25{32, San Francisco, CA, October 1994. [25] E. Dubois, N. Baaziz, and M. Matta. Impact of Scan Conversion Methods on the Performance of Scalable Video Coding. In IST/SPIE Proceedings, San Jose, CA, February 1995. [26] S. El-Henaoui, R. Coelho, and S. Tohme. A Bandwidth Allocation Protocol for MPEG VBR Trac in ATM Networks. In IEEE INFOCOMM, pages 1100{ 1107, San Francisco, CA, October 1996. [27] Anwas Elwalid, , Daniel Heyman, T. V. Lakshman, Debasis Mitra, and Allan Weiss. Fundamental Results on the Performance of ATM Multiplexers with Applications to Video Teleconferencing. In ACM SIGMETRICS '95, pages 86{97. ACM, May 1995. [28] Wu Chi Feng. Video-On-Demand Services: Ecient Transportation and Decompression of Variable-Bit-Rate Video. PhD thesis, University of Michigan, 1997. [29] Wu Chi Feng and Jennifer Rexford. A Comparison of Bandwidth Smoothing Techniques for the Transmission of Prerecorded Compressed Video. In IEEE INFOCOMM, pages 58{66, Los Angeles, CA, June 1997. [30] D. Finkelstein, R. Mechler, G. Neufeld, D. Makaro, and N. Hutchinson. Real-Time Threads Interface. Technical Report 95-07, University of British Columbia, Vancouver, B. C., March 1995. [31] D. J. Gemmell. Multimedia Network File Servers: Multi-channel Delay Sensitive Data Retrieval. In ACM Multimedia, pages 243{250, Anaheim, CA, June 1993. 212
[32] J. Gemmell and S. Christodoulakis. Principles of Delay-Sensitive Multimedia Storage and Retrieval. ACM Transactions on Information Systems, 10(1), 1992. [33] S. Ghandeharizadeh, S. Ho Kim, W. Shi, and R. Zimmerman. On Minimizing Startup Latency in Scalable Continuous Media Servers. In IST/SPIE Multimedia Computing and Networking, San Jose, CA, February 1997. [34] Pawan Goyal, Harrick M. Vin, and Prashant J. Shenoy. A Reliable, Adaptive Network Protocol for Video Transport. In IEEE INFOCOMM, San Francisco, CA, October 1996. [35] Marcel Graf. VBR Video over ATM: Reducing Network Requirements through Endsystem Trac Shaping. In IEEE INFOCOMM, pages 48{57, Los Angeles, CA, June 1997. [36] Carsten Griwodz, Michael Bar, and Lars C. Wolf. Long-term Movie Popularity Models in Video-on-Demand Systems. In ACM Multimedia, pages 349{357, Seattle, WA, November 1997. [37] M. Grossglauser, S. Keshav, and D. Tse. RCBR: A Simple and Ecient Service for Multiple Time-Scale Trac. In ACM SIGCOMM, pages 219{230, Boston, MA, August 1995. [38] ISO/IEC JTC1/SC29/WG 11 Editorial Group. MPEG-2 DIS 13818-7 - Video (Ggeneric Coding of moving pictures and associated audio information. Technical report, International Standards Organization, Geneva, Switzerland, 1996. [39] ITU-T Recommendation H.263. Video Coding for low bitrate communication. Technical report, CCITT, 1995. [40] R. L. Harkin and F. B. Schmuck. The Tiger Shark File System. In IEEE Spring Compcon, Santa Clara, CA, February 1996. [41] John Hartman and John K. Ousterhout. ZEBRA: A Striped Network File System. Technical Report UBC/CSD 92/683, University of California, Berkeley, Berkeley, CA, 1992. [42] Andrew Heybe, Mark Sullivan, and Paul England. Callipoe: A Distributed, Scalable Multimedia Server. In ACM USENIX Annual Technical Conference, San Diego, CA, January 1996. [43] M. Kamath, K. Ramamritham, and D. Towsley. Continuous Media Sharing in Multimedia Database Systems. Technical Report 94-11, Department of Computer Science, University of Massachussets, Amherst MA, 1994. 213
[44] N. Kamiyama and V. Li. Renegotiated CBR Transmission in Interactive Videoon-Demand Systems. In IEEE Multimedia, pages 12{19, Ottawa, Canada, June 1997. [45] E. W. Knightly, D. E. Wrege, J. Liebeherr, and H. Zhang. Fundamental Limits and Tradeos of Providing Deterministic Guarantees to VBR Video Trac. In ACM SIGMETRICS '95. ACM, May 1995. [46] Oliver Kraemer. A Load Sharing and Object Replication Architecture for a Distributed Media Fileserver. Master's thesis, Universitat Karlsruhe, January 1997. [47] M. Kumar, J.L. Kouloheris, M.J. McHugh, and S. Kasera. A High Performance Video Server for Broadband Network Environment. In IST/SPIE Multimedia Computing and Networking, San Jose, CA, January 1996. [48] Simon S. Lam, Simon Chow, and David K. Y. Yau. An Algorithm for Lossless Smoothing of MPEG Video. In ACM SIGCOMM, London, England, September 1994. [49] Bernd Lamparter, Wolfgang Eelsberg, and Norman Michl. A Movie Transmission Protocol for Multimedia Applications. In 4th IEEE ComSoc International Workshop on Multimedia Communications, Monterey, CA, 1992. [50] S. W. Lau and J. C. S. Lui. A Novel Video-On-Demand Storage Architecture for Supporting Constant Frame Rate with Variable Bit Rate Retrieval. In 5th International Workshop on Network and Operating System Support for Digital Audio and Video, Durham, NH, 1995. [51] Andrew Laursen, Jerey Olkin, and Mark Porter. Oracle Media Server: Providing Consumer Based Interactive Access to Multimedia Data. In ACM SIGMOD '94, pages 470{477, April 1994. [52] Ian M. Leslie, Derek McAuley, and Sape J. Mullender. Pegasus - OperatingSystem Support for Distributed Multimedia Systems. Technical Report 282, University of Cambridge, 1992. [53] Lian Li and Nicolas Georganas. MPEG-2 Coded and Uncoded Stream Synchronization Control for Real-time Multimedia Transmission and Presentation over B-ISDN. In ACM Multimedia, San Francisco, 1994. [54] C. J. Lindblad, D. J. Wetherall, W. F. Stasios, J. F. Adam, H. H. Houh, M. Ismets, D. R. Bacher, B. M. Phillips, and D. L. Tennenhouse. ViewStation 214
Applications: Intelligent Video Processing Over a Broadband Local Area Network. In High-Speed Networking Symposium, USENIX Association, Oakland, CA, August 1-3 1994. USENIX Association. [55] T.D.C. Little and D. Venkatesh. Client-Server Metadata Management for the Delivery of Movies in a Video-On-Demand Systems. In First International Workshop on Services in Distributed and Networked Environments, Prague, Czech Republic, 1994. [56] J. C. L. Liu, J.i Hseih, and D. H.C. Du. Performance of A Storage System for Supporting Dierent Video Types and Qualities. IEEE Journal on Selected Areas in Communications: Special Issue on Distributed Multimedia Systems and Technology, 14(7):1314{1341, sep 1996. [57] P. Lougher and D. Shepherd. The Design of a Storage Server for Continuous Media. The Computer Journal (Special Issue on Multimedia), 36(1):32{42, February 1993. [58] D. Makaro, G. Neufeld, and N. Hutchinson. An Evaluation of VBR Admission Algorithms for Continuous Media File Servers. In ACM Multimedia, pages 143{ 154, Seattle, WA, November 1997. [59] R. Marasli, P. D. Amer, and P. T. Conrad. Retransmissin-Based Partially Reliable Transport Service: An Analytical Model. In IEEE INFOCOMM, pages 621{629, San Francisco, CA, October 1996. [60] J. M. McManus and K. W. Ross. Video on Demand over ATM: Constant-Rate Transmission and Transport. In IEEE INFOCOMM, pages 1357{1362, San Francisco, CA, October 1996. [61] Jean M. McManus and Keith W. Ross. Video on Demand over ATM: ConstantRate Transmission and Transport. IEEE Journal on Selected Areas in Communication, 14(6), August 1996. [62] R. Mechler. A Portable Real Time Threads Environment. Master's thesis, University of British Columbia, April 1997. [63] Roland Mechler. Cmfs Data Stream Protocol. Unpublished UBC Tech Report, 1997. [64] Microsoft. Netshow overview. http://207.68.247.53/Theater/overview.htm, 1998. 215
[65] ISO/IEC JTC1/WG11 MPEG. International Standard ISO 11172 Codin of moving pictures and associated audio for digital storage media up to 1.5 mb/s. Technical report, ISO, Geneva, Switzerland, 1993. [66] S. J. Mullender, I. M. Leslie, and D. McAuley. Operating-System Support for Distributed Multimedia. In USENIX High-Speed Networking Symposium Procceedings, pages 209{219, Oakland, CA, August 1-3 1994. USENIX Association. [67] Michael N. Nelson, Mark Linton, and Susan Owicki. A Highly Available, Scalable ITV System. In SOSP 15, pages 54{67, April 1994. [68] G. Neufeld, D. Makaro, and N. Hutchinson. Design of a Variable Bit Rate Continuous Media File Server for an ATM Network. In IST/SPIE Multimedia Computing and Networking, pages 370{380, San Jose, CA, January 1996. [69] J. Nieh and M. S. Lam. SMART UNIX SVR4 Support for Multimedia Applications. In IEEE Multimedia, pages 404{414, Ottawa, Canada, June 1997. [70] Yen-Jen Oyang, Meng-Huang Lee, and Chun-Hung Wen. A Video Storage System for On-Demand Playback. Technical Report NTUCSIE94-02, National Taiwan University, Taiwan, 1994. [71] B. Ozden, A. Biliris, R. Rastogi, and A. Silberschatz. A Low-Cost Storage Server for Movie on Demand Databases. In 20th VLDB Conference, pages 594{605, Santiago, Chile, 1994. [72] Seungyup Paek and Paul Bocheck Shi-Fu Chang. Scalable MPEG2 Video Servers with Heterogeneous QoS on Parallel Disk Arrays. In Proceedings of 5th International Workshop on Network and Operating System Support for Digital Audio and Video (NOSSDAV'95), Durham, NH, April 1995. [73] P.V. Rangan and H.M. Vin. Designing File Systems for Digital Video and Audio. In Proceedings 13th Symposium on Operating Systems Principles (SOSP '91), Operating Systems Review, volume 25, pages 81{94, October 1991. [74] P.V. Rangan and H.M. Vin. Ecient Storage Techniques for Digital Continuous Multimedia. IEEE Transactions on Knowledge and Data Engineering Special Issue on Multimedia Information Systems, August 1993. [75] P.V. Rangan, H.M. Vin, and S. Ramanathan. Designing an On-Demand Multimedia Service. IEEE Communications Magazine, 1992. [76] A. L. Reddy and J. Wyllie. Disk Scheduling in a Multimedia I/O System. In ACM Multimedia, Anaheim, CA, June 1993. 216
[77] A. Narashima Reddy. Improving Latency in an Interactive Video Server. In IST/SPIE Multimedia Computing and Networking, San Jose, CA, February 1997. [78] Lawrence A. Rowe and Brian C. Smith. A Continuous Media Player. In 3rd International Workshop on Network and Operating Systems Support for Digital Audio and Video, San Diego, CA, November 1992. [79] H Schulzrinne, S. Casner, R. Frederick, and V. Jacobsen. RFC 1889: RTP A Transport Protocol for Real-Time Applications. In IETF - AudioVideo Working Group, 1996. [80] Subrahata Sen, Jayanta Dey, James Kurose, John Stankovic, and Don Towsley. CBR Transmission of VBR Stored Video. In SPIE Symposium on Voice Video and Data Communications: Multimedia Networks: Security, Displays, Terminals, Gateways, Dallas, TX, November 1997. [81] W.C. Sincoskie. System Architecture for a Large-Scale Video on Demand Multimedia Service. Computer Networks and ISDN Systems, 22(1):155{162, 1991. [82] B. C. Smith. Implementation Techniques for Continuous Media Systems and Applications. PhD thesis, University of California, Berkeley, 1994. [83] W. T. Strayer, B. J. Dempsey, and A. C. Weaver. XTP: The Xpress Transport Protocol. Addison Wesley Publishing, October 1992. [84] B. Tierney, W. Johnston, H. Herzog, G. Hoo, G. Jin, and J. Lee. System Issues in Implementing High Speed Distributed Parallel Data Storage Systems. In USENIX High-Speed Networking Symposium, 1994. [85] B. Tierney, W. Johnston, H. Herzog, G. Hoo, G. Jin, J. Lee, L. T. Chen, and D. Rotem. Distributed Parallel Data Storage Systems: A Scalable Approach to High Speed Image Servers. In ACM Multimedia, San Francisco, CA, October 1994. [86] F. A. Tobagi, J. Pang, R. Baird, and M. Gang. Streaming RAID - A Disk Array Management System For Video Files. In ACM Multimedia, pages 393{ 400, June 1993. [87] H. M. Vin, P. Goyal, Alok Goyal, and Anshuman Goyal. A Statistical Admission Control Algorithm for Multimedia Servers. In ACM Multimedia, pages 33{40, San Francisco, CA, October 1994. 217
[88] H. M. Vin and P. V. Rangan. Admission Control Algorithms for Multimedia On-Demand Servers. In 3rd International Workshop on Network and Operating Systems Support for Digital Audio and Video, 1992.
[89] H.M. Vin and P.V. Rangan. Designing a Multi-User HDTV Storage Server. IEEE Journal on Selected Areas in Communication: Special Issue on High Definition Television and Digital Video Communication, 11(1), August 1993.
[90] J.W. Wong, D. Evans, N. Georganas, J. Brinskelle, G. Neufeld, and D. Makaroff. An MBone-based Distance Education System. In International Conference on Computer Communications, Cannes, France, 1997.
[91] Dallas E. Wrege and Jorg Liebeherr. Video Traffic Characterization for Multimedia Networks with a Deterministic Service. In IEEE INFOCOM, pages 537-544, San Francisco, CA, March 1996.
[92] CCITT Study Group XV. CCITT Rec. H.261: Video Codec for Audiovisual Services at p x 64 kbit/s. Technical report, CCITT, Geneva, Switzerland, 1990.
[93] D. Yau and S. Lam. Adaptive Rate-Controlled Scheduling for Multimedia Applications. In ACM Multimedia, Boston, MA, November 1996.
[94] R. Yavatkar and L. Manoj. Optimistic Strategies for Large-Scale Dissemination of Multimedia Information. In ACM Multimedia, Anaheim, CA, June 1993.
[95] H. Zhang and E. W. Knightly. A New Approach to Support Delay-Sensitive VBR Video in Packet-Switched Networks. In 5th International Workshop on Network and Operating Systems Support for Digital Audio and Video, pages 381-397, Durham, NH, April 1995.
[96] Z. Zhang, J. Kurose, J. D. Salehi, and D. Towsley. Smoothing, Statistical Multiplexing and Call Admission Control for Stored Video. IEEE Journal on Selected Areas in Communications, Special Issue on Real-Time Video Services in Multimedia Networks, 15(6), August 1997.
Appendix A
CMFS Application Programmer's Interface
The Distributed Continuous Media File System provides the client with the following interface. These routines use the underlying Send/Receive/Reply IPC mechanism supported via the UBC Real-Time Threads (RTT) kernel [30]. To use the CMFS as it is currently implemented, the client must be using the RTT kernel. To make use of the API, an application must have the following statement: #include
This file contains data structures and data types for further use of the interface. In particular, status values are defined for the return codes of API calls. They are implemented as the enumerated type CmfsStatus.
typedef enum {
    STREAMOK = 0,        NOTFOUND = -1,       NOTACCESSIBLE = -2,
    NETWORKUNABLE = -3,  SERVERUNABLE = -4,   CLIENTREFUSED = -5,
    DATALOST = -6,       NOTYETIMPL = -7,     COMMERROR = -8,
    ENDOFDATA = -9,      INVALIDREQUEST = -10, NOSPACE = -11,
    CLIENTUNABLE = -12
} CmfsStatus;
Other data structures are described with the procedures that reference them.
A.1 Object Manipulation
A.1.1 CMFS Storage API
Creation of presentation objects must be performed by a client application. When the client makes the decision to save an object to the CMFS, the following interface is provided:
CmfsStatus CmfsCreate( u_long dispUnits, u_long timeVal, u_long length, u_long *cid, UOI *uoi )
This procedure creates a new presentation object at the server. The parameters dispUnits and timeVal provide a context in which to interpret time values. These are to be interpreted as a ratio where dispUnits is specifically the number of presentation units and timeVal is the length of time (in milliseconds) that this number of units comprises. For example, this may be given as frames per second for video (e.g. 30/1000), sampling rate for audio (e.g. 8000/1000 for 8 kHz), frame length for MPEG audio (1/24 for 24-millisecond frames), or viewgraphs (or captioned text) per second (e.g. 1/1000). Length is the approximate total length (in bytes) of the presentation object to be stored. The CMFS needs to be able to determine if it has the resources available to store the object at the time requested. If it is possible to store the presentation object, the connection on which to transmit the data (cid) and the uoi of the object are returned as result parameters along with the status STREAMOK. Otherwise, an error status is returned, indicating the type of error. To store a sequence of an object in the CMFS, the following call is provided:
CmfsStatus CmfsWrite ( u_long cid, u_long length, char *buffer, int units, u_long sizes[] )
This call provides the method of storing the data for an object on the CMFS. The length parameter (specified in bytes) and the buffer parameter refer to the actual data being sent. The data written in one call to CmfsWrite is defined as a sequence. These sequences are defined as small units of continuous media data (typically up to one second's worth of data), primarily to accommodate the implementation of displaying in various modes (fast forward, rewind, and slow-motion). Sequence boundaries are also points at which the retrieval process can begin. They can be used to mark the beginning of a scene, or other related logical division of the object. Thus, several calls to CmfsWrite would be made during the creation of a particular presentation object. The units parameter refers to units of time which are compatible with the dispUnits parameter from CmfsCreate. For example, if the sequence consists of 2 1/3 seconds of video and the dispUnits parameter had previously been set to 30, then the units parameter would be 70. The sizes parameter is an array containing the size in bytes of each display unit that is being written. The server requires this information so that it can select the proper data blocks which are required for retrieval during a particular time interval. NOTE: It is expected that software at the application will be provided to convert a stream (possibly encoded) into sequences that would be stored by the CMFS. This would be different for each media type and would provide sequences with appropriate characteristics for storage.
CmfsStatus CmfsComplete( u_long cid )
This call indicates that the object has been completely written to the CMFS and the connection is closed.
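As an illustration only, the following sketch shows how a client might store a ten-second, 30-frames-per-second clip with these calls. The header name cmfs.h, the total length estimate, and the fill_sequence encoder glue are hypothetical, and error handling is reduced to a single check.

#include "cmfs.h"   /* hypothetical header name for the CMFS interface described above */

/* hypothetical encoder glue: fills buf with one second of encoded video,
 * records the size of each of the 30 display units in sizes[], and returns
 * the total number of bytes placed in buf */
extern u_long fill_sequence(char *buf, u_long sizes[]);

void store_clip(void)
{
    u_long cid;
    UOI    uoi;
    u_long sizes[30];
    static char buf[512 * 1024];
    int    sec;

    /* 30 display units per 1000 ms; roughly 15 MB in total (an estimate) */
    if (CmfsCreate(30, 1000, 15000000, &cid, &uoi) != STREAMOK)
        return;

    for (sec = 0; sec < 10; sec++) {            /* one one-second sequence per call */
        u_long len = fill_sequence(buf, sizes);
        CmfsWrite(cid, len, buf, 30, sizes);    /* 30 units of time per sequence */
    }
    CmfsComplete(cid);                          /* object fully written; connection closed */
}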
A.1.2 Moving and Deleting
CmfsStatus CmfsRemove ( UOI uoi )
This call allows an application to remove a presentation object from the server. CmfsStatus CmfsMigrate ( UOI uoi )
The details of this call and its functionality are provided in [46]. CmfsStatus CmfsReplicate ( UOI uoi )
The details of this call and its functionality are provided in [46].
A.2 Stream Delivery and Connection Management
Each interface call returns a status code. The interface is as follows:
CmfsStatus CmfsInit( u_long ipAddr, u_long port )
This procedure takes the address and port number of the administrator as arguments and initializes any client-wide data structures. The client initially establishes contact with the CMFS via the CmfsOpen procedure. The parameters passed by the client to CmfsOpen include the UOI for the object.
CmfsStatus CmfsOpen( UOI uoi, int (*callBack)(SD *sd), RttTimeValue *prepareBound,
                     u_long *cid, u_int ipAddr, u_int port )
typedef struct StreamDescriptor {
    u_long init;        /* initial buffer reservation request */
    u_long avgBitRate;
    u_long avgPeriod;   /* time over which avg bit rate is calculated */
    u_long maxBitRate;
} SD;
CmfsOpen does not cause transmission of the presentation data, nor scheduling of disk activity. This procedure only establishes a real-time network connection to the server and verifies that the presentation object exists. Control does not return from CmfsOpen until a connection is established (or refused). Part of the establishment of the connection is to determine an initial buffer size that the client must have in order for the data transmission protocol of stream data to proceed properly, i.e. that the server is able to deliver the data across the network in time. The parameter callBack is a pointer to a client-application supplied function and is invoked when the client receives the network connect request from the server. A stream descriptor is passed to callBack which is used to convey quality of service parameters for this stream. The fields are computed at the server and/or client as appropriate. The first value, init, is the amount of buffer space that must be allocated for the connection to be able to support a prepare request. This is server-determined, based on the network latency and the maximum bandwidth of the stream. The major task of callBack is to ensure that the client has the required resources to accommodate any subsequent CmfsPrepare request and that it informs the server of the amount of buffer space that it is willing to allocate to this connection. In order to refuse the entire connection, callBack should return the value CLIENTREFUSED. Otherwise, STREAMOK should be returned and the contents of the stream descriptor structure are returned to CmfsOpen. The value in init is used as the amount of buffer reservation in the client. This value is (possibly) modified and passed back to the server so the server can perform send ahead. A client application needs to be aware of the possible configurations in which this connection could be prepared in the future, as these impact the amount of buffer space that must be allocated. The new value of init must consider the buffer space needed in fast motion mode as more data is transmitted per time unit. As well, if the client allows for delay in initial reading of the stream (via delayTime in CmfsPrepare), the buffer space needed for that amount of time must be included. CmfsOpen takes care of cleaning
up the connection details at the server, in case of a failure of any kind. The parameter prepareBound is returned from the server to the client. It specifies the upper bound on the amount of time that a call to CmfsPrepare may take. This allows a client to determine if multiple presentation objects can be opened in sufficient synchronization with each other for an effective presentation to the user. The parameters ipAddr and port identify the machine and port number (essentially, a process on the client machine which is to listen for continuous media data to be sent from the server) to which a real-time data connection is to be established. This allows a different network interface to be used for the control and the data connections. The server initiates the establishment of this connection, which is unreliable in both directions. The following two calls are provided for the convenience of client applications wishing to have separate processes (perhaps on separate processors) perform the control functions and the real-time data transfer operations.
CmfsStatus CmfsProxyOpen(UOI uoi, RttTimeValue *prepareBound, u_long *cid,
                         u_int ipAddr, u_int port, RttThreadId clientPid);
This call performs all the work of CmfsOpen except that which deals with the setting up of the real-time connection at the client. Note that the actual thread identifier (clientPid) of the real-time client must be communicated to this interface call in order for the data connection to be properly established. This must be accomplished by higher-level software.
CmfsStatus CmfsListen (u_long *cid, int (*callBack)(SD *sd), u_int ipAddr, u_int port);
CmfsListen establishes a transport-level connection for the stream that was opened by CmfsProxyOpen. All the parameters have the same semantics as described in CmfsOpen.
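The following sketch, offered only as an illustration, shows one way a client might combine CmfsInit, a callBack function, and CmfsOpen. The header name cmfs.h and the CLIENT_BUFFER budget are hypothetical; a real client would derive the buffer amount from its playback configuration as discussed above.

#include "cmfs.h"   /* hypothetical header name */

#define CLIENT_BUFFER (2 * 1024 * 1024)   /* assumed client-side buffer budget (bytes) */

/* Invoked when the server's network connect request arrives.  Refuse the
 * connection if the server-computed initial reservation exceeds the budget;
 * otherwise report how much buffer space the client is willing to commit,
 * which the server can use for send-ahead. */
static int accept_stream(SD *sd)
{
    if (sd->init > CLIENT_BUFFER)
        return CLIENTREFUSED;
    sd->init = CLIENT_BUFFER;
    return STREAMOK;
}

static CmfsStatus open_stream(u_long adminAddr, u_long adminPort, UOI uoi,
                              u_int dataAddr, u_int dataPort,
                              u_long *cid, RttTimeValue *prepareBound)
{
    CmfsStatus st = CmfsInit(adminAddr, adminPort);   /* once per client */
    if (st != STREAMOK)
        return st;
    return CmfsOpen(uoi, accept_stream, prepareBound, cid, dataAddr, dataPort);
}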
The client may terminate this connection at any time by issuing the following close request. CmfsStatus CmfsClose( u_long cid )
CmfsClose takes a single parameter - the connection id (cid) - and closes the session. If the stream is not in the stopped state, the CMFS stops the connection data transfer before performing the close.
A.2.1 Stream Control
Once a connection has been opened, the client must request that the server prepare the stream for data transfer. This involves determining if the server has sufficient resources available to display the portion of the stream requested at the moment. The request to provide real-time delivery of the media is made with CmfsPrepare.
CmfsStatus CmfsPrepare( u_long cid, pos startPos, pos stopPos, u_int speed, u_int skip, RttTimeValue delayTime )
This routine returns a status code indicating success if the server can deliver the data requested by the client with the required quality of service specified by the speed and skip parameters. It also specifies the maximum amount of time (delayTime) that a client can delay reading data from the connection without any (implicit) adverse action being taken by the CMFS. Bandwidth is reserved at the server and data is delivered across the network into client buffers for impending calls to CmfsRead. If the delay in the initial call to CmfsRead exceeds delayTime seconds, the connection will be terminated. This may happen at the application or the network level of the server and calls to CmfsRead will indicate this data loss. The server continues to send data at the prescribed rate, and if the client does not read quickly enough, additional data may be dropped. The startPos and stopPos parameters are of the opaque datatype pos, and are interpreted as offsets into the stream. If startPos is greater than stopPos, the
display of the video is in rewind mode. The values of these position parameters correspond to places where the data transfer can start (i.e. sequence boundaries, c.f. CmfsWrite). Otherwise, CmfsPrepare will fail. To display an entire object, the constants STARTOFSTREAM and ENDOFSTREAM are provided. The parameter speed indicates at what rate the client wishes to retrieve the stream, as a percentage of normal speed. For example, a value of 100 indicates normal speed (i.e. the same time as recording speed), whereas 50 would indicate slow retrieval at half the display rate of normal and 200 would be fast retrieval at twice the recorded rate. Because this may affect the network bandwidth required, it may result in some parameters of the network connection being altered. A client requesting data to be delivered at a speed of 50 and displaying at normal speed (i.e. 100) will most certainly starve. The parameter skip tells the CMFS how many sequences to skip when retrieving the data from the disk. This would allow for efficient implementation of fast-motion display. A value of 0 means that no data is to be skipped. A value of 1 means that one sequence is to be skipped for every sequence sent. A value of 2 indicates 2 skipped sequences for every sequence sent. Before CmfsPrepare returns control to the client application, sufficient data is sent over the network connection previously established via CmfsOpen so that CmfsRead operations (see below) will not be delayed. The amount of data that is initially sent is defined during CmfsOpen. The parameter delayTime is given as an RttTimeValue and is the maximum amount of delay the client can cause by postponing the initial call to CmfsRead. The reason this is necessary is that per-stream buffer memory will be reserved at the server during CmfsPrepare. If these buffers accumulate beyond a threshold value during playback, some action must be taken at the server. If that read is not issued before delayTime from the return of CmfsPrepare, then the connection should be terminated.
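As an illustrative sketch, the two CmfsPrepare requests below ask for normal-speed playback of the whole object and for double-speed fast motion. The header name cmfs.h and the seconds_to_rtt helper for building an RttTimeValue are hypothetical, and the two-second delay bound is an arbitrary choice.

#include "cmfs.h"   /* hypothetical header name */

extern RttTimeValue seconds_to_rtt(int seconds);   /* hypothetical conversion helper */

/* cid comes from a successful CmfsOpen */
static CmfsStatus prepare_playback(u_long cid, int fast_motion)
{
    RttTimeValue delay = seconds_to_rtt(2);   /* allow 2 s before the first CmfsRead */

    if (fast_motion)
        /* skip every other sequence and deliver at twice the recorded rate */
        return CmfsPrepare(cid, STARTOFSTREAM, ENDOFSTREAM, 200, 1, delay);

    /* entire object, recorded speed, nothing skipped */
    return CmfsPrepare(cid, STARTOFSTREAM, ENDOFSTREAM, 100, 0, delay);
}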
Once the presentation object has been "readied" via the CmfsOpen and CmfsPrepare requests, the client can issue the CmfsRead request to obtain the stream data.
CmfsStatus CmfsRead( u_long cid, void **buffer, int *length )
This procedure returns a pointer to the data read from the connection. If the return status is ENDOFDATA, then there is no more data on the prepared stream. The caller of this procedure is responsible for freeing the returned buffer. This can be done via CmfsFree (see below). If data is lost by the network, a return value of DATALOST is given, and the length parameter is set to the length of the missing data. The connection will be aborted by the server if the client does not issue a sufficient number of CmfsRead operations quickly enough (as given by the data rate values in the StreamDescriptor parameter in CmfsOpen). The rate of data reading must keep up with the number and sizes of the display units requested in the CmfsPrepare call. The client also has to perform the reads of data so that the order and timeliness of the data makes sense to the presentation application. Because of the transport-layer implementation of buffer allocations, the buffer that is returned in a call to CmfsRead must be freed in accordance with the allocation. This is accomplished by the call:
CmfsStatus CmfsFree (char *buffer)
so that the details of this mechanism are invisible to the client application. The delivery of the continuous media stream to the client can be terminated at any time by the following call: CmfsStatus CmfsStop( u_long cid )
Once control returns from CmfsStop, any subsequent calls to CmfsRead on that connection will return the status value ENDOFDATA.
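Putting the delivery calls together, a client that has already opened and prepared a stream might drain it as sketched below. This is an illustration only; display_data stands in for the application's presentation code and is hypothetical, as is the header name cmfs.h.

#include "cmfs.h"   /* hypothetical header name */

extern void display_data(char *data, int length);   /* hypothetical presentation routine */

static void play_stream(u_long cid)
{
    void *data;
    int   length;
    CmfsStatus st;

    for (;;) {
        st = CmfsRead(cid, &data, &length);
        if (st == ENDOFDATA)
            break;                    /* prepared portion fully delivered */
        if (st == DATALOST) {
            /* length gives the size of the missing data; carry on with the stream */
            continue;
        }
        if (st != STREAMOK)
            break;                    /* treat any other status as fatal in this sketch */
        display_data((char *)data, length);
        CmfsFree((char *)data);       /* every buffer returned by CmfsRead must be freed */
    }
    CmfsClose(cid);                   /* CmfsClose stops the transfer first if still active */
}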
A.3 Meta Data Management
The CMFS needs to store some of its own meta-information about the stream. The following interface allows attributes about a presentation to be stored and retrieved. There is a set of system-defined attributes for every object that the CMFS needs for its own internal operation. Additionally, client applications can define their own attributes. The system-defined attributes are accessible to the application, but as read-only.
CmfsStatus CmfsPutAttr ( UOI uoi, CmfsDatum attrKey, CmfsDatum value )
typedef struct { u_int length; u_char *key; } CmfsDatum;
This procedure inserts the value into the list of attribute-value pairs for the given object. Correspondingly, an interface to retrieve the value of the attributes is provided. CmfsStatus CmfsGetAttr ( UOI uoi, CmfsDatum attrKey, CmfsDatum *value )
Any application that desires more than one of these attributes in a given call must provide a wrapper function to do so. CmfsStatus CmfsListAllUOIs ( CmfsDatum *uoiValue )
This call allows a user to get a list of the objects which are stored at the particular administrator to which a client is connected. One attribute of every UOI is the list of all the attributes that have been stored for that UOI. This allows a client application to get a detailed listing of all information on an object. Because attribute keys and values are arbitrary bit strings, a client may or may not be able to intelligently decipher the meaning of these attributes.
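As a small illustration, the fragment below stores one application-defined attribute and reads it back. The "title" key and its value are arbitrary examples, and cmfs.h is a hypothetical header name.

#include "cmfs.h"     /* hypothetical header name */
#include <string.h>

static void tag_object(UOI uoi)
{
    CmfsDatum key, value, fetched;

    key.key      = (u_char *)"title";            /* application-defined attribute key */
    key.length   = (u_int)strlen("title");
    value.key    = (u_char *)"Highlight Reel";   /* values are arbitrary bit strings */
    value.length = (u_int)strlen("Highlight Reel");

    if (CmfsPutAttr(uoi, key, value) != STREAMOK)
        return;

    if (CmfsGetAttr(uoi, key, &fetched) != STREAMOK)
        return;
    /* fetched.key now points at the stored value; fetched.length gives its size */
}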
A.4 Directory Service
A.5 Miscellaneous
A.5.1 Conversions and Stream Display Information
This call determines the position in the stream that corresponds to the given time value. This time value is the real time that has transpired since the first read operation took place. It will be close to indicating the exact amount of data that has been displayed by the client.
CmfsStatus CmfsTime2Pos(u_long cid, RttTimeValue time, pos *position, RttTimeValue *offset)
CmfsTime2Pos returns the position of the particular point in time of the stream identified by the cid. Note that it is impossible to make this call for an arbitrary uoi; it can only be used on an opened stream. The returned position value is the nearest (previous in time) valid starting position in the stream. The parameter offset indicates the difference in time between the actual position calculated and the one returned. The position which is returned is influenced by the parameters to the previous CmfsPrepare request. In particular, the start position, the speed, and the skip value determine how real-time display time correlates to movement in the stream itself. For example, if the stream display started at position 12, with speed equal to 100 and skip equal to 1, then the position returned would be 12 greater than the amount of time the client had been displaying due to not starting at the beginning, and it would also be greater due to the fact that every other sequence had been skipped. The effect of the skip value would depend on the actual sizes of
the sequences which had been written. If there has been no CmfsPrepare request, this procedure assumes that the time referred to is the time from the beginning of the stream at speed = 100 and skip = 0.
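A short sketch of the conversion follows; it maps one minute of elapsed display time back to the nearest valid starting position. The header name cmfs.h and the seconds_to_rtt helper for building an RttTimeValue are hypothetical.

#include "cmfs.h"     /* hypothetical header name */

extern RttTimeValue seconds_to_rtt(int seconds);   /* hypothetical conversion helper */

/* Returns the nearest (previous) valid starting position for 60 seconds of
 * display time on the opened stream identified by cid; offset reports how far
 * before the exact point that position lies. */
static CmfsStatus position_at_one_minute(u_long cid, pos *where, RttTimeValue *offset)
{
    return CmfsTime2Pos(cid, seconds_to_rtt(60), where, offset);
}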
Appendix B
Stream Scenarios
B.1 Stream Groupings
This section of the appendix shows which streams were grouped together for the experiments in Chapters 4 and 5. The first grouping is for the per-disk tests in Chapter 4 and is shown in Table B.1.
First Group: bloop93, bengals, chases, rescue, si-intro, maproom, coaches, boxing, aretha
Low Variability: aretha, coaches, rescue, Joe Greene, Ray Charles, FBI Men, Plan-9, Country Music, Fires, Tom Connors, Clinton
High Variability: football, yes30, maproom, bloop93, laettner, snowstorm, twins, tenchi30E, dallas, akira30E, John Elway
Short Playback: trailers, catinhat, tomconnors, country music, elway, clinton, twins, trailers, dallas, montreal Canadiens, FBI Men
Long Playback: leia, raiders, evac, football, aretha, yes30, coaches, maproom, football, leia, yes30

Table B.1: Disk Admission Control Stream Groupings

The long streams had repeat selection of streams, although they were not stored on the disk more than once. This is because there was not enough disk capacity to store a stream more than once. Since the experiment in which this grouping was used investigated the effect of stagger and client buffer space, it is reasonable to use a stream more than once, since the two requests are offset in time. The next three tables show the pseudo disk-configurations used for the network admission control tests. The first 143 scenarios used streams in the same relative position on each disk (i.e. streams 0, 3, 4, and 9), as shown in Table B.6. The last 50 scenarios were selected afterwards and were comprised of a different scenario from each disk, providing a different cumulative load to the network.
Disk 1: n deception, aretha, bloop93, chases, snowstorm, country music, iw, Tenchi30E, Dallas, Mr. White, John Major
Disk 2: evac, football, boxing, rescue, twins, fires, Tom Connors, Clinton, Montreal Canadiens, George of Jungle, Bengals
Disk 3: leia, raiders, coaches, spinal-tap, Joe Greene, FBI men, Annie Hall, basketball, trailers, Summit Series, CatInHat
Disk 4: deathstar, yes30, maproom, laettner, Ray Charles, Plan-9, John Major, Akira30E, John Elway, Criswell, Green Eggs

Table B.2: Network Admission Control Stream Groupings - MIXED
B.2 Scenario Selection
B.2.1 Algorithm Comparison
For the set of tests in Section 4.4, a selection of streams was made in a random fashion, selecting streams in order from longest to shortest. Most of the scenarios were instantaneous arrivals. Table B.5 shows the scenarios for the first tests which compare the various algorithms (Sections 4.4.2, 4.4.3, 4.4.4, and 4.4.5).
Disk 1: leia, due south, coaches, eric, beach boys, FBI Men, Annie Hall, Buddy Holly, Joe Greene, joriarty, Kinks
Disk 2: olympics, fender, beatles, tomconnors, Ray Charles, Plan-9, John Major, Buddy Holly, John Major, Tom Connors, coaches
Disk 3: deathstar, aretha, joe greene, fender, Kinks, country music, Plan-9, Beatles, Tenchi30E, Ray Charles, John Major
Disk 4: aretha, moriarty, rescue, rescue, deathstar, fires, Tom Connors, Clinton, moriarty, Leia, Bengals

Table B.3: Network Admission Control Stream Groupings - LOW
Disk 1: rivals, arrow, hilites, pink floyd, ads, cars, jays, basketball, trailers, summit series, iw
Disk 2: x-files, raiders, maproom, laettner, bloop93, twins, dallas, akira30E, John Elway, criswell, baseball
Disk 3: maproom, football, bloop93, snowstorm, si-intro, maproom, iw, dallas, chicken, criswell, laettner
Disk 4: yes24, football, si-intro, pink floyd, twins, arrow, John Elway, criswell, jays, hilites, baseball

Table B.4: Network Admission Control Stream Groupings - HIGH
Table B.5: Stream Selection into Scenarios (First Tests). [The original table lists scenarios 1 through 250, each with a stagger value of 0, 5, or 10 and an X marking each of the streams S0 through S8 included in that scenario; the alignment of the X entries could not be recovered from the extracted text.]
B.2.2 All Remaining Comparisons
For the remaining comparisons on the vbrSim algorithm, the following selections of streams into scenarios were performed. There are 143 scenarios, selected as described in Section 4.4.5. Only a small number of scenarios were used in some of the tests, such as those examining client buffer space effects and inter-request arrival times.
Table B.6: Stream Selection into Scenarios (Remaining Tests). [The original table lists scenarios 1 through 143, each with an X marking the streams S0 through S10 included in that scenario; the alignment of the X entries could not be recovered from the extracted text.]

For the network tests on stream variability, an additional 50 scenarios were created by combining individual disk scenarios which were accepted by each disk. The twenty largest disk requests accepted by each disk for the mixed variability streams were selected and combined in various manners to get network scenarios which maximized the stress on the network admission control algorithm. This selection did not necessarily provide the most aggressive scenarios for the low variability and high variability streams, but it did give more data points in total. These are shown in Tables B.7 and B.8.
Table B.7: Selection of Extra Scenarios: First 22 streams. [The original table lists scenarios 144 through 193, each with an X marking which of the streams S0 through S21 it includes; the alignment of the X entries could not be recovered from the extracted text.]
Scen.
S22
S23
144
X
X
145
X
X
146
X
X
147
X
X
148
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
149 150
S24
S25
S26
X
X
X
S27
153
X
156
X
X
157
X
X
158
X
159
161
X X
X X
162
X
X
X
X
X
X X
X
X X
X
247
X X
X
165
X
X
166
X
X
167
X
X
168
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
169 170
X X
X
172 173
X
X
X
X
X
X
175
X
X
176
X
X
177
X
X
178
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
180
X
X
X
X
X
X X
181 182 183
X
X
X
X
185
X
X
186
X
X
187
X
X
188
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
191 192 193
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X X
X
X X X
X X
X X X
X X X
X X
X
X
X
X
X X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X X
X
X
X
X
X
X X X
X X
X X X
X
X X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X X
X
X
X X X
X X X
Table B.8: Selection of Extra Scenarios: Last 22 streams
X X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X X
X
X
X
X X
X
X
X
X
X
X
X
X
X
X
X X
X
X
X
X
190
X
X
X
X
189
X
X
X
184
X
X
X
X X
X X
X
X
S43
X
X
X X
X
X
X
X
X X
X
X
X
179
X
X
X
X
X
X
X
S42
X
X
X
X
S41
X
X
X
X
174
X
S40
X
X X
S39
X
X
X
X
S38
X
X
X
X X
171
X X
S37
X
X
X
S36
X
X X
X
164
X
S35
X
X X
X
163
X
S34
X X
X
X X
X X
X
X
S33
X
X
X
X
S32
X
X
X
S31
X
X
X
160
S30
X
154 X
X X
X
152
155
S29
X
X X
151
S28
X
X
X X
X
X
X
X
X