Design Issues in High Performance Media-on-Demand Servers

Divyesh Jadav and Alok Choudhary
Department of Electrical and Computer Engineering and CASE Center
Syracuse University, Syracuse, NY 13244
1 Introduction

Digitalization of traditionally analog data such as video and audio, and the feasibility of obtaining network bandwidths above the gigabit-per-second range, are two important advances that have made possible the realization, in the near future, of interactive distributed multimedia systems. The advent of multimedia concepts has sparked a flurry of activity in all fields of life, be it research in the academic community, commercialization in industry, legislation on the part of the government, or expectations of consumers. What are the reasons for the ubiquitous nature of this rapidly evolving field of computing? The single most important reason is the tremendous potential of this technology. Multimedia systems are poised to change the lifestyle of society. Ideas such as ordering pizzas over the computer, classrooms in the home and entertainment on demand are bound to affect each and every one of us.

Amongst all the applications of multimedia technology, none has generated more excitement than the prospect of on-demand services. This refers to the ability of the user to receive service from remote resources interactively, and at his/her own pace. The quintessential example of this range of applications is video-on-demand: the ability of a consumer to order and view movies of his/her choice, at his/her convenience, all from his/her home using a remote control device. In other words, the consumer has at his/her disposal a virtual VCR with all the traditional functions of fast-forward, pause, play, etc., with the difference that there is no need to commute to the video rental store to get the cassette. This scenario, though achievable, remains an abstract concept at present. However, steps have been taken towards its realization due to recent advances in networking, processing and storage. Nevertheless, large scale deployment of interactive multimedia services is a complex problem. One of the reasons for this difficulty is the inherent complexity implied by the very nature of multimedia data.

This work is supported by Intel Corporation, NSF Young Investigator Award CCR-9357840 and the New York State Center for Advanced Technology in Computer Applications and Software Engineering (CASE) at Syracuse University. The authors thank the Caltech CCSF facilities for providing access to the Intel Paragon.
ATM    Asynchronous Transfer Mode
CATV   Common Antenna Television
Codec  coder/decoder
DQDB   Distributed Queue Dual Bus
EDF    earliest deadline first
FCFS   first-come, first-served
FDDI   Fiber Distributed Data Interface
GUI    Graphical User Interface
HDTV   High Definition Television
I/O    input/output
IPI    Intelligent Peripheral Interface
JPEG   Joint Photographic Experts Group
Mb/s   megabits per second
MOD    media-on-demand
MPEG   Moving Pictures Expert Group
MPP    Massively Parallel Processor
NRT    non real-time
NTSC   National Television Systems Committee
PC     personal computer
QOS    quality of service
RISC   Reduced Instruction Set Computer
RT     real-time
SCSI   Small Computer System Interface
VOD    video-on-demand
WAN    wide area network
WWW    World-Wide Web

Table 1: Acronyms
2 Characteristics of Multimedia Data

The reason why multimedia data processing is difficult is that such data differs markedly from the unimedia data (text) that conventional computers are built to handle [RaV92]:

Multiple data streams: A multimedia object can consist of text, audio, video and image data. These data types have very different storage space and retrieval rate requirements. The design choices include storing data of the same type together, or storing data belonging to the same object together. In either case, multimedia data adds a whole new dimension to the mechanisms used to store, retrieve and manipulate the data.

Real-time retrieval requirements: Video and audio data are characterized by the fact that they must be presented to the user, and hence retrieved and transported, in real-time. In addition, compound objects (objects consisting of more than one media type) usually require two or more data types to be synchronized as the object is played out.

Large data size: The size of a typical video or audio object is much larger than that of a typical text object. For example, a two hour movie stored in MPEG-1 [Gal91] format requires over 1 gigabyte of storage. Hence, not only must retrieval and transportation mechanisms be fast, they should also have large storage capacity and transfer bandwidth, respectively.
These features of multimedia data strain the capacities of technology that was designed to handle unimedia data. Consequently, new approaches and techniques need to be developed in all three areas of computing: storage, processing and communications. A factor that exacerbates the difficulties imposed by the nature of multimedia data is the relative infancy of the field of multimedia computing. The rate at which companies are jumping onto the multimedia bandwagon has resulted in a plethora of different formats and standards, which only adds to the confusion. Nevertheless, encouraging steps have been taken towards standardization, and some measure of order is being brought about. For instance, the realization that the best way to cope with the large size of video and image data is to compress such data at the source, and decompress it at the destination, has resulted in compression standards for each of these data types. The JPEG (Joint Photographic Experts Group) standard is used for compressing images, while the MPEG (Moving Pictures Experts Group) standard lays down the rules for video compression. It has also become clear that ATM (Asynchronous Transfer Mode) will be the protocol of choice for large-scale distributed multimedia networks. Significant progress toward the definition of a digital HDTV (High Definition Television) standard has been made. Similarly, the WWW (World Wide Web) has considerably eased the problem of Internet navigation. More standards, however, are required in the areas of storage mechanisms, scripting languages, GUIs (Graphical User Interfaces) and database representations and manipulations for multimedia data.
3 Applications

Interactive multimedia applications can be broadly classified under two categories, persistent and non-persistent, where the word persistent refers to the nature of the data at the source and/or destination. An example of the former type would be movies-on-demand, and an example of the latter type is videoconferencing. This paper addresses applications of only the former type. Interactive services present the consumer with tremendous flexibility. The main reasons in support of interactive services such as entertainment-on-demand are:

* The user can choose what to view, instead of being restricted to choosing from the programs being shown on a few channels. Moreover, the user can avoid being forced to watch endless commercials.
* The ability to control when to view. This obviates the need to tailor the viewing schedule around the broadcasting schedule.
* Elimination of the need to commute, and of the tyranny of distance, to make use of a service. This has tremendous applications for shopping and education.
* The user can proceed at his/her own pace. For instance, the availability of digitized lectures makes it possible for a student to review difficult topics and skip simple ones.

Interactive multimedia systems span a gamut of types, depending on the degree of interactivity allowed. [Gel+91] classifies interactive services into broadcast, pay-per-view, quasi VOD, near VOD and true VOD, in ascending order of interactivity. Although movies-on-demand is the most visible interactive multimedia application, it is by no means the only one. Table 2 shows the important projected uses of interactive multimedia technology.
4 Hardware Requirements

The entities involved in a distributed multimedia environment can be classified under the categories of consumer, storage provider, content provider and network provider [RaR94].
Entertainment-on-demand: Ability to view multimedia presentations with VCR-like controls for leisure and entertainment. Includes: * Movies-on-demand * News-on-demand * Interactive video games
Home Shopping: Ability to browse catalogs, select, order and pay for merchandise interactively
Distance Learning: Ability to undergo self-education by designing the content and pace of 'lectures'
Digital Libraries: Ability to browse, read and 'check out' (download) all conventional library materials in digital form, interactively
Home Office: Ability to upload and download phone messages, faxes, memos, files and folders from home
Table 2: Example Applications of Interactive Multimedia Technology

This section gives an overview of the hardware requirements for the network provider, consumer and storage provider. The next two sections give a detailed discussion of requirements and implementation issues for storage provider hardware. The peculiar nature of multimedia data necessitates the development and deployment of specialized hardware for storage, transport and presentation.

The first requirement for a fully operational large-scale distributed multimedia environment is the presence of a high-speed wide area network. The growing popularity of the Internet suggests that it could be used as the backbone of the information superhighway. Although the infrastructure is already in place and is rapidly gaining international acceptance (the Internet grew by 81% in 1994, to some 3.5 million hosts in 154 countries [Low95]), it is woefully inadequate to handle the high volume of multimedia traffic. Some of the network protocols that have been considered as candidates for carrying multimedia data include the 100 Mb/s Ethernet standard, DQDB (Distributed Queue Dual Bus), FDDI (Fiber Distributed Data Interface) and ATM. The first three have bandwidths of the order of 100 Mb/s. Although this is an improvement over the 10 Mb/s of Ethernet, it is still inadequate for high volume multimedia data. ATM is rapidly emerging as the front runner [Lan94].

Given that an ATM network will form the backbone of an interactive media-on-demand system, the data will be stored in strategically located servers. Figure 1 shows a hierarchically organized MOD system. At the top of the hierarchy is the consumer. The consumer's interface device is connected to a medium capacity neighbourhood server. Programs requested by clients are downloaded from a more powerful metropolitan server to the neighbourhood server. The metropolitan server may store hundreds of titles. If a requested title is not available at the metropolitan server, then it has to be fetched from a remote server in another city. The remote server could be a similar (metropolitan) server, or a central archive storing thousands of titles. The cost of retrieving multimedia streams as seen by the consumer is directly proportional to the depth of the hierarchy at which the stream data is located. Like a memory hierarchy in a computer, this hierarchy of networked servers caches the requested data closest to the consumer of the data, thus minimizing costly fetches from storage agents lower down the hierarchy. It is to be noted that there is a growing consensus that the data must be stored in compressed form by the storage provider for on-demand services to be feasible.
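This hierarchy behaves much like a multi-level cache. The sketch below is purely illustrative: the level objects and their has/open/cache methods are assumed interfaces, not part of any real system, and simply show a request being satisfied at the closest level that holds the title.

```python
def locate_title(title, neighborhood, metropolitan, remote_archive):
    """Serve a title from the closest level of the server hierarchy that
    holds it, caching it at the levels above on the way back (illustrative)."""
    for level in (neighborhood, metropolitan, remote_archive):
        if level.has(title):
            stream = level.open(title)
            # Cache closer to the consumer to avoid a costly remote fetch next time.
            if level is metropolitan:
                neighborhood.cache(title)
            elif level is remote_archive:
                metropolitan.cache(title)
                neighborhood.cache(title)
            return stream
    raise KeyError(title + " is not available anywhere in the hierarchy")
```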
Figure 1: Block diagram of a hierarchical MOD system

Consequently, some sophisticated equipment is required at the client site (see figure 2). The decompression of the data is done at the client's multimedia terminal, which is an intelligent computer with hardware such as a microphone, a high-resolution graphics display, stereo speakers and a sophisticated cable decoder. The cable decoder is the interface to the high-speed wide-area network. It has tens of kilobytes of buffer space and compression and decompression hardware built into it [Per94]. In general, a single user session may consist of multiple media streams. Streaming restores the temporal relationship among the elements of a single stream before delivery to a multimedia device. Synchronization restores the temporal relationship among multiple streams. Accordingly, the cable decoder is connected via buffers to stream handlers, and the entire unit operates under the control of a synchronizing/streaming manager. The stream handlers consist of devices such as codecs and associated software. The decompressed, streamed and synchronized data is played out on the multimedia output devices. The user generates feedback in the form of interrupts, requests and terminations via an interface control such as the familiar TV remote controller.
5 Multimedia Servers and High Performance Computing

We now focus on the hardware of the storage provider, i.e., the multimedia server. While a workstation-class server can support a few tens of simultaneous multimedia streams, a more powerful machine is required for use as a neighbourhood or metropolitan server (figure 1). The important requirements of such a server can be delineated as follows:
High Storage capacity
A two hour long MPEG-1 encoded movie requires about 1 gigabyte of storage space. MPEG-1 requires a minimum playback rate of 1.5 Mb/s. The newer MPEG-2 standard, which will be used for most NTSC-type full motion video applications, requires a minimum playback rate of 4 Mb/s. Thus, a two hour long MPEG-2 encoded movie will require nearly 3.5 gigabytes of storage space. The storage space for compressed HDTV-quality video will be even higher. Thus, the server should have at least hundreds of gigabytes of storage space.
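These figures follow directly from playback rate times duration. The short calculation below (illustrative only, using decimal units) reproduces the numbers quoted above.

```python
def storage_gb(playback_mbps, hours):
    """Approximate storage for a compressed stream: playback rate x duration."""
    megabits = playback_mbps * hours * 3600
    return megabits / 8 / 1000  # megabits -> megabytes -> gigabytes

print(storage_gb(1.5, 2))  # MPEG-1: about 1.35 GB for a two hour movie
print(storage_gb(4.0, 2))  # MPEG-2: about 3.6 GB for a two hour movie
```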
Figure 2: Block diagram of hardware required at user site
Large number of titles
Ability to sustain maximum number of streams
Minimum Response Time
QOS (Quality of Service) requirements
Cost Effectiveness
Exploit user access patterns
If a neighbourhood server is used by 100 homes, each requiring independent connectivity, then it should simultaneously store at least 100 different titles. If a server stores only a few titles, then it may have to incur costly downloading from remote servers. Moreover, the response time seen by the consumer will increase accordingly.

The larger the maximum number of streams that a server can simultaneously source, the more clients it can serve. Since various caching optimizations can be employed when multiple clients request the same data, the metric of importance is the number of different streams that the server can simultaneously source.

A crucial factor that will determine the success of on-demand services is the response time seen by clients. While a client may be willing to tolerate a long response time during set up, this may not be the case when a client restarts a paused stream. Moreover, the difference between average and worst-case response times should be low.

An important parameter that affects the usage of server resources is the QOS requirements of clients. For example, a family viewing a feature film may not mind if an occasional frame is dropped. On the other hand, a medical student viewing a recording of an open heart surgery would demand absolute fidelity of playback. The server should be able to provide and adapt itself to myriad QOS requirements.

Cost effectiveness is another factor that will govern the blooming or bludgeoning of interactive multimedia services. The goal is to hold installation charges to a few hundred dollars per customer, and monthly service charges in the vicinity of CATV (common antenna television) services.

The server must be able to trap and exploit dynamic user behavior. For example, if ten clients request the same movie in a span of 3 minutes, then it will be prohibitively expensive to start ten different streams for them. This issue is discussed further in the next section.
Ability to handle RT as well as NRT traffic
Reliability and Availability
Fast signal processing capability
While the primary function of a high performance multimedia server is to serve multiple real-time data streams simultaneously, it must also be able to provide satisfactory service to non real-time data. Such data would be encountered in the course of downloading new programs from satellites and remote servers, billing and accounting, and communicating with intelligent personal agents [RaR94].

Like any other kind of server, a multimedia server must be reliable. The larger the volume handled by a server, the more difficult it is to guarantee reliability. Special hardware and software mechanisms must be employed to provide fault tolerance for terabytes of data. Moreover, the server must have minimal down time, since client requests are asynchronous and may arrive at any time.
The server may be required to compress video and image data and encode audio data prior to storage. Consequently, it requires sophisticated scalar and floating point arithmetic performance and specialized hardware for compression and signal processing.

Clearly, a multi-user multimedia server is a complex machine with stringent requirements. The satisfaction of these requirements is well beyond the realm of personal computers or individual workstations. A natural choice for a multimedia server is a high performance computer consisting of multiple processors, connected by a high-speed interconnection network, and optimized for fast I/O. Complementary views have been expressed to this effect in the context of high performance relational database systems [Sto86, DeG92]. At the same time, it must be noted that most parallel computers available until recently have concentrated on minimizing the time required to handle workloads similar to those found in the scientific computing domain. Hence, the emphasis was laid on performing fast arithmetic and efficient handling of vector operands. On the other hand, multimedia-type applications require fast data retrieval and real-time guarantees. I/O constitutes a severe bottleneck in contemporary parallel computers and is currently the topic of vigorous research. A comprehensive survey of the problems in high-performance I/O appears in [RoC94]. Secondly, parallel computers have traditionally been expensive on account of their high-end nature and their comparatively small user community as compared to that of PCs. The advent of multimedia applications has brought the esoteric parallel machines into direct competition with volume-produced PCs and workstations.

Presently, opinion is divided as to the architectural nuts and bolts of a multimedia server. Vendors are building multimedia servers based on both conventional parallel processors and PC technology. For instance, companies like Oracle and Silicon Graphics advocate the use of powerful parallel computers to build multimedia servers, while companies like Microsoft, Intel and Compaq claim to achieve equivalent functionality at a lower cost by building servers through interconnecting the bulk-produced chips used in PCs [HPC94]. An example of the latter approach is Microsoft's Tiger file system, which uses a high-speed communication fabric to interconnect Intel Pentium-processor based nodes.

Based on the requirements outlined above, figure 3 shows a block diagram of the logical view of a high-performance multimedia server. The figure shows a server connected to a high-speed wide area network via an ATM switch and also capable of communicating with orbiting satellites. The server consists of several logical modules.
Figure 3: Logical model of a high-performance multimedia server. Example communication patterns are shown: dark lines indicate data, dotted lines indicate control information.

Although it would be natural to map each module to one or more nodes of a parallel computer, this is not necessary. The clients interact with the server through the interface nodes and the scheduler and request handler. Client requests arrive at the scheduler module. If the request is for a new stream, the module runs an admission control policy which determines whether sufficient server resources are free to accept the request and provide performance guarantees. The following activities are performed during execution of the admission control policy: the scheduler determines the location of the requested data from the storage manager module (the storage agents store the multimedia data), and then interrogates the identified storage agents as to whether the new request can be accepted. If the request can be accepted by the storage agents without violating the real-time requirements of the in-service streams, then the admission control policy selects an interface node to service the stream. Provided that such a node can be found, the request is accepted, resources are allocated and playback is started. If not, the request cannot be accepted at that time. If the client request was to resume a paused stream, then appropriate actions are taken to resume service. An interface node assumes the responsibility of servicing accepted requests. It accepts the schedule from the scheduler, and periodically retrieves data from the storage agents at the required rate for each stream in service. The monitoring and supervisory module performs overall co-ordination and embodies failure semantics.
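A minimal sketch of this admission-control flow is given below. The object names and the capacity test (can_accept, reserve) are illustrative assumptions used to make the sequence of steps concrete; they do not correspond to an actual implementation described here.

```python
def admit_stream(request, storage_manager, interface_nodes):
    """Illustrative admission control: accept a new stream only if the
    storage agents and some interface node can take the extra load."""
    # 1. Locate the storage agents that hold the requested title.
    agents = storage_manager.locate(request.title)
    if not agents:
        return None  # title must first be fetched from a remote server

    # 2. Ask each storage agent whether one more stream fits without
    #    violating the real-time guarantees of in-service streams.
    if not all(agent.can_accept(request.playback_rate) for agent in agents):
        return None

    # 3. Select an interface node with spare streaming capacity.
    for node in interface_nodes:
        if node.can_accept(request.playback_rate):
            node.reserve(request, agents)  # reserve resources and build the schedule
            return node                    # playback can now be started
    return None                            # reject the request for now
```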
6 Implementation Issues

The physical implications of the requirements of a high performance multimedia server are just beginning to be understood. Very little work on using a parallel machine as a multimedia server has been reported. Although some efforts have been made with regard to simulating and modeling parallel machines as multimedia servers [GhR93], many open issues persist, especially with regard to implementation aspects.
Storage Media
Parallelism of Retrieval
Many types of devices have been used to store persistent multimedia data. These include magnetic disks, disk arrays, CD-ROMs, magnetic tape and optical disks. The most important criteria for selecting a storage medium are high storage capacity, read/write capability, and low data access overheads. While the storage capacity of a CD-ROM is larger than that of a magnetic disk, CD-ROMs are inferior on the other counts. Consequently, magnetic disks are becoming the storage agents of choice for high performance interactive servers. The main reasons for this are that they are easily available at commodity prices, have nearly the same access overhead for both reads and writes, and are the most popular form of secondary storage in use at present. Of course, it is conceivable that CD-ROMs and magnetic tapes can be used for archiving multimedia data. Magnetic disks with capacities of the order of a few gigabytes are beginning to be available.

The familiar disk arm scheduling techniques such as Scan and FCFS evolved before the advent of multimedia concepts, and do not provide real-time guarantees. The EDF (Earliest Deadline First) approach can be used for multimedia data, but it makes the undesirable assumption that data accesses are preemptible. Therefore, new arm scheduling techniques need to be developed. Reddy and Wyllie [ReW93, ReW94] have proposed a disk arm scheduling approach for multimedia data, and characterized the disk-level tradeoffs in a multimedia server.

Currently available magnetic disks have raw data transfer rates of the order of 20 to 30 Mb/s, while MPEG-2 encoded data requires a data rate of about 4 Mb/s. More importantly, the use of a magnetic disk implies seek and rotational latencies from a few milliseconds to tens of milliseconds per access. The solution to minimizing these variable overheads is to increase the granularity of data retrieved per access, so that the overhead cost per byte decreases. The effect of increasing retrieval granularity is to increase queueing delays at the disk. These delays can be minimized by using a disk array, which reduces data transfer time by employing the aggregate bandwidth of multiple disks.

Closely tied to the problem of minimizing disk overheads is the problem of placement policy. Rangan et al. [RaV92, RVR92] have proposed a model based on constrained block allocation, which is basically non-contiguous disk allocation in which the time taken to retrieve successive stream blocks does not exceed the playback duration of a stream block. Contiguous allocation of disk blocks for a media stream is desirable, for it amortizes the cost of a single seek and rotational delay over the retrieval of a number of media blocks, thus minimizing the deleterious effects of disk arm movement on media data retrieval. However, contiguous allocation causes fragmentation of disk space if the entire stream is stored on a single disk. Moreover, if a stream is stored on a single disk, the maximum retrieval bandwidth is restricted by the data transfer rate of that disk. This placement policy causes the disk to become the bottleneck when the server has to support thousands of simultaneous streams. Ghandeharizadeh and Ramos [GhR93] get around these problems by striping media data across several nodes of a multicomputer. The effective retrieval bandwidth is then proportional to the number of nodes across which the data is striped. A further optimization is to use disk arrays at each stripe node.
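To make the striping idea concrete, the fragment below sketches round-robin placement of a stream's blocks over a subset of storage nodes, in the spirit of [GhR93]. The function and the example parameters are illustrative assumptions only.

```python
def stripe_blocks(blocks, storage_nodes, stripe_factor):
    """Assign consecutive stream blocks round-robin to `stripe_factor` nodes,
    so one service round can fetch from several nodes in parallel."""
    chosen = storage_nodes[:stripe_factor]  # the subset of nodes holding this stream
    return {blk: chosen[i % stripe_factor] for i, blk in enumerate(blocks)}

# Example: 12 blocks of one movie striped over 4 of 35 storage nodes.
placement = stripe_blocks(["blk%d" % i for i in range(12)], list(range(35)), 4)
print(placement)
```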
While a high stripe factor is desirable, there are limits on the stripe factor, both for a disk array and for the number of nodes across which the data is striped. In most systems, a peripheral device bus such as SCSI or IPI (Intelligent Peripheral Interface) connects the disks to the rest of the components. To amortize the cost of SCSI controllers, multiple disks can be connected to the system on a single bus. Depending on the type, a SCSI bus can support bandwidths in the range of 10 to 20 Mbytes/s. This imposes a limit on the number of disks that an array can consist of.
Figure 4: Effect of varying the granularity of data fetched on average delays. The graphs show the average server delays as a function of the number of streams supported per interface node, for 6 interface nodes, 35 storage nodes and a stripe factor of 4. The two graphs are for request sizes of 40 kB and 160 kB respectively.

In the case of coarse-grain striping (i.e., across nodes), the larger the stripe factor, the larger the number of data and control messages generated by a data request of a given stream from an interface node to the storage agents. Consequently, the scheduling, copying and buffering overheads increase at the interface node. In summary, striping the data across the nodes of a parallel computer increases the parallelism of retrieval and improves load balancing at the server.
Granularity of Retrieval
Due to the continuous nature of multimedia data, a server proceeds in service rounds when retrieving data for multiple users. In other words, stripe fragments are retrieved for each stream in order and buffered at the interface node. From there, data is sent out at the required rate to the client. For any stream, the data retrieved in one fetch from the storage agents lasts for a time T (whose value is determined by the playback rate), after which a fresh set of stripe fragments must be fetched.

The granularity of data retrieved from a disk on each access is an important design parameter. This is evident from the graphs in figure 4. The graphs show the average component-wise delays in retrieving data at an interface node, with a stripe factor of 4, on the Intel Paragon parallel computer [Int93]. The server configuration used was 6 interface nodes and 35 storage agent nodes. The data is MPEG-1 encoded, and the size of packets to clients (Pi) is 64 kB. In the first graph, the size of a data message from the storage agent to the interface node (Ps) is 40 kB, while in the second case it is 160 kB. The size of data fetched from disk on each access at each storage agent is the same as the size of the data message. The delay components shown are the interconnection network latency in the absence of blocking, disk data transfer time, disk seek time, disk rotational latency, interconnection network blocking time, and queueing delay at the storage node.
Figure 5: Effect of varying the server node buffer size on average delays. The graph shows the average server delays as a function of the number of streams supported per interface node, for 6 interface nodes, 35 storage nodes, a stripe factor of 4, and a storage node buffer size of 4 times the message size to the interface node (which is fixed at 40 kB).

In both graphs of figure 4, we see that as the number of streams served by an interface node increases, the last two delays increase. However, the manner in which they vary in the two graphs is interesting. The following conclusions can be drawn from these graphs: for a fixed stripe factor, the larger the granularity of data retrieved per access, the smaller the contribution of disk seek and rotational overheads to the total retrieval delay, and the longer the retrieved data lasts at the interface node. Moreover, the frequency of fetching from the storage nodes is lower; consequently, the queueing delay at the storage nodes is smaller. However, large retrieval granularity also implies larger blocking time in the interconnection network, due to the large message size. Thus, there is a tradeoff with respect to retrieval granularity. Moreover, queueing and blocking delay are load-sensitive.
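The amortization effect can be made explicit with a back-of-the-envelope model. The numbers below are illustrative assumptions, loosely based on the disk figures quoted earlier (seek and rotational latencies of a few to tens of milliseconds, and a 20 to 30 Mb/s raw transfer rate).

```python
def disk_ms_per_kb(block_kb, seek_ms=12.0, rotation_ms=6.0, transfer_mbps=25.0):
    """Total disk time for one access divided by the kilobytes fetched.

    Larger blocks amortize the fixed seek/rotation cost over more bytes,
    at the price of larger messages (and more blocking) in the interconnect.
    """
    transfer_ms = (block_kb * 8.0) / transfer_mbps  # kB -> kilobits -> ms at Mb/s
    return (seek_ms + rotation_ms + transfer_ms) / block_kb

for kb in (40, 160):
    print(kb, "kB per access:", round(disk_ms_per_kb(kb), 2), "ms of disk time per kB")
```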
Buffering
Service time for a data request can be made more predictable by increasing the granularity of data retrieved per disk access. However, in a high performance server consisting of multiple nodes connected by a network, the granularity of data retrieval and of data transfer from storage node to interface node are both design parameters, as shown by figure 4. One way to combine the benefits of small message size and large data retrieval granularity is to buffer the data at the storage nodes, in addition to buffering it at the interface node. This is illustrated by the graph in figure 5, which shows the effect of retrieving 160 kB from the disk at the storage node, but sending only 40 kB messages from the storage nodes to the interface nodes. The other parameters are the same as for figure 4. In other words, a buffer of 160 kB per stream is used at the storage node. Clearly, buffering is very effective for minimizing variations in service. However, the large size of multimedia data means that buffer space must be allocated in substantially large chunks, say 64 kB, as opposed to the few kilobytes used by most operating systems. More importantly, a high performance server will require a large main memory, of the order of 100 megabytes. In summary, the amount of buffer space that should be allocated to a stream is a tradeoff between better control of service variation and the cost of buffer space.
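The idea of decoupling the disk fetch size from the network message size can be sketched as below. The 160 kB and 40 kB sizes mirror the experiment above; the callables and class structure are illustrative assumptions.

```python
class StorageNodeStreamBuffer:
    """Per-stream buffer at a storage node: fill with large disk reads,
    drain with small messages to the interface node (illustrative sketch)."""

    def __init__(self, disk_read, send, fetch_bytes=160 * 1024, msg_bytes=40 * 1024):
        self.disk_read = disk_read    # callable: read `fetch_bytes` from disk
        self.send = send              # callable: send one message to the interface node
        self.fetch_bytes = fetch_bytes
        self.msg_bytes = msg_bytes
        self.buf = b""

    def service_round(self):
        """Called once per service round for this stream."""
        if len(self.buf) < self.msg_bytes:                 # refill only when nearly drained,
            self.buf += self.disk_read(self.fetch_bytes)   # amortizing seek and rotation
        msg, self.buf = self.buf[:self.msg_bytes], self.buf[self.msg_bytes:]
        self.send(msg)                                     # small message: less network blocking
```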
Scheduling mechanisms
The real-time nature of multimedia data retrieval has profound implications for resource scheduling at the operating system level. The server should be able to provide real-time guarantees for individual streams as well as for all the streams combined. In order to provide guarantees for a stream, the individual steps in access and delivery, such as access from disks, buffering and copying (if buffering is done), communication to the network port, etc., must all be guaranteed. Moreover, accepting a user request for a new stream should ensure that existing streams continue to be served with the promised guarantees. Evaluating resource availability and determining whether a new request can be safely accepted is the task of the admission control policy [RVR92, AOG92]. The policy performs the necessary resource reservation to serve a stream, and keeps track of resources committed but temporarily unused due to a client pause.

The issue of how to handle a client pause is an open question. One approach is to let the resources committed to a stream remain marked as allocated until the client resumes. However, this approach is likely to lower server utilization and throughput. The alternative is to release the allocated resources when a client pauses and reacquire them on resume. The response time seen by the client will then increase, and the admission control policy will have to be rerun. The choice between the two approaches depends on the duration of a client pause, and this is not well understood at present. One compromise is to reserve some server resources for the dedicated purpose of handling paused streams.

The scheduling problems associated with multimedia data become all the more severe when a parallel machine, with multiple CPUs interconnected by a low latency communication mechanism, is used as a multimedia server; large variances typically occur in the time it takes to send a point-to-point message. Techniques need to be developed for efficient handling of interrupts and for reducing the difference between worst-case and average-case latency that contributes to variance in fine-grain parallel processors [Mar94].

Another factor that has received little attention until now is dynamic scheduling. Some delays in servicing requests are workload-dependent. Queueing delay due to a single thread of control or a single service resource (for example, a disk) is one such case. Another example is shown in figure 6. The figure shows a mesh connected server with four interface nodes and eight storage nodes. Interface node I3 is serving two streams whose data reside on storage nodes S1 and S2 respectively, while node I4 is serving a stream whose data reside on S1. A popular technique for switching data from the input channels to the output channels of the network routers is wormhole routing [NiM93, MTR94]. By its very nature, wormhole routing is highly susceptible to deadlock conditions. Various routing algorithms have been proposed and used to provide deadlock-free wormhole routing. The most popular of these for a mesh connected computer is deterministic XY routing, in which packets are first sent along the X dimension, and then along the Y dimension; this is the technique depicted in the figure. XY routing is a static routing technique. The problem with using this method for multimedia data is that link contention can cause large and unpredictable network blocking delays. For example, some link may be heavily used even when alternate paths, which would result in less contention and thus smaller delays, exist. Link l2 is one such link in the figure. Hence, dynamic (adaptive) deadlock-free routing techniques need to be developed.
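For reference, the deterministic XY route between two mesh nodes can be computed as below. This is a small illustrative sketch using (x, y) node coordinates; it is not tied to the Paragon's actual router.

```python
def xy_route(src, dst):
    """Deterministic XY routing on a 2-D mesh: correct X first, then Y.

    Every packet between the same pair of nodes takes this single path,
    which is why a hot link (like l2 in figure 6) cannot be routed around.
    """
    x, y = src
    path = [src]
    while x != dst[0]:                  # travel along the X dimension first
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:                  # then along the Y dimension
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path

print(xy_route((0, 0), (3, 1)))  # [(0, 0), (1, 0), (2, 0), (3, 0), (3, 1)]
```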
Figure 6: Example of interconnection network contention

Another factor that has a major bearing on scheduling mechanisms is the QOS requirements of clients. Depending on whether a client can tolerate no loss or some loss of data, hard or soft real-time guarantees can be provided. For example, [RVR92] categorizes real-time clients into two classes, those that require hard and soft performance guarantees respectively. For the latter class, the worst case assumptions made in admitting new users are relaxed based on the observed server load; this enables the server to increase the number of users that can be supported.
Replication and Fault Tolerance
Unlike in scientific computing environments, where the reliability and availability of a system are desired but not critical, in the commercial environment they are extremely important. If on-demand services are not reliable and/or available, the economic viability of the business would be low. Therefore, hardware and software solutions that provide high reliability and availability are of great importance in the server design. These solutions must reliably serve individual customers (both in terms of quality and of uninterrupted service), make the system as a whole reliable, and provide ways to recover from faults quickly with minimal degradation in service. Reliability ranges from being able to serve one stream reliably (e.g., surviving failure of the disk on which the stream data is stored) to being able to serve a large number of streams reliably. In the normal mode of operation, scheduling and access techniques have an impact on reliability in terms of QOS (for example, the percentage of packets dropped). However, handling failure requires other measures. These include keeping redundant copies (for example, data mirroring), making copies of popular streams on the fly, and being able to reassign the task of serving specific clients if the node(s) serving them fail. Striping data in the right fashion can help improve the availability of data in the case of failures. For example, if a (primary) stream is striped over a number of disks (instead of residing entirely on one disk) and its copy is also striped over a number of (different) disks, then upon a disk failure only a part of the data is lost. Thus, on average, this scheme may provide more time to recover without degrading the QOS, because the probability that a particular stripe fragment is being served at the time of failure is smaller by a factor equal to the number of nodes on which the stream data is striped. Similarly, being able to reroute the data in case of node failures can improve the availability of the system.
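As a simple illustration of striped replication, the sketch below places a primary and a backup copy of each stripe fragment on disjoint node sets. The placement rule and data structures are illustrative assumptions, not a scheme prescribed by the discussion above.

```python
def replicated_placement(fragments, nodes, stripe_factor):
    """Place primary and backup copies of stripe fragments on disjoint node sets.

    If one node fails, only about 1/stripe_factor of the stream's fragments
    are affected, and each of them still has a live copy elsewhere.
    """
    primaries = nodes[:stripe_factor]
    backups = nodes[stripe_factor:2 * stripe_factor]   # disjoint from the primaries
    return {frag: (primaries[i % stripe_factor], backups[i % stripe_factor])
            for i, frag in enumerate(fragments)}

# Example: 8 fragments of one stream, primary copies on nodes 0-3, backups on 4-7.
print(replicated_placement(["frag%d" % i for i in range(8)], list(range(12)), 4))
```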
Figure 7: Effect of the ratio of the number of S nodes to the number of interface nodes, for the same total number of nodes. The figure on the left is for a ratio of 4:1, while that on the right is for a ratio of 6:1.

In general, improvements in reliability and availability are also greatly dependent on the support provided in the hardware and system software. That is, the techniques employed will also depend on the specific server architecture if they are to provide cost-effective solutions.
Configuration Tradeoffs
Economic factors can limit the total number of nodes available to the designer of a multimedia server. Given a fixed number of nodes, interesting tradeoffs are possible in designating the nodes as storage nodes or interface nodes. Since it is the interface nodes that actually source the client streams, it is desirable that their number be large, so that the total streaming capacity of the server is high. (It must be noted that the number of interface nodes cannot be arbitrary: the server architecture and the number of ports provided by the switch interface between the server and the WAN impose an upper bound on the number of interface nodes.) On the other hand, since it is the storage nodes that actually store the media data, it is desirable that their number be large as well. Figure 7 depicts the tradeoffs involved in varying the ratio of storage nodes to interface nodes, given a certain total number of nodes. We used a total of 41 nodes, with a stripe factor of 4 and a Ps value of 64 kB. In the first case, the server was configured as 8 interface nodes and 33 storage nodes, while in the second case it was configured as 6 interface nodes and 35 storage nodes. Thus the ratio of the number of storage nodes to the number of interface nodes was approximately 4:1 and 6:1, respectively. The figure shows the component delays as a function of stream load for the two cases. We observe that in the former case the storage node queueing delay is the largest individual component, while in the latter case the network blocking time is the largest component. This is because the number of storage nodes available to store media data decreases in the first case, while at the same time the total number of streams that must be served increases. Hence, the storage nodes become the throughput bottleneck. We can conclude that a low S to I ratio results in a higher average total retrieval time compared to a high S to I ratio; in particular, the S node queueing delay is much higher for a low S/I ratio than for a high S/I ratio.
Figure 8: Example distribution of rental probability of movies in a neighbourhood VOD server

Given a fixed total number of nodes and a certain ratio of S nodes to I nodes, the designer can increase the ratio so that more storage space is available. Although the total number of streams that the server can source will decrease, the designer can then afford to choose disks with lower performance, so that the same quality of service can be guaranteed to clients at a lower net server cost.
Caching Optimizations
A very important operational aspect that has received very little attention is the use of caching techniques to exploit knowledge of user request patterns. It is to be noted that traditional cache memories are too small to be of any use for holding multimedia data. Hence the word 'cache', when used in the context of multimedia data, actually refers to the use of processor main memory for buffering purposes. The chief premise behind the use of caching techniques for multimedia data is that user requests for stored data are not uniformly distributed. For example, the workload imposed on a neighbourhood VOD server would be higher in the evening and at night than during the afternoon. A more important case is the fact that some movies will be more popular than others (figure 8). For example, it is reasonable to assume that there will be many more requests for a newly released movie on a given night than for a movie that was released five years earlier. Given a non-uniform request rate for different movies, various optimizations can be developed to increase server throughput by caching the popular movies.

At the storage level, one approach is to use replication, and to store the replicas on different subsets of nodes. Assuming that a single copy of a popular movie is inadequate to service the anticipated requests for that movie, replication helps to achieve load balancing by using different copies of the data to serve multiple streams of the same movie. There is a tradeoff between the extra storage space required by a replica (of the order of gigabytes) and the better load balancing and ability to serve more streams due to data replication. Another optimization is possible if it is dynamically possible to reconfigure the server so that an interface node also becomes a storage node. This can be of use when a media object is accessed with high frequency. The throughput of the server can then be increased by migrating the frequently accessed media object from the storage nodes on which it is stored to local disk(s) at an interface node. The interface node can then be dedicated to serving requests for that media object. The price paid is the increase in complexity of the server software.
Another approach to increasing server throughput is to accumulate requests over an interval of time, and avoid multiple fetches for requests received for the same object during that interval. We call this method gang scheduling. For instance, if during a gang window of 5 minutes, 10 requests are received for a certain object, then at the end of the gang window the server can start retrieving data for only one stream and source 10 client streams from the same data. Gang scheduling involves the extra overhead of accumulating requests over the gang window and searching through the accumulated requests to identify repeated requests. In effect, this method delays the servicing of some admitted requests in order to minimize the load on the server. Hence there is a tradeoff between the increased response time for clients and the reduction in server workload. Consequently, the size of the gang window is a crucial parameter in making use of gang scheduling. One problem with this approach is that if one of the clients interrupts the stream, say for pausing or fast forward, then that client falls out of phase with the single stream being retrieved. In such a case, the server should be able to dynamically establish a fresh server-interface stream for the interrupting client. Other caching approaches include pinning part or all of a popular media object's data in RAM so as to hide the latency of disk accesses, and delaying the freeing of media data buffer space in RAM as long as possible, so that later requests for the 'stale' data can be served directly from RAM instead of from disk. Of course, these assume that the available buffer space is large and the media data size is reasonable.
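A minimal sketch of gang scheduling as described above is given below: requests that arrive within one gang window for the same title are served from a single retrieval stream. The class structure, the attach method and the window length are illustrative assumptions.

```python
from collections import defaultdict

class GangScheduler:
    """Batch requests for the same title that arrive within one gang window."""

    def __init__(self, start_stream, window_seconds=300):
        self.start_stream = start_stream   # callable: begin one retrieval stream for a title
        self.window_seconds = window_seconds
        self.pending = defaultdict(list)   # title -> clients that requested it this window

    def submit(self, title, client):
        """Record a client request; it is serviced when the window closes."""
        self.pending[title].append(client)

    def close_window(self):
        """At the end of the gang window, start one stream per distinct title."""
        for title, clients in self.pending.items():
            stream = self.start_stream(title)   # a single retrieval from the storage nodes
            for client in clients:
                stream.attach(client)           # all clients are fed from the same data
        self.pending.clear()
```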
7 Current Status of On-Demand Technology

Industry, academia and government are racing against time to construct the information superhighway. One of the most visible facets of the superhighway is the effort to provide on-demand services as embodied by VOD. Companies like Oracle/nCUBE, Intel and IBM are trying to provide on-demand services by using distributed memory multicomputers as servers. Others, like Silicon Graphics, Digital and Hewlett Packard, are using multiprocessor technology to build multimedia servers. Much of the hype surrounding VOD has abated due to the failures and delays in field trials. The coming years will decide whether geographically distributed interactive multimedia services are technologically and economically feasible.
8 Conclusions

The coming of age of multimedia concepts has let loose a flood of activity to implement and provide on-demand services. While networking technology is relatively well advanced in this regard, server technology has not kept pace. Many open questions persist with regard to high performance I/O for multimedia. Traditional parallel computers have been built to handle scientific or, at best, transaction-oriented workloads. A great deal of work needs to be done in areas like real-time scheduling, parallel I/O, reliability, scalability, dynamic scheduling, and caching techniques for multimedia data stored on parallel machines. In summary, the answer to the question of whether high performance computers can be used as multimedia servers in a cost-effective and timely manner will govern whether large scale on-demand services mature from a plug-and-pray technology to a plug-and-play technology.
Acknowledgements
The authors wish to thank A. N. L. Narasimha Reddy of IBM for many helpful suggestions, and Dave Riss and Denise Eklund of Intel SSD for technical inputs and support.
References

[BCG+92] P. B. Berra, C.-Y. Chen, A. Ghafoor and T. Little. Issues in networking and data management of distributed multimedia systems. In Proceedings of the First International Symposium on High Performance Distributed Computing, September 1992.

[RaV92] P. V. Rangan and H. Vin. Efficient storage techniques for digital continuous multimedia. IEEE Transactions on Knowledge and Data Engineering, Vol. 5, No. 6, August 1993.

[Lan94] J. Lane. ATM knits voice, data on any net. IEEE Spectrum, February 1994.

[Gal91] D. Le Gall. MPEG: a video compression standard for multimedia applications. Communications of the ACM, April 1991.

[ReW93] A. Reddy and J. Wyllie. Disk scheduling in a multimedia I/O system. Proceedings of the 1st ACM Intl. Conference on Multimedia, August 1993, pg. 225.

[ReW94] A. Reddy and J. Wyllie. I/O issues in a multimedia system. IEEE Computer, March 1994.

[Gel+91] A. Gelman, et al. A store and forward architecture for video-on-demand service. Proc. IEEE ICC, IEEE Press, 1991.

[RVR92] P. V. Rangan, H. Vin and S. Ramanathan. Designing an on-demand multimedia service. IEEE Communications, Vol. 30, No. 7, July 1992.

[GhR93] S. Ghandeharizadeh and L. Ramos. Continuous retrieval of multimedia data using parallelism. IEEE Transactions on Knowledge and Data Engineering, Vol. 5, No. 4, August 1993.

[AOG92] D. Anderson, Y. Osawa and R. Govindan. A file system for continuous media. ACM Transactions on Computer Systems, Vol. 10, No. 4, November 1992.

[RaR94] S. Ramanathan and P. V. Rangan. Architectures for personalized multimedia. IEEE Multimedia, Spring 1994.

[Low95] S. J. Lowe. Technology 1995: Data communications. IEEE Spectrum, January 1995.

[PRR94] C. Papadimitriou, S. Ramanathan and P. V. Rangan. Information caching for delivery of personalized video programs on home entertainment channels. Proc. of the Intl. Conference on Multimedia Systems and Computing, May 1994.

[Mar94] R. Marz. Reducing the variance of point-to-point transfers for parallel real-time programs. IEEE Parallel and Distributed Technology, Winter 1994.

[VG+94] H. Vin, A. Goyal, et al. An observation-based admission control algorithm for multimedia servers. Proc. of the Intl. Conference on Multimedia Systems and Computing, May 1994.

[DK+92] A. L. Drapeau, R. Katz, G. Gibson, et al. RAID-II: a scalable storage architecture for high-bandwidth network file service. University of California at Berkeley, Technical Report UCB/CSD-92-672, 1992.
[Int93] Intel Corporation. Paragon OSF/1 User's Guide, Intel Supercomputer Systems Division, February 1993.

[Per94] T. Perry. Technology 1994: Consumer electronics. IEEE Spectrum, January 1994, pg. 30.

[Aok94] T. Aoki. Digitalization and integration portend a change in life-style. IEEE Spectrum, January 1994, pg. 34.

[NiM93] L. Ni and P. McKinley. A survey of wormhole techniques in direct networks. IEEE Computer, February 1993.

[HPC94] HPCwire (electronic magazine), article number 4097, May 1994.

[Sto86] M. Stonebraker. The case for shared-nothing. Database Engineering, Vol. 9, No. 1, 1986.

[DeG92] D. DeWitt and J. Gray. Parallel database systems: the future of high-performance database systems. Communications of the ACM, Vol. 35, No. 6, June 1992.

[RoC94] J. Rosario and A. Choudhary. High-performance I/O for parallel computers: problems and prospects. IEEE Computer, March 1994.

[MTR94] P. McKinley, Y. Tsai and D. Robinson. A survey of collective communication in wormhole-routed massively parallel computers. Technical Report MSU-CPS-94-35, Dept. of Computer Science, Michigan State University, June 1994.