Short Title: Scalable Web based MOD Services
Buddhikot, D.Sc. 1998
WASHINGTON UNIVERSITY SEVER INSTITUTE OF TECHNOLOGY DEPARTMENT OF COMPUTER SCIENCE
PROJECT MARS: SCALABLE, HIGH PERFORMANCE, WEB BASED MULTIMEDIA-ON-DEMAND (MOD) SERVICES AND SERVERS by Milind M. Buddhikot Prepared under the direction of Guru M. Parulkar
A dissertation presented to the Sever Institute of Washington University in partial fulfillment of the requirements for the degree of Doctor of Science August, 1998 Saint Louis, Missouri
WASHINGTON UNIVERSITY SEVER INSTITUTE OF TECHNOLOGY DEPARTMENT OF COMPUTER SCIENCE
ABSTRACT
PROJECT MARS: SCALABLE, HIGH PERFORMANCE, WEB BASED MULTIMEDIA-ON-DEMAND (MOD) SERVICES AND SERVERS by Milind M. Buddhikot
ADVISOR: Guru M. Parulkar
August, 1998 Saint Louis, Missouri
This dissertation describes the cost-effective design and implementation of scalable, Web-based, high performance multimedia-on-demand (MOD) servers and services. An important aspect of this dissertation has been prototyping and deploying MOD applications, services, and servers, and learning from this experience. The three main components of this dissertation are (1) Web-based interactive MOD services, (2) innovative enhancements to a server node operating system (OS) to support such MOD services, and (3) the design and prototyping of a scalable server architecture and associated data layout and scheduling schemes to support a large number of independent, concurrent clients. We first describe the design and prototyping of two example multimedia-on-demand services, namely an interactive recording service for content creation and a fully interactive playback service for audio/video documents. The key aspects of the design of these services are the use of the Web as the access interface and a clean separation of the bandwidth-intensive data path from the low-overhead control path.
We conclusively demonstrate that the periodic nature of the multimedia data handled by these services requires guaranteed access to CPU and storage resources in order to provide real-time Quality-of-Service (QOS) guarantees. Current general purpose operating systems such as UNIX, on which such services are implemented, do not provide such guarantees. In addition, data transfers between storage and network devices require excessive data copying and superfluous system calls. Consequently, naive service implementations on such operating systems perform very poorly. To rectify these limitations, we have designed and prototyped the following innovative OS enhancements in the context of 4.4 BSD UNIX systems: (1) a novel co-operative scheduling technique called Real-Time Upcalls (RTU) that provides guaranteed application-level CPU access; (2) multiple priority queues serviced by a Deficit-Round-Robin (DRR) fair queuing algorithm within the SCSI storage driver to provide fair, guaranteed access to storage bandwidth; (3) a new buffering system called MultiMedia Buffers (mmbufs) that provides zero-copy data transfers between the storage and network subsystems; and (4) a new system call interface, called the stream API, that makes these enhancements accessible from user level and allows aggregation of multiple stream read-send requests into a single system call, thus minimizing overheads. Our experimental measurements clearly demonstrate up to 20% throughput improvement from the new buffer system and stream API even on small disk arrays. We also show that the enhanced SCSI system with DRR provides efficient sharing of resources between real-time and non-real-time requests. Further, a video server prototype with two 2-disk software arrays, running an OS with these enhancements, comfortably supports 7 video/audio sessions with an aggregate bandwidth of 75 Mbps.¹
To tackle the problem of scalability, we propose and prototype a distributed storage architecture consisting of a cluster of multiple storage nodes, in the form of PCs interconnected by an ATM cell-switched interconnect. Each node in this distributed architecture runs the OS with the extensions described above. We have developed a family of distributed data layouts that stripe data over the storage nodes in constant-time-length chunks, and have analyzed the properties that guarantee load-balanced operation of the cluster during interactive playback. We have also designed and implemented distributed scheduling techniques to support synchronized playback of striped data to the clients of such a distributed storage server. We have demonstrated that, with sufficient client-side buffering, these simple scheduling techniques yield very good quality multimedia playback. In summary, through extensive prototyping and development, this thesis conclusively demonstrates that scalable, high performance MOD services and servers can be built using commodity software and hardware components.
¹The ATM interface used in the prototype limits the number of sessions to 7. This is not an inherent limitation of the system; with better ATM interfaces this metric can be improved.
To my Aai (Mom) and Baba (Dad), who taught me the importance of education, knowledge, compassion, and hard work
Contents

List of Tables
List of Figures
Acknowledgments

1 Introduction
  1.1 MOD Environment and Choices
  1.2 Challenges
  1.3 Overview of Solutions
  1.4 Summary

2 Essential Background
  2.1 Multimedia-On-demand Services
    2.1.1 Properties of Request Arrival Process
    2.1.2 MOD Playback Service Types
    2.1.3 Semantics of Interactive Control for Playback Services
  2.2 Clients of Multimedia-On-Demand Services
  2.3 Multimedia-On-Demand Servers
    2.3.1 Hierarchical Network Model
    2.3.2 Performance Metrics for a MOD Server
    2.3.3 Taxonomy and Hierarchy of Storage Servers
  2.4 Summary

3 Research Overview
  3.1 Research Questions
  3.2 Overview of Existing Solutions
  3.3 Innovative Ideas
  3.4 Our Research Approach
    3.4.1 Building MOD Services
    3.4.2 Building MOD Servers
    3.4.3 Achieving Load Balance
  3.5 Contributions
  3.6 Dissertation Outline

4 Simple Web Based Multimedia-On-Demand Services
  4.1 MultiMedia Explorer (MMX) Client Device
    4.1.1 Characteristics of MMX
    4.1.2 MMXD: A MMX Control Multiplexing Daemon
  4.2 Design of MOD Services
  4.3 Web Based Multimedia Recording Service
    4.3.1 Client Application: Record GUI
    4.3.2 The Recording Server - recordd
  4.4 Web Based MOD Interactive Playback Service
    4.4.1 A Simple Application Level Streaming Protocol
    4.4.2 Client Application - MMX Control Interface
    4.4.3 Interactive Playback Server Implementation
  4.5 Experiences
  4.6 Summary

5 OS Extensions for QOS Guarantees and High Performance
  5.1 Limitations of UNIX CPU Scheduling
  5.2 Limitations of Existing Storage and Network I/O in UNIX
    5.2.1 Common Read/Write Application Interface
    5.2.2 File System I/O
    5.2.3 Network I/O
    5.2.4 Control and Data Paths for Disk Read and Network Send
  5.3 Demonstration of the Limitations of UNIX as a Multimedia OS
    5.3.1 Effect of CPU and Storage Loads on MOD Playback
  5.4 Summary of Limitations of UNIX for Networked Multimedia
  5.5 Overview of OS Enhancements
    5.5.1 Periodic Data Transfer Guarantees
    5.5.2 Multimedia Mbufs: A New Buffer Management System
    5.5.3 Priority Queuing within the SCSI Driver
    5.5.4 Stream Application Programming Interface (API)
  5.6 Providing Guaranteed CPU Access
    5.6.1 Real Time Upcall (RTU)
    5.6.2 A MOD Playback Server with RTUs
  5.7 Design of the mmbuf Buffering System
  5.8 Periodic QoS Guarantees from Storage System
    5.8.1 Priority Queuing within the SCSI Disk Driver
    5.8.2 Implementation of DRR with Two Priority Queues in NetBSD
  5.9 Streams API
  5.10 Concatenated Disk Driver (ccd)
  5.11 Performance Evaluation
    5.11.1 Performance Benefits of mmbufs and Stream API
    5.11.2 QOS Guarantees in SCSI
    5.11.3 Periodicity of User Level Data Transfers
  5.12 Related Work
  5.13 Summary

6 Design of a High Performance MOD Server
  6.1 Design of the MOD Server
    6.1.1 Control Flow to Set Up a New Session
    6.1.2 Data Path Architecture for the SNMOD Server
  6.2 Discussion
    6.2.1 Extensibility
    6.2.2 Limitations of the Prototype Playback Server
  6.3 Performance Evaluation
    6.3.1 Improved CPU Availability
    6.3.2 Improved Streaming Performance
  6.4 Summary

7 Towards Highly Scalable Servers and Services
  7.1 Massively-parallel And Real-time Storage Architecture
    7.1.1 Basic Idea
  7.2 Storage Node Architecture
  7.3 MARS Storage Server Examples
  7.4 Basics of Distributed Data Layout and Scheduling
    7.4.1 Distributed Data Layouts
    7.4.2 Data Striping Service
    7.4.3 Distributed Scheduling: A Simple Scheme
  7.5 Issues in Design of a Distributed Scheduling Scheme
    7.5.1 Implications of Interactive Operations
    7.5.2 Implications of Granularity of Data Prefetch and Transmission at a Node
  7.6 BEat Directed Scheduling (BEADS) Scheme
  7.7 A Prototype Distributed Playback Service
  7.8 Performance of Distributed Playback
    7.8.1 Effect of Number of Nodes
  7.9 Related Work
    7.9.1 High Bandwidth Disk I/O for Supercomputers
    7.9.2 Multimedia Servers
  7.10 Summary

8 Load Balance Properties of Distributed Layouts
  8.1 Load Balance Properties of GSDCL ks Layouts
  8.2 Basic Equations
  8.3 Safe Skipping Distances for GSDCL with ks = 0
  8.4 Safe Skipping Distances for GSDCL with ks = 1
  8.5 Safe Skipping Distances for GSDCL with Arbitrary ks
  8.6 Implications of MPEG
  8.7 Related Work
  8.8 Summary

9 Conclusions and Future Work
  9.1 Contributions
    9.1.1 Web Based MOD Services
    9.1.2 Enhancements to 4.4 BSD UNIX Server OS
    9.1.3 Scalable Storage Server and Services Architecture
    9.1.4 Load Balance Properties of Distributed Layouts
  9.2 Future Directions
    9.2.1 MOD Services
    9.2.2 Server OS Enhancements
  9.3 Final Remarks

References
Vita
List of Tables

2.1 Taxonomy of Storage Servers
4.1 MMX video markers
5.1 Read time for normal rd/send and streamrd/send
7.1 Prefetch information at a node
7.2 Frame and node sets for all connections
8.1 Road-map for various analytical results
List of Figures

1.1 MOD server request-response model
2.1 MOD server request-response model
2.2 Shared Viewing with Constraints (SVC)
2.3 MOD server for a Dedicated Viewing service (DV)
2.4 A World-Wide-Web based Multimedia-on-demand (MOD) application
2.5 A typical Video-on-demand application
2.6 Buffer scenarios at the client
2.7 Hierarchical Network Model
3.1 Building MOD Services
3.2 Building MOD Services
3.3 Two server prototypes
3.4 Single node server
3.5 Distributed MARS server prototype
4.1 Basic internal architecture of MMX
4.2 MMXD control multiplexing daemon
4.3 Service access web page
4.4 Abstract directory structure
4.5 Recording service: components
4.6 Client application for accessing recording service
4.7 Process structure of the recordd daemon
4.8 Steps in preprocessing the recorded multimedia
4.9 Format of the MMX MJPEG video
4.10 Format of the meta data
4.11 MoD Document Access: Main page
4.12 MoD Document Access: Milind's video page
4.13 Fully Interactive Playback Service: Components
4.14 Basics of WWW
4.15 Client application for fully interactive playback
4.16 Organization of the NCSA httpd server
4.17 Modifications to httpd
4.18 Streaming architecture for httpd+
5.1 Existing file and network I/O
5.2 proc, file table and other relevant data structures
5.3 Buffer cache block structure
5.4 Mbuf structure
5.5 Read function trace
5.6 Read function trace in SCSI driver
5.7 Function trace for data send
5.8 Effect of CPU load on MOD playback
5.9 httpd+ performance in presence of CPU and storage load
5.10 New enhancements
5.11 The model for user level Real Time Upcalls
5.12 RTU deadline misses for different disk loads
5.13 New Multimedia Memory Buffer (mmbuf)
5.14 Mmbuf system's free lists
5.15 Invocation of mmbuf interface functions
5.16 New priority queuing SCSI system
5.17 Service rounds and two-level disk scheduling
5.18 DRR fair queuing for a communication link
5.19 DRR fair queuing in a SCSI driver
5.20 Old sd_softc data structure
5.21 Modified sd_softc data structure
5.22 State created by streamopen() call
5.23 Parallel disk I/O with ccd
5.24 Interaction of mmbuf and ccd
5.25 MMBUF data path from a fast disk
5.26 MMBUF data path from a regular disk
5.27 Setup for Experiment 2
5.28 NRT read time vs. real-time load
5.29 NRT read time vs. RT fraction allocated
5.30 RT read time vs. RT fraction allocated
5.31 Buffering scheme used by the test programs
5.32 Deadline miss probability for requests on same storage
5.33 Loss of bandwidth for connections on the same storage
6.1 Design of the single node MOD server
6.2 Function trace for session open
6.3 Streaming architecture
6.4 Transmit portion of the ENI ATM interface
6.5 Improved CPU availability with the second generation server
6.6 Experimental setup
6.7 Deadline miss probability vs. load
6.8 Throughput performance of the new server
6.9 Average time spent in RTU
6.10 Average latency for session setup operations
7.1 Massively-parallel and Real-time Storage (MARS) system
7.2 A PC based storage node
7.3 Storage Node Design
7.4 A prototype implementation of a MARS server
7.5 Cluster Based Storage (CBS) architecture for MARS
7.6 A prototype MARS server using ATM switch
7.7 Layout example
7.8 Generalized Staggered Distributed Data Layouts
7.9 Striping service
7.10 A simple scheme for reads
7.11 Revised schedule when C0 performs fast forward
7.12 General case of M out of Ca connections doing fast forward
7.13 Cycle and Sub-cycle
7.14 Distributed scheduling implementation
7.15 An example of connections in different playout states
7.16 Distributed multimedia playback
7.17 Two control connections in the playback server
7.18 Implementation using RTUs
7.19 Reduction in throughput requirements
8.1 Generalized Staggered Distributed Data Layouts
8.2 General distribution cycle with anchor node p
8.3 Simple GSDCL 0 layout with five nodes
8.4 Staggered Distributed Cyclic Layout (SDCL 1) with ks = 1
8.5 Generalized Staggered Distributed Layout with ks = 3
8.6 Generalized Staggered Distributed Layout with ks = 2
8.7 Structure of MPEG stream
Acknowledgments

We are all faced throughout our lives with agonizing decisions, moral choices. Some are on a grand scale; most of the choices are on lesser points, but we define ourselves by the choices we have made. We are in fact the sum total of our choices. Events unfold so unpredictably, so unfairly; human happiness does not seem to have been included in the design of creation. It is only we, with our capacity to love, who give meaning to an indifferent universe ... and yet most human beings seem to have an amazing ability to keep trying, and to even find joy from simple things like their family, their work, and from the hope that future generations might understand more.
Woody Allen, Crimes and Misdemeanors

The day I made the choice to pursue my Doctorate, I was unaware of the exhilarating years of highs, lows, enlightenment, and excitement ahead of me. In some sense, that decision has defined and will continue to define my life. I believe completion of one’s Doctoral dissertation is a life-shaping event that warrants some retrospection. I take this opportunity to look back and thank the people who have been with me on this journey. First and foremost, I would like to thank my advisor Prof. Gurudatta Parulkar for offering me a research assistantship to join the premier networking research program at Washington University and pursue my ambition of a Doctorate degree. Over the years that I worked with him, the many facets of his “professional personality,” such as his technical expertise, his clear thinking and communication, his constant enthusiasm, and his penchant for grand things, have been a guiding light and inspiration. He has been a “true guru” – an advisor, a friend, and a guardian. I am grateful to him for the financial support, travel grants, and equipment he provided throughout my graduate student life; in many other ways, too, he has been instrumental in making everything possible. I am also thankful to him for being patient during the initial “hard” times when I was struggling to define my research topic. Overall, looking back, my partnership with him has proved to be a truly rewarding and exciting time in my life. I am also grateful to Guru’s better half, Kalpana, for innumerable Saturday and Sunday lunches whenever I visited their house on the pretext of doing work, and for always providing an attentive ear to my stories. I would like to thank my committee for taking the time to learn about my work and to provide me with criticism and help. The members of my committee have provided me
useful feedback and have also been a source of inspiration. Prof. Jerry Cox’s initial involvement and encouragement in the formulation of my project was valuable. In my mind, his boundless energy, his brilliant professional career, and his wisdom have set a standard that I will strive to achieve in my professional life. Prof. Turner participated in numerous meetings throughout the length of my project to provide insightful comments, suggestions, and criticism. His brilliance, his work ethic, and his professional success have been awe-inspiring to me. Prof. Fred Rosenberger has been helpful throughout the project, always providing comments and information that have been useful in building my experimental system. Dr. Hemant Kanakia graciously took time from his successful start-up venture to serve on my thesis committee. I consider myself fortunate to have had the opportunity to be part of Washington University’s Computer Science Department and its two world-class research centers – namely, the Computer and Communications Research Center (CCRC) and the Applied Research Labs (ARL). I would like to thank Prof. Turner, Prof. Cox, Prof. Parulkar (Guru), Prof. Franklin, Prof. Chamberlain, and Prof. George Varghese for creating and maintaining these labs and fostering excellence. The administrative staff in the CS, CCRC, and ARL offices have been outstanding. I would like to thank Sharon and Peggy for patiently helping me with never-ending equipment purchase orders, tax matters, and ever-so-late coffee bill payments. I would like to thank Myrna for always indulging in a nice conversation whenever I passed by her desk. The “calm-and-commanding” Jean was always a great help with the dissertation documents and other matters. I am also grateful to Paula for never complaining whenever I would go to her with my “Oh-I-locked-my-keys-in-my-office” story.
Most of my time as a graduate student was spent in the cozy, comfortable “CCRC back hall.” Over the years, the members of this back hall kept me entertained, and some even played a crucial role in my research. I would like to thank several members of Guru’s research group – the g-troup – for their great friendship. First of all, my heartfelt thanks to my office mate Chuck Cranor for being an accommodating co-inhabitant of Bryan 406. He has been my “complete walking-talking-systems-encyclopedia” and, above all, a great friend. Through his playful bantering about simulation-based research and his strong emphasis on “hands-on work,” he has been instrumental in making me recognize the importance of building real systems. Without his help with NetBSD, Project MARS would have taken much longer. I am also thankful to him for introducing me to iced tea, Perl, Patsy Cline, REM, instrumental rock, Elvis, and Levis 550. I would also like to thank Chuck’s better half, Lorrie Cranor, for always being there to give sound advice on various matters ranging from “under-garments” to “money-management.” During my early days at WashU, Lorrie was instrumental in helping me understand the importance of concise and precise technical writing. I would also like to thank her for improving Chuck’s life by orders of magnitude and making him a bit more adventurous.
Zubin – the Zman – has been a true friend over the years. It took me a long time to get used to his strange sense of humor; he would walk into my office when I was preparing for an exam and (instead of wishing me good luck) tell me, “Milind, you are going to fail this exam. Yes! I hope it is so tough that you can’t answer a single question.” I shared six great years of varied experiences with the Zman and his hyper dog “Nimon.” Over these years, various facets of his personality, such as his ADD, his affinity for lemon-cars and drinks, his hyper-focused, efficient ways of working, his foot-in-the-mouth style of sharing opinions, and above all a superlative intellect and mind, have always impressed me. Thanks to Zubin for a great time and those innumerable rides to the airport and to Dobbs Auto Service. Christos, a consummate Greek gentleman and a master auto mechanic, has always amazed me with his calm way of dealing with any matter (maybe it has something to do with his army training!). He was a very nice roommate during my three months of internship at NEC, Princeton, NJ. I will always cherish my forays into “fix-your-car-yourself” science with his help and my time at the gym with him. My thanks to Christos for always sharing the awesome home-made Greek goodies from Cyprus and for being a good friend. Hari – my next door officemate – has been a great companion over the years during my late night stints in the lab. Hari’s stories of his well-rounded multi-course meals (starting with “broth” and ending with half a ton of ice cream), his Saab (sob) stories, and his entertaining observations on topics ranging from Dawn, the Pakistani newspaper, to Coco, the talking chimpanzee, never failed to cheer me up. A well-read gentleman, Hari was instrumental in improving the quality of life in the back hall. Sadly, after his recent betrothal, he has become an elusive character.
Anshul Kantawala has always been there to share his peculiar “fire-brand” opinions on all matters ranging from Indian politics to American football. Unfortunately, after his marriage, thanks to Neepa, he has mellowed, but he still continues to be a fun guy. Gopal was always a reliable source of intelligent humor, and his “Optimized-Desi-Behavior (ODB)” always made me think twice before indulging in any shopping. I would like to thank him for his help with the use of his RTU invention. It was great fun collaborating with him on the NOSSDAV 96 paper. I would like to thank Dakang Wu and Xin Jane Chen for being great colleagues who worked tirelessly on Project MARS. Dakang’s initiative and ideas have been crucial to the experimental prototype. He was a very nice and patient person to work with. Jane did an early implementation of some of the NetBSD OS enhancements and was fun to work with. I would also like to thank Daphne Tong for participating in my project to develop new MoD services. She was a smart and pleasant colleague. John Dehart from ARL was very helpful throughout the project. He patiently provided me all the information about MMXes, helped me set up innumerable demos, and also
participated in demos without ever complaining. He was one of the nicest guys to work with. I would also like to thank Ken Wong for maintaining CCRC systems, and for believing that I am a “nice guy” (which I hope I am!). I am glad I never challenged him to a game of racquetball before I got to know that he was a Missouri State Champion in his young days. My sincere thanks to Andy Fingerhut, whose sharp intellect and craving for mathematical puzzles helped me simplify some of my analytical work. It was always great fun to chat with him about the WUGS switch, his dissertation, Linux, Joshua, and multitudes of other topics. Brad Noble was always there to ask me if I was putting in an “honest day’s work.” His natural curiosity, his abundant knowledge of everything from monster trucks to Indian cooking, and the “nice back-rubs” he gave me made him a true delight. I would like to thank him and his better half – Penny – for several parties and nice home cooked meals. I would also like to thank James, Sanjay, Greg Peterson, Rex Hill, Girish, Diana, Cheenu, Fred Kuhns, Tom Chaney, Prof. Maynard Engebretson, Shree, Rajeev (Bector), Rajib (Gosh), Ram Sethuraman, Nivi Engineer, and Amy Murphy for their support. Thanks also to a “real-nut-case Italian” – Guissepe Bianchi (Geppo) – for practical jokes, Marcel – the “Swiss guy” – for candid conversations, Maurizio for great Carbonara, and Dan – “the Swiss boy” – for a great time. I have to thank a lot of people other than the ones in CS and ARL who kept my social life exciting and helped me keep my sanity. First, I would like to thank Ritu Banerjee and her boyfriend Andrew Kloek for frequent dinners, and for providing a sympathetic ear to my woes. Anurag, Namita and Nimmi were great friends, and it was fun hanging out with them for innumerable dinners, parties, bowling and pool sessions. Mini (Mamata) Datta was a true friend. I will always remember her for her strength, determination, dedication to her family, and her true friendship.
Thanks also to Gayatri Joshi, Jay, Megan, and the Joshi family for making me feel a part of their world over the last year and a half of my stay in St. Louis. My sincere thanks to the members of the Association of Graduate Engineering Students who developed and maintained the LaTeX style files used to format this dissertation. Last but not least, I am grateful to my family. My elder brother, Mukund, has always been there with sound advice and has been a role model in academic matters. My sincere thanks to him and his better-half Madhavee for encouragement and support. I am immensely grateful to my parents, to whom this dissertation is dedicated, for their support all my life. Raising me and my brother in a less-than-privileged environment, they gave us the right values, and always stressed the importance of hard work, education, and humility. I would not be where I am today without their support and I am proud to honor them with my accomplishments. Finally, I am grateful to the Supreme-Being for all the blessings
throughout my life. Without his invisible hand guiding me in this world, I would be all lost.
Milind M. Buddhikot
Washington University in Saint Louis
August 1998
Chapter 1

Introduction

In recent years, we have witnessed an unprecedented rise in the popularity of the Internet and the World Wide Web (WWW). The number of hosts connected to the internet and the number of active web sites have grown exponentially and continue to grow at an impressive rate [4]. On-going rapid advances in optical communication, high speed packet switching, and wireless networking indicate that this trend will continue and that the world will become increasingly connected. The World Wide Web is already being successfully put to use for new multimedia applications in all walks of life. For example, Multimedia-On-Demand (MOD) has emerged as an important generic application with several specific instances: movies-on-demand, lectures-on-demand, internet shopping catalogs, news-on-demand, and electronic commerce (e-commerce) are some examples. Several rudimentary MOD applications that employ low bandwidth, low quality audio and video have already begun to appear on the internet. However, with improvements in compression and higher network speeds, the push for high quality, high bandwidth multimedia applications is getting stronger. Figure 1.1 illustrates this future scenario, wherein high performance multimedia applications will be run on diverse end-systems such as portable PCs, workstations, and TV sets with set-top boxes connected to a broadband internet with wired or wireless access. The three key components of these networked applications are: (1) the multimedia-on-demand service, (2) the server that provides such a service, and (3) the client device and application through which an end-user accesses the MOD service. Three years ago, when we initiated our research project, cost-effective, scalable, high bandwidth and high quality MOD services and servers accessible from a universal access interface at the clients did not exist. Therefore, the primary goal and motivation for our
[Figure 1.1: portable PDAs and PCs with wireless access, home TVs, and workstations connected through a high speed network to storage and compute servers delivering image, text, and video streams.]
work was to design and prototype such scalable MOD services, servers and client applications. We sought to achieve this goal by using emerging broadband networking technologies, commodity hardware/software components such as PCs, general purpose operating systems enhanced to handle multimedia, and web based clients. Given this approach, several choices and associated tradeoffs were available to us for each of the three components of the networked multimedia applications. We will first outline our choices to provide the context for our work.
1.1 MOD Environment and Choices

In the following, we discuss the choices available for the client, server, and network components of future networked multimedia applications, each of which leads to a different environment in which the research reported in this dissertation could have been conducted. Our aim here is to provide the rationale for the choices we made.
Multimedia Quality

High quality multimedia streams are data and bandwidth intensive. For example, a full resolution (640 × 480), uncompressed 24 bit/pixel, full rate (30 fps), 2-hr duration video stream requires 220 Mbps bandwidth and 100 GB of storage. Data compression and smaller resolutions reduce these requirements dramatically. The multimedia applications currently available on the internet support very low bit rates (under 100 kbps), low resolution (160 × 120), and poor quality (less than 10 fps) video/audio, and therefore offer an unsatisfactory multimedia experience. In contrast, our work focuses only on full size (640 × 480), full motion, 30 fps compressed video and preferably stereo quality audio. Our choice represents the standard VHS quality video/audio common in home entertainment systems. Anything less than this standard quality is not acceptable to the end-user, and is commercially irrelevant in the long term.
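The arithmetic behind the uncompressed-bandwidth figure is simple: width × height × bits-per-pixel × frame rate. A small sketch (the helper name is ours, not from the dissertation; the resolutions and frame rates are the figures quoted above):

```python
# Sketch of the raw-bandwidth arithmetic quoted above; the resolutions and
# frame rates come from the text, the helper name is ours.

def raw_video_bandwidth_bps(width, height, bits_per_pixel, fps):
    """Bandwidth of an uncompressed video stream, in bits per second."""
    return width * height * bits_per_pixel * fps

# Full resolution, full rate, 24 bit/pixel video:
full = raw_video_bandwidth_bps(640, 480, 24, 30)
print(round(full / 1e6))      # 221 (Mbps), i.e. the ~220 Mbps figure above

# The low-end streams common on the 1998 internet, before compression:
low = raw_video_bandwidth_bps(160, 120, 24, 10)
print(round(low / 1e6))       # ~5 Mbps raw; codecs bring this under 100 kbps
```

Compression does the rest of the work: the same full-rate stream squeezed into MPEG-2 at NTSC quality needs only about 5 Mbps, roughly a 40:1 reduction.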
MOD services It is necessary that MOD services be universally accessible from diverse client devices such as a web TV unit, a desktop workstation, or a laptop. Also, it is desirable that the access interface for these services be an easy-to-use and familiar interface – namely the WWW interface. Two key and minimal MOD services that are desirable are an interactive recording service and a fully interactive playback service. The recording service must allow the client to interactively record, playback and publish on the web content from different video-audio devices such as a VCR, a laser disk, or a camera. The playback service should allow the client to playback previously recorded and published content to its local device. In addition to the regular playback, the client must be able to exercise full playout control - specifically be able to fast-forward, rewind, pause, and even perform random search in the multimedia document. Examples of several other complex MOD services are multimedia-document composition, content based search, media transcoding, and orchestrated interactive presentations. It is desirable that these complex services be built using the basic recording and playback services. In our work we focus on interactive recording and playback services accessed using the WWW interface.
Client

The client device used to access the MOD services from a server has significant implications for the design of the entire end-to-end system. It is desirable that the multimedia device used by the client be able to play back stored as well as live compressed video/audio streams. Also, for applications such as teleconferencing that will become commonplace in the near future, the device should be capable of real-time compression and transmission of audio/video data from local sources such as cameras, laser disks, and VCRs. Though several audio/video software codecs are already commercially available, the quality they support is far less than the desirable full resolution (640 × 480), 30 fps video and stereo audio. We believe that, at least for the short-term future, multimedia devices capable of such high quality audio/video will continue to be hardware devices. Therefore, in this dissertation, we use a locally available multimedia device called Multimedia Explorer (MMX), representative of such high performance devices. The MMX connects to a broadband ATM network and supports full rate, full duplex 30 fps video and stereo audio. Its lack of large playout buffers resembles the most demanding set-top box based clients of MOD services. The choice of this device was crucial to demonstrating truly high bandwidth MOD services in our work. Our design choices and implementation are significantly dictated by the characteristics of this device.
Servers

The MOD servers in the future internet may range from a small scale department server that serves an intranet to a large scale neighborhood movie server. The following three parameters determine the size of such servers:

1. Maximum number of clients: The number of concurrent clients at an MOD server may range from a few tens to a few thousand. Each of these clients can open multiple sessions with a number of active media streams, and each may independently access the same or different data.

2. Storage capacity: Given the storage intensive nature of multimedia data, the server storage requirements may range from a few gigabytes to a few terabytes. For example, a movie server with 1000 NTSC quality 5 Mbps MPEG2 2-hour-long movies will require roughly 4.5 terabytes (TB) of storage [118, 126].

3. Network and storage system bandwidth: The data intensive and periodic nature of multimedia streams demands large amounts of network and storage system throughput. Also, the periodic nature of the multimedia data necessitates QOS guarantees in the form of guaranteed throughput and bounded latency for all active streams. This requires that the MOD server provide large aggregate throughput and per-session real-time service guarantees. For example, a movie server that supports one thousand users with NTSC MPEG2 quality movies requires network and storage bandwidth in excess of 5 Gbps. Such a storage throughput requirement is two orders of magnitude higher than the maximum throughput of state-of-the-art storage technologies such as Redundant Arrays of Inexpensive Disks (RAIDs).

In our work, we aim to devise a storage server architecture that scales from very small servers with a throughput of a few hundred Mbps to large scale servers that offer a few tens of Gbps. We want to make this architecture cost-effective for it to be economically viable. Specifically, we want the cost per client to be constant and the cost per byte of storage to be comparable to that of present day network based servers.
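The storage and bandwidth figures in items 2 and 3 follow from simple arithmetic, sketched below (helper names are ours; the input figures are the ones from the text):

```python
# Sizing arithmetic for the example movie server above; an illustrative
# sketch, with helper names of our choosing.

MBPS = 1e6

def storage_bytes(num_movies, rate_bps, duration_s):
    """Total storage for a library of constant-bit-rate movies."""
    return num_movies * (rate_bps / 8) * duration_s

def aggregate_bandwidth_bps(num_clients, rate_bps):
    """Worst-case throughput when every client receives an independent stream."""
    return num_clients * rate_bps

# 1000 movies, 5 Mbps MPEG-2, 2 hours each:
library = storage_bytes(1000, 5 * MBPS, 2 * 3600)
print(library / 1e12)        # 4.5 (TB), matching the figure quoted above

# 1000 concurrent independent viewers of those streams:
peak = aggregate_bandwidth_bps(1000, 5 * MBPS)
print(peak / 1e9)            # 5.0 (Gbps) of network and storage throughput
```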
Server Operating System

Existing network based storage servers commonly use variants of the UNIX OS, such as HP-UX, IRIX, AIX, and Digital UNIX, or the more recent Windows NT operating system. These general purpose operating systems are inadequate for multimedia storage servers due to several drawbacks, such as random placement of data, mixing of meta-data and data on the storage devices, inefficient disk-to-network data paths, and the lack of real-time I/O and CPU scheduling support. Use of a real-time OS may seem an easy solution to the problem of providing QOS guarantees in an MOD server. However, commercially available real-time OSes, such as VxWorks, pSOS, LynxOS, and QNX, run on embedded systems and have traditionally been optimized to provide strict real-time CPU access guarantees to process-control and mission-critical tasks. They are not optimized for storage and network I/O guarantees. Clearly, general purpose as well as real-time OSes require changes to efficiently support multimedia data. We believe that a real-time OS is unnecessary to support multimedia and that a suitably enhanced general purpose OS can effectively support the co-existence of real-time (multimedia computing) and non-real-time (general purpose computing) tasks. We selected 4.4 BSD UNIX – a modern, well researched and extensively documented OS – as our candidate server OS. Since complete availability of source code was necessary to implement new OS enhancements, we selected NetBSD UNIX, a public domain version of the 4.4 BSD UNIX OS ported to multiple CPU architectures.
Networking technology

Emerging integrated services networks are aimed at satisfying the real-time transport requirements of multimedia data. For example, asynchronous transfer mode (ATM) networks allow a user to set up a virtual circuit (VC) between a source and a destination to transport a media stream [84]. The bandwidth and delay of the VC can be chosen depending on the requirements of the particular media. These parameters determine the quality of service (QoS) provided to the VC. Once a VC is set up, its QoS parameters are guaranteed by the network until the VC is released. These two features together preserve the continuous nature of media streams as they traverse the network. QoS guarantees can also be provided in datagram networks by building new functionality such as weighted fair queuing into routers [43]. New IP routers provide such support and make it feasible for IP networks to support QoS. However, unlike existing IP networks, present day ATM networks already provide support for QoS and are more amenable to deploying experimental high bandwidth services. Also, we already have an extensive ATM infrastructure in our environment. Therefore, the work reported in this dissertation is carried out in the context of an ATM network connection between the MOD clients and servers.
1.2 Challenges

Our choice of the MMX client device, ATM networking technology, and the NetBSD UNIX server OS, and our goal to devise a new scalable server architecture, pose several challenges that we address in our research.

Using the web framework: The multimedia data path from the server to the MMX client device is a very high bandwidth data path. In order to achieve high performance, this data path must be separate from the control path. Also, to allow use of the web as a control path and as an access interface for any MOD service, such separation must be feasible for simple as well as very complex MOD services. The key challenge we faced is how to use or enhance the HTTP framework to achieve this objective.

Server OS extensions: The inability of the MMX to handle lossy data streams and its lack of large playout buffers require that the server OS and the intermediate network pace the data at a regular rate to avoid buffer overflows. This in turn requires that, within the server, the MOD recording and playback services provide good soft-real-time guarantees. Specifically, these services must be guaranteed periodic access to CPU and storage resources. The challenge we faced was how to enhance the existing UNIX OS CPU scheduling and the storage and network I/O to provide such QOS guarantees and an efficient disk-to-network data path. The choice of ATM networks instead of IP networks requires that the server OS provide efficient support for ATM protocols, namely the AAL0 and AAL5 adaptation layers, and support appropriate network signaling.

Scalable MOD: The scalable MOD server architecture we want to devise and prototype must use off-the-shelf hardware and software components. Use of hardware components such as SCSI storage controllers, the PCI I/O interconnect, standard network interfaces, and commodity CPUs guarantees that the architecture remains viable with advances in technology and also makes it cost-effective. Similarly, use of off-the-shelf software components such as the enhanced NetBSD UNIX and public domain web servers guarantees flexibility and extensibility. Another challenge we face is how to minimize data replication to minimize storage costs while achieving high parallelism, concurrency and scalability in data accesses.

In the following, we provide a brief overview of our solutions that address the above challenges.
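The pacing requirement in the server OS challenge can be made concrete: a constant-bit-rate stream delivered in fixed-size chunks must release one chunk every chunk_bytes × 8 / rate seconds. A small sketch (the helper name is ours; this is an illustration of the requirement, not the server's actual mechanism):

```python
# Illustrative sketch of constant-rate pacing toward a client with small
# playout buffers; helper name is ours, not from the MARS implementation.

def pace_schedule(chunk_bytes, num_chunks, rate_bps, start=0.0):
    """Return the send time of each fixed-size chunk for a smooth stream."""
    interval = chunk_bytes * 8 / rate_bps        # seconds between chunk sends
    return [start + k * interval for k in range(num_chunks)]

# A 5 Mbps stream sent in 64 KB chunks needs one chunk roughly every 105 ms;
# sending earlier overflows the client buffer, later underflows the display:
times = pace_schedule(64 * 1024, 4, 5e6)
print([round(t, 3) for t in times])   # [0.0, 0.105, 0.21, 0.315]
```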
1.3 Overview of Solutions

This dissertation describes a cost-effective design and implementation of scalable, web based, high performance multimedia-on-demand (MOD) servers and services. An important aspect of this dissertation has been prototyping and deploying MOD applications, services and servers, and learning from this experience. The three main components of this dissertation are: (1) web based interactive MOD services, (2) innovative enhancements to a server node operating system (OS) to support such services, and (3) the design and prototyping of a scalable server architecture and associated data layout and scheduling schemes to support a large number of independent, concurrent clients.
MOD Services: Design and prototyping

Key aspects of the design of our interactive recording and playback MOD services are: (1) a separation of the bandwidth intensive data path from the low overhead control path, and (2) the use of the WWW to implement the control path and serve as a universal access interface. We conclusively demonstrate that such partitioning works very well at the server and the client by actual prototyping of example services. We also demonstrate that periodic access to CPU and storage resources within the server is crucial to provide QOS guarantees on the bandwidth intensive data path.
Server OS Enhancements

To rectify the shortcomings of 4.4 BSD UNIX in providing QOS guarantees for periodic storage and CPU accesses and its lack of efficient disk-to-network data paths, we have designed and prototyped the following innovative OS enhancements: (1) We used a novel CPU scheduling mechanism called Real-Time-Upcalls (RTU) [52] to provide guaranteed application level CPU access. (2) We implemented Deficit-Round-Robin (DRR) [104] fair queuing over multiple priority queues within the SCSI storage driver to provide fair, guaranteed access to storage bandwidth. (3) We designed and prototyped a new buffering system called Multimedia Buffers (mmbufs) to provide zero copy data transfers between the storage and the network subsystems. (4) We provided a new system call interface, called the stream API, that allows user applications to access these OS enhancements. This API allows aggregation of multiple stream read-send requests into a single system call, thus minimizing the system call overheads. Our measurements clearly demonstrate up to 40% throughput improvements using the mmbuf system and the stream API on fast storage systems. We also demonstrated efficient sharing of storage bandwidth between real-time and non-real-time requests [26, 27]. We prototyped a MOD server that uses the enhanced OS and comfortably supports an aggregate network and storage throughput of 75 Mbps.¹
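The DRR discipline used in item (2) can be sketched in a few lines. The following is an illustrative user-level model of the algorithm, not the actual SCSI-driver code; class and method names are ours:

```python
from collections import deque

class DRRScheduler:
    """Minimal Deficit-Round-Robin over multiple request queues.

    Each queue receives a quantum of 'service bytes' per round; unused
    service carries over as a deficit only while the queue is backlogged.
    """

    def __init__(self, quanta):
        self.queues = [deque() for _ in quanta]    # one FIFO per class
        self.quanta = list(quanta)                 # service bytes per round
        self.deficits = [0] * len(quanta)

    def enqueue(self, klass, request_bytes):
        self.queues[klass].append(request_bytes)

    def next_round(self):
        """Serve one DRR round; return the list of (class, bytes) served."""
        served = []
        for i, q in enumerate(self.queues):
            if not q:
                self.deficits[i] = 0               # idle queues keep no credit
                continue
            self.deficits[i] += self.quanta[i]
            while q and q[0] <= self.deficits[i]:
                size = q.popleft()
                self.deficits[i] -= size
                served.append((i, size))
            if not q:
                self.deficits[i] = 0               # drained: drop leftover credit
        return served

# Real-time class 0 gets twice the per-round quantum of best-effort class 1,
# so it receives twice the storage bandwidth under backlog:
sched = DRRScheduler(quanta=[128, 64])
for size in (64, 64, 64):
    sched.enqueue(0, size)
sched.enqueue(1, 64)
print(sched.next_round())    # [(0, 64), (0, 64), (1, 64)]
```

The quanta ratio directly sets the bandwidth share between real-time and non-real-time request classes, which is the sharing property the measurements above demonstrate.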
Scalable MOD servers and services

To tackle the problem of scalability, we proposed and prototyped a distributed storage architecture consisting of a cluster of multiple storage nodes in the form of inexpensive PCs interconnected by an ATM cell switched interconnect [22, 23]. Each node in this distributed architecture runs the OS with the extensions described above. We developed a family of distributed data layouts that stripe data over the storage nodes using constant-time-length chunks to support high parallelism and concurrency. We analyzed interesting properties of these layouts that guarantee load-balanced operation of the cluster in the presence of a large number of concurrent interactive sessions. We also designed and implemented distributed scheduling techniques to support synchronized playback of striped data to the clients of such a distributed storage server [24, 25, 29]. We demonstrated that with sufficient client side buffers, these simple scheduling techniques yield very good quality multimedia playback.

¹The present ATM interface used in the prototype limits the number of sessions to 7. This is not an inherent limitation of the system; with better ATM interfaces this metric will improve.
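The simplest member of the family of striped layouts is round-robin placement of constant-time-length chunks. The sketch below illustrates the idea under that assumption (the helper name and the plain (start + k) mod N rule are ours; the actual MARS layouts are more general):

```python
# Minimal sketch of chunk-granularity round-robin striping across storage
# nodes; an illustration of the idea, not the exact MARS layout family.

def chunk_location(start_node, chunk_index, num_nodes):
    """Storage node holding the chunk_index-th constant-time-length chunk."""
    return (start_node + chunk_index) % num_nodes

# Consecutive chunks of one document visit every node in turn, so a single
# playback session spreads its load over the whole 4-node cluster:
print([chunk_location(0, k, 4) for k in range(8)])   # [0, 1, 2, 3, 0, 1, 2, 3]

# Sessions that start on different nodes stay out of phase with each other,
# which is the intuition behind the load-balance properties analyzed here:
print([chunk_location(2, k, 4) for k in range(4)])   # [2, 3, 0, 1]
```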
1.4 Summary

This dissertation is concerned with the design and prototyping of scalable, cost-effective, high performance multimedia-on-demand (MOD) services and servers. It demonstrates that MOD services that use the web for control operations and support a data path optimized for high performance and QOS guarantees can be built using off-the-shelf components. Specifically, our claim in this dissertation is that high performance MOD servers and services can be built in a cost-effective manner using emerging broadband networking technologies such as ATM, commodity hardware/software components such as PCs, and a UNIX operating system enhanced to handle multimedia. Also, we claim that massively-parallel processors and symmetric multiprocessors are expensive solutions to the problem of scalability of MOD servers. Instead, clusters of PCs interconnected using a system- or desk-area network that uses the same technology as the external network represent a scalable, inexpensive alternative for building large scale servers and services. The research reported in this dissertation aims to establish these claims.
Chapter 2

Essential Background

This chapter presents the essential background on MOD services, servers and clients needed to motivate the research issues addressed in this thesis. It first discusses the properties of requests received by a MOD server and their implications for the definition of MOD playback services. It then presents various application scenarios in which MOD services are accessed and the properties of the clients which access these services. Finally, it discusses the background on MOD servers. Specifically, it presents the hierarchical network model that necessitates a geographical hierarchy of MOD servers of different scales interconnected by high speed networks such as ATM. The last section also presents a taxonomy and a list of performance metrics for MOD servers.
2.1 Multimedia-On-Demand Services

The MOD server should provide multimedia content creation and multimedia content access/playback services to its clients. Three simple examples of basic content creation services are Interactive Recording, Multimedia Composition, and Media Transcoding. An interactive recording service allows an end-user to record a multimedia document, consisting of audio and video streams originating at the user’s device such as a VCR, on to a remote server. The multimedia composition service allows the user to create complex documents by editing pre-existing multimedia documents recorded using the simple recording service. Quite often, a user may record video/audio documents in an uncompressed format or in compression formats such as MJPEG that yield high quality but require high bandwidth and disk space. The media transcoding service allows the client to request conversion of such documents to a more heavily compressed format that saves disk space and bandwidth. For example, a high quality full resolution video/audio document in
MJPEG format that requires 20 Mbps bandwidth and 18 GB of disk space can be converted to an MPEG2 format, which supports better compression and therefore requires one-fourth the bandwidth and disk space. Normally, content creation from different clients will be independent of each other, and therefore a client of a content creation service can be allocated a dedicated session. More complex scenarios in which several end-users remotely collaborate to create content violate this assumption. However, we do not consider such scenarios in this dissertation, and focus mainly on the basic service – the interactive recording service. The content access service provided by the storage server allows an end user to playback content published by potentially every user of the content creation services. There are several types of such playback services, each of which provides the user with a varying amount of interactive control over the multimedia playback and imposes different requirements on the server and the network. Before we describe these types, it is important to understand the spatial and the temporal locality properties of playback requests from the users that give rise to these types.
2.1.1 Properties of the Request Arrival Process

Figure 2.1 shows the model of a MOD server in a MOD retrieval environment. In this model, each MOD server has N multimedia files, each of which has two attributes associated with it. The first attribute, the duration Dk of the k-th multimedia file, is the time required to play the file at the standard playout rate, whereas the demand Qk for the k-th multimedia file is the number of simultaneous connections the server can support for the file. The clients access multimedia files by sending requests to the server. The server performs admission control which, based on the existing load and resource usage, admits or rejects the client request. The request arrival process at the MOD server may exhibit spatial and temporal locality, as described below.
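The admission control step can be sketched as a simple bandwidth budget. This is a toy model with names of our choosing; a real server's admission test also weighs CPU, buffer, and disk resources, which this sketch ignores:

```python
# Toy bandwidth-budget admission controller; an illustration of the
# admission step described above, not the actual server's test.

class AdmissionController:
    def __init__(self, capacity_bps):
        self.capacity = capacity_bps
        self.allocated = 0.0

    def admit(self, stream_rate_bps):
        """Admit a request only if capacity will not be oversubscribed."""
        if self.allocated + stream_rate_bps <= self.capacity:
            self.allocated += stream_rate_bps
            return True
        return False

    def release(self, stream_rate_bps):
        """Return bandwidth to the budget when a session ends."""
        self.allocated -= stream_rate_bps

# A 100 Mbps server can carry exactly twenty 5 Mbps streams; requests
# beyond that are rejected rather than degrading the admitted sessions:
ac = AdmissionController(100e6)
admitted = sum(ac.admit(5e6) for _ in range(25))
print(admitted)    # 20
```

Rejecting the 21st request outright, rather than admitting it and letting all streams jitter, is what makes the per-session QOS guarantees of the admitted streams possible.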
[Figure 2.1: MOD server request-response model – clients send requests (Req1, Req2, …, ReqN) across the network to a MOD server holding N multimedia files, which returns the corresponding data streams.]
Spatial Locality: The spatial locality property signifies that some documents are likely to be accessed more frequently than others. For example, in the case of an on-demand movie application, a large fraction of the requests received are likely to be for popular and recently released movies, and relatively few for old movies. Empirically, the arrival process at a movie rental shop can be modelled using Zipf’s law [36]. Similarly, in the case of a tele-shopping mall, some stores would be more popular than others and would therefore receive more requests. In the case of a digital lecture archive, lectures on difficult topics and those given by laconic faculty members are likely to be accessed more often. However, some MOD applications may exhibit less spatial locality than others; for example, all contemporary publications in an on-demand digital library or radiological movies in a patient information database are likely to be uniformly accessed.

Temporal Locality: Depending on the application, the request arrival process at a MOD server can be temporally clustered to various extents. For example, in the case of on-demand movies, the request rate for a set of popular movies would be higher during the evening of a weekend than on weekdays, and more requests would be received between 6 and 9 pm than during any other period of the day. On the other hand, in the case of a digital library or a radiological database, the request arrivals would be Poisson or slightly bursty.
In general, spatial locality is exploited in the placement of multimedia data in a storage hierarchy consisting of different storage types: clearly, frequently requested data should be assigned to faster and more expensive storage devices. The temporal locality property is important in deciding the service and network model for a server.
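The skew that Zipf's law implies can be made concrete with a few lines. The sketch below uses the textbook 1/rank form of the law (the empirical fit cited above [36] may use a different exponent), with a helper name of our choosing:

```python
# Sketch of Zipf-distributed document popularity; the 1/rank form is the
# textbook version, and the helper name is ours.

def zipf_probabilities(n):
    """Request probability of the i-th most popular of n documents."""
    weights = [1.0 / rank for rank in range(1, n + 1)]
    total = sum(weights)
    return [w / total for w in weights]

probs = zipf_probabilities(100)

# The skew that drives storage-hierarchy placement: a handful of popular
# documents absorb a large share of all requests.
top10_share = sum(probs[:10])
print(round(top10_share, 2))    # 0.56: the top 10% of titles draw over half
```

This is exactly why placing the few hottest documents on the fastest storage tier pays off so disproportionately.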
2.1.2 MOD Playback Service Types

The multimedia-on-demand playback services can be classified into the following three types:

Pay-Per-View (PPV) service: Analogous to the existing PPV channels on cable networks, the time of access for a multimedia document in this service is decided by the server rather than the client. The MOD server multicasts the multimedia program at fixed times, and any client interested in receiving the program tunes in to the server at those fixed times. Clearly, a client of such a service cannot control the stream playout and also has minimal freedom as to when the multimedia stream will be available. Implementing such a server hardly presents any technological challenges.
[Figure 2.2: Shared Viewing with Constraints (SVC) – requests for a 120-minute movie X are clustered into 10-minute windows; clusters 1 through 12 are each served by a single multicast retrieval.]

Near Video-on-Demand (NVD): The Near Video-on-Demand service, sometimes called Shared Viewing with Constraints (SVC), exploits spatial and temporal locality in request arrivals. The server processes and accepts user requests in groups to exploit the clustering properties of the arrival process. A new client request may face a variable admission latency, after which the client becomes part of a multicast group, all members of which are connected to the server by a single multicast connection. Consider the example of on-demand movies with a typical movie of 120 minute duration. As shown in Figure 2.2, the server may decide to cluster requests received in a 10 minute window and service all such clients as a single multicast group. Thus, in the worst case, a client will face an admission latency of 10 minutes. Also, the maximum number of retrievals required at any given time will be 12, unlike only one in the PPV service. One advantage of this scheme is that the movie can be divided into twelve independent units and stored on separate storage nodes, each of which can service a multicast group at a given instant as the movie playout progresses. For example, for clients belonging to cluster 1, the first 10 minutes of the movie are played out from one storage node, and at the end of this duration, the cluster is switched to another storage node for the next 10 minutes of the movie. In this simple scheme, under busy conditions there can be as many as 12 multicast groups, each being serviced by one storage node and switched to an appropriate storage node at the end of the ten minute duration. However, any kind of interactivity is clumsy to support. First of all, excessive interactivity such as rewind and fast forward requires frequent switching of clients to different multicast groups. Also, given that all the clients that are
grouped and serviced together have the same ‘view’ of the movie, any fast forward/rewind by any member will alter this ‘view’ for all members of the group, which is undesirable. If fast forward and rewind from each member are to be treated independently, multicast cannot be used to realize any bandwidth saving. The NVD service represents an incremental improvement over the PPV service, as it allows users to access the multimedia documents at arbitrary instants, unlike the fixed instants in the PPV service, and for this reason it is sometimes called near-video-on-demand. It has also been called a Periodic Broadcast Service in the literature [41, 42, 65]. Several newer schemes, such as Pyramid broadcasting [121] and Skyscraper broadcasting [65], attempt to improve the basic scheme described above to provide better access latency. The key advantage of these schemes is that they make the storage and bandwidth requirements for the document being broadcast independent of the number of clients accessing it, and provide a tradeoff between access latency and the amount of bandwidth dedicated to the broadcast. However, they are suitable only for popular documents which are being continuously requested by a dense user population. Clearly, for applications for which request clustering is difficult and interactivity is important, these services cannot be used.
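The batching arithmetic behind NVD (120-minute movie, 10-minute clustering window) is simple enough to make explicit. The sketch below is illustrative only; the function name and the second example window are our own:

```python
# Sketch of Near Video-on-Demand (NVD) batching arithmetic, using the
# text's example of a 120-minute movie and a 10-minute clustering window.
# The function name is illustrative, not from the dissertation.

def nvd_batching(movie_min, window_min):
    """Return (worst-case admission latency in minutes,
    maximum number of concurrent multicast groups/retrievals)."""
    # A request arriving just after a group forms waits one full window.
    worst_latency = window_min
    # At steady state, one multicast group is active per window-sized
    # segment of the movie.
    max_groups = -(-movie_min // window_min)  # ceiling division
    return worst_latency, max_groups

print(nvd_batching(120, 10))  # (10, 12): 10-minute wait, 12 retrievals
```

Widening the window trades latency for retrievals: a 30-minute window cuts the concurrent retrievals to 4 but triples the worst-case wait.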
True-Multimedia-On-Demand (T-MOD):

Figure 2.3: MOD server for a Dedicated Viewing service (DV)
Figure 2.3 illustrates this service model, in which each access request from the same or different clients is treated independently at the server. For example, in Figure 2.3, even if two requests received at times t1 and t2 are close in time and are for the same video sequence X, the server, if it admits them, will perform separate retrievals and transmissions for these
requests. Due to this independence of requests, the T-MOD service model is sometimes referred to as Dedicated Viewing (DV). Consider the example of a neighborhood on-demand movie server connected to a head-end switch that serves a maximum of 1000 users (households): all 1000 users may watch the same movie independently using the DV service, which would require 1000 independent retrievals and deliveries. The primary advantage of this service is that it is a natural paradigm for personalized, interactive multimedia delivery, as the streams received by multiple clients are independent and do not have any temporal relationship between them. Clearly, it is an appropriate service model for future distributed multimedia on-demand applications that emphasize personalization and interactivity. However, the T-MOD service has serious implications for throughput. Consider a storage server serving HDTV quality movies to 200 customers. Each HDTV stream requires 20 Mbps on average, so in the worst case the aggregate network and storage throughput requirement is in excess of 4 Gbps. Designing servers that can offer such a service in a scalable fashion is clearly a challenging task. In our work, we concentrate on building the most challenging of the three content playback services, namely the True Multimedia-On-Demand service. We support basic interactive operations such as fast-forward, rewind, slow-play, pause/resume and random search. Using these basic operations, more complex operations such as content based searches can be easily supported. In the following, we discuss the semantics of these operations, which influence the server design.
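The worst-case throughput under Dedicated Viewing is just a product, but it is worth making explicit; the numbers follow the HDTV example in the text:

```python
# Worst-case aggregate throughput under True-MOD (Dedicated Viewing):
# every admitted client gets an independent retrieval and delivery,
# so the requirement is simply clients times per-stream rate.

def tmod_aggregate_gbps(clients, per_stream_mbps):
    return clients * per_stream_mbps / 1000.0

# 200 clients, each an HDTV stream averaging 20 Mbps.
print(tmod_aggregate_gbps(200, 20))  # 4.0 Gbps of storage and network BW
```

The same formula with 1000 clients at 5 Mbps NTSC/MPEG streams gives the 5 Gbps scale that reappears in the server taxonomy later in this chapter.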
2.1.3 Semantics of Interactive Control for Playback Services

In order to understand the implications of ff and rw operations on data layout and scheduling in an MOD server, it is necessary to first understand the various ways of implementing the playout control operations. The two ways of implementing these operations are as follows: Rate Variation Scheme (RVS): In this scheme, the operation changes the rate of display at the client and hence the rate of data retrieval and transmission at the server. The performance of a large-scale server using such a scheme has been analyzed in [44]. Sequence Variation Scheme (SVS): In this scheme, the operation changes the sequence of frame display and hence the sequence of data retrieval and transmission at the server. The display rate at the client side is unaltered, but the retrieval rate and the transmission rate at the server may be affected.
As an example, consider the fast forward operation. In the implementation using RVS, the display rate at the client terminal is increased to give the user a perception of fast forward. For example, a video stream may be played at 90 frames/sec (fps) instead of the standard 30 fps. On the other hand, in the case of SVS, irrespective of whether a video stream is in normal play mode or ff mode, the display rate is always 30 fps. The perception of fast forward is achieved by displaying an altered frame sequence, e.g., playing only every alternate frame, every 5th frame, or in general, every dth frame (d = fast forward distance). However, the RVS implementation of ff and rw has several significant drawbacks: 1. Increased network and storage BW requirement: The RVS approach increases the resource requirements at the server in the form of increased buffering and increased storage and network BW. Since interactive behavior is typically unpredictable, any deterministic guarantees for interactive operations would require the server to be highly overengineered. Also, even if the server makes n (fast forward factor) times the normal playout BW available for fast forward, the network may not be able to support such increases in BW requirement without high cost and/or significant blocking. 2. Inappropriate for real-time decoders: Most decompression engines at the client handle real-time decoding of incoming data at a rate smaller than or equal to a maximum frame rate. Thus, an MPEG decoder can decode at most 30 fps. Therefore, any attempt to increase the decoding/display rate by increasing the data rate will not work. In other words, it does not help to use complex (presumably intelligent) algorithms at the server to send data at a higher rate if the client cannot handle it. 3. Increased buffer requirement at the client: Given the cost factor, a typical real-time MPEG decoder has minimal (3 frames' worth) buffers.
This implies that data coming in at n times the normal playout rate cannot be buffered at the client and will be dropped, wasting all the “good” work the server and network did to transport it to the client. For these compelling reasons, we choose to implement fast-forward (ff) and rewind (rw) using the SVS approach. We distinguish fast-play (fp) and ff as two different operations and implement fp using the RVS approach. A similar distinction can be made between fast-rewind (fr) and rw. Typically, fast-forward/rewind will be supported for all active connections, whereas fast-play/fast-rewind will be special operations, subject to resource availability within the server and the network. Note that operations such as slow play
and slow rewind can only be implemented using the RVS scheme. However, these operations reduce resource usage and hence are easier to implement. Also, operations such as pause, frame-advance and stop-and-return do not fall under any particular classification and are easy to implement.
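The SVS behavior described above can be sketched as a frame-selection rule: the server alters which frames it sends while the client keeps decoding at 30 fps. The function below is a minimal sketch under that assumption; the names and the clamping detail are ours, not the dissertation's:

```python
# Sketch of the Sequence Variation Scheme (SVS): interactive operations
# alter the frame *sequence* sent to the client while the display rate
# stays at 30 fps. Function name and arguments are hypothetical.

def svs_sequence(current, d, count, total_frames):
    """Frame indices transmitted during fast forward (d > 0) or rewind
    (d < 0) with fast-forward distance |d|, clamped to the document."""
    frames = []
    f = current
    for _ in range(count):
        if not 0 <= f < total_frames:
            break          # stop at the start or end of the document
        frames.append(f)
        f += d
    return frames

# Fast forward from frame 0 with distance d = 5: every 5th frame.
print(svs_sequence(0, 5, 6, 10_000))     # [0, 5, 10, 15, 20, 25]
# Rewind from frame 100 with distance 2.
print(svs_sequence(100, -2, 4, 10_000))  # [100, 98, 96, 94]
```

Note that for d frames skipped, the server still retrieves and transmits only 30 frames per second, which is why SVS avoids the bandwidth inflation of RVS.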
2.2 Clients of Multimedia-On-Demand Services
Figure 2.4: A World-Wide-Web based Multimedia-on-demand (MOD) application

Two common emerging scenarios in which future MOD applications will run are illustrated in Figures 2.4 and 2.5. In the first scenario, popularly known as the World Wide Web, millions of computers, interconnected into a tangled web by the ever growing Internet, act as information servers and/or clients. The client is typically a browser program such as Netscape or Xmosaic with a point-and-click interface. These clients access multimedia information in the form of text, images, audio, and video from web servers in the Internet via a special protocol called the Hyper Text Transfer Protocol (HTTP), which employs the TCP/IP protocol stack for communication. In the second scenario, commonly termed Video-on-demand, as conceived by telephone companies, cable operators, and content providers like Time-Warner, the client accesses the multimedia data using a set-top box, which replaces the VCR, and a remote control device used to communicate control commands. The access interface used in this scenario can be very similar to the universal WWW interface. This infrastructure primarily suits on-demand entertainment programs, personalized news, home shopping, travel information, and network based games.
Figure 2.5: A typical Video-on-demand (MOD) application

The client end-systems in these two scenarios differ in capabilities such as the amount of local buffering, compute power, and the availability of specialized media processing. Figure 2.6 illustrates the two possible buffer scenarios at the client. Typically, a Bufferless Client has only the few frames' worth of buffers required by the decompression hardware. For example, a typical MPEG decoder may have buffers for at least three frames – the I and P anchor frames required to decode B frames, and the frame buffer. A hand-held PDA or a set-top box will have such small local buffers.
Figure 2.6: Buffer scenarios at the client
On the other hand, a Buffered Client, in addition to the decoder buffer, has a buffer large enough to store several (100 to 200, i.e., approximately a few seconds' worth of) frames. The availability of such a buffer makes network delay and jitter a non-issue, but requires buffer management on the part of the client. Common workstations and PCs that implement software decompression or contain hardware support for decompression fall in this category. A bufferless client represents the most challenging client device that the MOD services may have to support. The lack of buffers at such a client requires that the server and the network pace data in a very steady fashion. This in turn requires that the network interface hardware and the OS software in the MOD server be able to retrieve and transmit data with strict real-time guarantees. Also, the lack of compute power and user programmability in such devices makes error correction and concealment very difficult. In our research, we demonstrate MOD services that are accessed using such a bufferless client called the MultiMedia Explorer (MMX).
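The gap between the two client classes can be made concrete with a rough sizing calculation: the playout time stored in the buffer bounds how much network delay variation the client can absorb before underflow. The 30 fps rate and frame counts follow the text; the helper name is ours:

```python
# Rough client-buffer sizing: how much playout time (and hence network
# delay/jitter) a client-side buffer of n frames can absorb at 30 fps.
# This is a back-of-the-envelope sketch, not a model from the dissertation.

def buffer_playout_seconds(frames, fps=30):
    return frames / fps

# A bufferless client (about 3 decoder frames) tolerates roughly 100 ms
# of disruption; a buffered client with 150 frames tolerates 5 seconds.
print(buffer_playout_seconds(3))    # 0.1
print(buffer_playout_seconds(150))  # 5.0
```

This is why a bufferless client such as the MMX pushes the real-time pacing burden onto the server and network, while a buffered workstation client does not.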
2.3 Multimedia-On-Demand Servers

This section first describes the hierarchical nature of the network infrastructure over which MOD services are accessed and motivates the need for MOD servers of varying scales. It then presents several performance metrics and a taxonomy of such servers. It is desirable that the server architecture scale elegantly across these different scales.
2.3.1 Hierarchical Network Model

If we assume the true-on-demand service model for all future applications, the aggregate throughput requirements for the global networking infrastructure can be overwhelming. Consider an example scenario in which each active user is provided with an independent 20 Mbps HDTV connection. In the continental US alone, during prime time, approximately 150 million users in 77 million households would require an aggregate network bandwidth of 1.54 tera-bits/sec [7, 85]. Ideally, if bandwidth and storage were free, a centralized super server that stores all possible programs could serve all the clients. In an alternate scenario, if storage costs were negligible, one could place as many programs close to the client premises as possible. Clearly, both these scenarios are unrealistic. Though improvements in storage technology have led to remarkable increases in the storage capacities of devices and show every sign of continuing to do so [116], storage costs will still be a major component of the total cost of a MOD system. In fact,
Figure 2.7: Hierarchical Network Model

growing concern is being voiced that the enormous network and storage costs in the design of large scale storage systems and network infrastructure may far outweigh the amount of revenue they generate. A survey [67] shows that while MOD had high appeal among the 44% of respondents who were willing to pay for it, only a minuscule 14% were willing to pay more than the existing cable rates for it. Nearly two thirds of the respondents owned a personal computer, with nearly half of them equipped with a modem; hence, we can infer that they were technology savvy and well aware of the difference multimedia-on-demand would make in their lives. Given this, minimizing network and storage costs is very important to make future MOD services affordable. The three main ways to meet network and storage costs in a scalable fashion are hierarchical solutions, caching, and sharing. The consensus among the research community is that the network infrastructure that will support future interactive MOD services will be hierarchical in nature and will look somewhat like the one in Figure 2.7. As shown there, it will consist of an international backbone network connecting several national backbone networks, which in turn interconnect numerous regional networks. Each regional network may interconnect several access networks through which broadband network access is provided to homes. The technology of choice for the backbone network will be ATM over multiple high speed SONET links. In the short term, the access network, typically provided by a local phone exchange or a cable operator, will be based on one of several technology
proposals such as Sub-carrier Modulated Fiber-Coax Bus (SMFCB) or Baseband Modulated Fiber Bus (BMFB) [92]. This access network will connect the user equipment, such as a set-top box or a PC, to the rest of the network via the head-ends. In the long term, the head-ends may be replaced by an ATM switch of reasonable size, such as 1K x 1K. A local MOD server will be connected to each head-end. The higher levels in the hierarchy will have increasingly large scale storage servers. In fact, the system will evolve from a synergy between various enterprises [88]: Storage Providers that manage information storage at multimedia servers (a role akin to that of the video rental stores and libraries of today), Network Providers that are responsible for media transport over integrated networks (a role akin to that of the telephone and cable companies of today), and Content Providers, such as entertainment houses, news producers, etc., that offer a multitude of services to subscriber homes using multimedia servers and broadband networks. The storage costs and network traffic in this hierarchy can be reduced by exploiting the spatial and temporal locality of multimedia document accesses. Studies [6] have shown that viewership patterns are very much dependent on the time of day. Akin to the telephone network, there are peak hours during which the traffic is maximum, and the majority of the titles demanded are a small subset of the most recent set of hit movies. If the profile of user requests for various multimedia documents is available a priori, only those documents that are likely to be requested with high probability should be stored on the neighborhood servers, and less frequently requested documents should be stored on servers at the higher levels of the hierarchy.
This technique, commonly called Info/program caching [85, 88], also minimizes accesses to the servers in the higher levels of the hierarchy and reduces upstream bandwidth requirements during peak hours. Such caching services may be provided by the Storage Providers or Network Providers. The hierarchical and geographically distributed nature of the network infrastructure and the need for distributed caching suggest the need for a spectrum of MOD servers with very different bandwidth and storage capacities. For example, an MOD server in a regional service provider center will have to support a much larger number of clients and a higher storage capacity than a neighborhood MOD server. The challenges in building these different scales of servers can be better understood by studying the performance metrics used to evaluate an MOD server. In the following, we describe these metrics and then use them to define a taxonomy of MOD servers.
(SMFCB is also called Hybrid Fiber Coax (HFC); BMFB is also called Switched Digital Video (SDV).)
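The info/program caching policy described above — keep the titles most likely to be requested on the neighborhood server, push the rest up the hierarchy — reduces to a simple placement rule when an a priori request profile is available. The sketch below is illustrative only; all title names, probabilities, and sizes are invented:

```python
# Sketch of info/program caching: place the most probable titles on the
# neighborhood server until its capacity is exhausted; the rest stay on
# servers higher in the hierarchy. All data below is invented.

def place_titles(titles, capacity_gb):
    """titles: list of (name, access_prob, size_gb).
    Returns (local, remote) lists of title names."""
    local, remote, used = [], [], 0.0
    # Greedily cache in decreasing order of access probability.
    for name, prob, size in sorted(titles, key=lambda t: -t[1]):
        if used + size <= capacity_gb:
            local.append(name)
            used += size
        else:
            remote.append(name)
    return local, remote

titles = [("hit-1", 0.4, 18), ("hit-2", 0.3, 18), ("archive-1", 0.05, 18)]
local, remote = place_titles(titles, 40)
print(local, remote)  # the two hits cached locally, the archive title remote
```

A production policy would also weigh title size and refresh the placement as the popularity profile shifts over the day, but the greedy rule captures the locality argument made in the text.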
2.3.2 Performance Metrics for a MOD Server

In order to compare and classify storage server architectures, we define the following set of performance metrics.
Concurrency: The server may have to support thousands of concurrent clients, each with a number of active media streams and each independently accessing the same or different data. The concurrency metric captures this requirement and is defined as the maximum number of clients that can independently access a multimedia document in any playout control state. Thus, in the case of an on-demand movie server that can support a maximum of 1000 clients, potentially all 1000 clients should be able to access the same copy of the movie independently at any time. Higher concurrency minimizes the need for document replication and hence storage costs. Access latency and operation latency: The access latency metric is defined as the amount of time a client has to wait after sending a request to the MOD server to receive the requested multimedia document. It depends mostly on the service model; the round-trip network latency is a minimal part of it. The access latency should typically be less than a few seconds. The operation latency metric is defined as the time required to start an interactive operation such as ff, rw, pause, stop, random access, etc. Highly interactive MOD applications will typically require an operation latency of less than a second. It is desirable that the access and operation latencies be independent of the load on the server. Clearly, these two metrics characterize how effectively the server provides QOS guarantees in the form of bounded delay. Storage capacity and storage/network throughput per dollar: Given the storage intensive nature of multimedia data, the storage requirements for a large scale server can be in excess of tens of terabytes. For example, a movie server that supports one thousand users with 500 HDTV quality movies will require network and storage bandwidth in excess of 20 Gbps and a storage capacity of 10 TB.
This metric defines the cost-effectiveness of the storage server in meeting the requirements of large storage capacity and large throughput. It must be small for MOD services to be affordable. Scalability: It is desirable that a server architecture that supports, say, 100 clients and an aggregate throughput of 500 Mbps be easily extendable to 1000 clients and a tenfold increase in network/storage throughput without significant modifications. In other words, the server architecture should scale with the number of clients. In
addition, the server must support heterogeneous clients, multimedia data and access networks.
Extensibility: The server must be able to support multiple application scenarios and must be extendable to support different service models. Fault tolerance: In the presence of software and/or hardware failures, the server must be able to degrade services gracefully and affect a minimal number of active clients.
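The capacity and bandwidth figures quoted for these metrics can be reproduced with simple arithmetic. The two-hour movie length below is our assumption (it is not stated in the text), chosen because it makes the quoted 10 TB figure come out:

```python
# Reproduce the storage/throughput numbers used for the performance
# metrics: 1000 clients and 500 HDTV-quality movies at 20 Mbps,
# assuming each movie runs about two hours (our assumption).

def movie_tb(mbps, hours):
    bits = mbps * 1e6 * hours * 3600.0
    return bits / 8 / 1e12            # size in terabytes

def library_tb(n_movies, mbps=20, hours=2):
    return n_movies * movie_tb(mbps, hours)

def peak_gbps(clients, mbps=20):
    return clients * mbps / 1000.0

print(round(library_tb(500), 1))  # 9.0 -> on the order of 10 TB
print(peak_gbps(1000))            # 20.0 Gbps of storage and network BW
```

With overhead for metadata, replication of hot titles, and headroom for interactive operations, the round numbers in the text (10 TB, 20 Gbps) follow directly.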
2.3.3 Taxonomy and Hierarchy of Storage Servers

Using the performance metrics described above (Section 2.3.2), storage servers can be classified as shown in Table 2.1. Since video is the most demanding of all multimedia streams, in this classification the term client is synonymous with a video connection. Though the average rate of a video connection depends on various factors such as image size, resolution and the type of compression used, in the following description we assume a standard NTSC quality video connection to be a 5 Mbps MPEG compressed stream. For HDTV quality compressed video streams, the storage and network bandwidth will be even higher. It must be noted that the storage capacity and the storage bandwidth (throughput) are two attributes of a storage system that are independent of each other. Given that advances in storage technology have almost always improved storage capacity more than storage throughput, extracting higher throughput from a storage system is a more challenging task than providing a large storage capacity. Also, as a general case, the storage in all classes of servers will be hierarchical, consisting of magnetic Direct Access Storage Devices (DASDs), magneto-optic devices such as optical disks, and magnetic and optical tapes. Therefore, the classification below does not include any reference to storage capacity.

Table 2.1: Taxonomy of Storage Servers

  Scale    Number of Clients   Concurrency   Access Latency   Throughput
  Small    25                  25            < 1 sec          155 Mbps (OC-3)
  Medium   100                 100           < 1 sec          622 Mbps (OC-12)
  Large    1,000               1,000         < 1 sec          5 Gbps
  Super    10,000              10,000        < 1 sec          50 Gbps
Note that the throughput requirements increase dramatically for large storage servers. With the advances in fast packet switching, providing increased amounts of network bandwidth is feasible. However, due to the modest rate of improvement in the speed of storage devices such as magnetic or optical disks, the same is not true of storage throughput. Also, note that the access and operation latency requirements remain the same irrespective of the size of the server. Satisfying these throughput and latency requirements together can be very difficult. This dissertation aims to design, prototype and analyze a MOD server architecture and associated services that scale from small servers to super servers and meet these difficult requirements.
2.4 Summary

In this chapter, we provided background information that highlighted various issues in the design of MOD services and servers. Specifically, we discussed various MOD services, and characterized the application and client scenarios in which they will be accessed. Also, we discussed a network model that requires a hierarchy of storage servers, with large scale storage servers required higher up in the hierarchy. We provided a set of performance metrics to evaluate MOD storage servers of different scales. In the end, we presented a taxonomy of storage servers which clearly indicates the need for scalable architectures for storage servers and services.
Chapter 3

Research Overview

The goal of the research reported in this dissertation has been to design, prototype and deploy scalable, cost-effective, high bandwidth, interactive, all digital multimedia-on-demand services and servers using off-the-shelf components with minimum changes, and to learn from the experience. In this chapter, we describe the various research problems we solve in this dissertation. We first provide a brief summary of the existing state of the art in MOD servers and services and contrast our solutions with it. We follow this with an overview of our innovative solutions and list the contributions of this dissertation. In the end, we present an outline for the rest of this dissertation.
3.1 Research Questions

In this section, we enumerate the important research questions in the design and implementation of MOD servers and services. The research reported in the rest of the dissertation provides answers to these questions. 1. Access interface and implementation of MOD services: The world-wide-web has emerged as a universal, easy-to-use information access interface. Can we build high bandwidth video/audio-on-demand services in such a way that they can be easily accessed from a web browser in a location and end-system independent fashion? Can we separate the high bandwidth data path effectively from the web access and use the web only for control operations? What are the basic features required in a content creation service such as a multimedia recording service and a content access service
such as a fully interactive playback service? Can more complex MOD services be built using the basic playback and recording services?
2. Enhancements to the server OS for QOS guarantees and high performance: Traditional server operating systems, such as BSD UNIX, have been designed for interactive computing, where high resource utilization, efficient resource sharing and quick response are the design considerations. In the past, no attention has been paid in the design of such an OS to providing Quality-of-Service (QOS) guarantees to applications in the form of periodic CPU, storage and network access. Real-time operating systems support guaranteed CPU access for mission critical tasks in real-time systems, but they do not support guaranteed high bandwidth access to storage systems. Also, they traditionally run on CPUs optimized for use in embedded systems, which do not have the extensive compute power commonly required for multimedia and general purpose computing. Clearly, such operating systems would require significant changes to support multimedia. Instead, general purpose operating systems can be enhanced to provide the soft-real-time guarantees required by MOD services. This viewpoint raises the following research questions: (1) What modifications are required to CPU scheduling to allow applications to gain guaranteed periodic and low-latency access to the CPU? (2) What changes must be made to storage system components such as the file system and disk drivers to support multimedia data types? and (3) What mechanisms are required to achieve efficient data transfers between storage and network subsystems with minimum context switching overhead, a reduced number of user-kernel boundary crossings, and minimum data copies? 3. Scalability of MOD services and servers: The MOD services and servers must scale with the storage capacity, the number of active clients, and the number of concurrent accesses to any data. Also, as the services and servers are scaled, the per user cost should remain constant, and the efficiency and load balance of the server should not be affected.
The questions that these requirements raise are: (a) How can we build storage servers and services that can support potentially thousands of concurrent clients? (b) Do we need to employ supercomputers and/or massively-parallel machines to do that? (c) Can we devise a server architecture that can scale from 10 clients to 1000 clients? (d) Can we continue to use the OS mechanisms discussed earlier to support high performance and QOS guarantees in this new architecture?
3.2 Overview of Existing Solutions

In this section, we provide a brief overview of the state of the art in MOD services, OS support for guaranteed access to CPU and storage resources, and the design of MOD servers, and contrast it with our ideas. A more detailed discussion of related work in each of these areas is provided in Chapters 5, 7, and 8.
Web based MOD services: In the early phase of the WWW, web servers provided access to predominantly text and image data using the Hyper Text Transfer Protocol (HTTP). However, multimedia data such as video, audio, animations, and 3D graphics have started to appear on web sites. Recently, several companies such as VxXtreme, RealPlayer, and Microsoft [5, 8, 9] have developed low bit rate audio-video encoders, decoders and data streaming servers. These applications use either TCP/IP or UDP/IP and employ extensive buffering and application level error recovery mechanisms at the client end. They also use our approach of separating the control path from the high bandwidth data path that needs QOS, and use the web for control path operations. However, due to limitations of current workstations and the internet, the data path they use lacks QOS guarantees and high bandwidth. Though audio playback quality is often satisfactory with these applications under normal conditions, the video playback quality and frame rates are very poor. Due to the lack of guaranteed CPU access, running any CPU intensive application concurrently with these applications dramatically deteriorates the video and audio playback. Also, none of these applications support network based recording and publishing of audio-video content. These solutions to multimedia-on-demand employ a software only approach in which the media encoding/decoding is performed in software. Though eventually, with the advent of faster CPUs with specialized instructions for media processing, these solutions are expected to deliver better performance, at present they are inappropriate for high bandwidth, high quality multimedia. The Multicast Backbone (MBONE) infrastructure developed for audio-video conferencing over the internet provides different tools for participating in on-going multicast video/audio sessions, encoding local video and multicasting it in a conference, and recording an active session locally [60, 73].
However, no tools exist to record and playback an MBONE session from a remote server. Also, MBONE was designed to be used in the internet at large, where bandwidth is scarce and the majority of hosts are PCs with minimal computing power.
Guaranteed CPU access in the server OS: Several research efforts have attempted to enhance existing general purpose server operating systems with real-time scheduling mechanisms to schedule real-time tasks that process multimedia data. The Real-Time thread mechanism implemented in RT-MACH [106], the threads with real-time priorities implemented in Solaris [71], and other research efforts such as [72, 123, 124] are prominent examples of a thread based approach. A competing mechanism called Real-Time-Upcall (RTU) [52], developed in our research group, follows an event based approach. Most of the tasks in multimedia data handling, such as data retrieval, transmission, and processing, require a bounded, predictable amount of time and thus are suitable for event based scheduling. The RTUs employ co-operative scheduling and minimize locking and context switching overheads. The RTU mechanism has been implemented in the NetBSD operating system ported to SPARC and Intel i386 platforms. The choice of the CPU scheduling mechanism influences the complexity of the design of the MOD services. Also, the selected CPU scheduling mechanism must be well integrated with any OS enhancements aimed at improving the storage and network subsystems. Guaranteed storage access in the server OS: In the area of guaranteed storage access, several research groups have focused on developing disk scheduling algorithms [96, 102, 125] for supporting multimedia data retrievals. A few more recent efforts such as [61, 63, 102] address the issues of disk scheduling and storage bandwidth allocation. The problem of minimizing data copying for data transfers between different I/O devices has received significant attention in the context of variants of the UNIX OS [18, 19, 48, 49, 97] and more recently in the Solaris OS [107, 123, 124].
However, with the exception of the enhancements to the Solaris OS, none of these solutions provide zero-copy data transfer paths between storage and network devices, nor do they provide QOS guaranteed access to storage devices. An elegant and efficient solution to these problems is crucial to realizing high bandwidth MOD servers and services. Scalable server architecture: High performance storage I/O has been a topic of significant research in the realms of distributed and supercomputing for quite some time now. Recent efforts, such as [10, 11, 12, 64], have attempted to use large symmetric multiprocessor machines or SIMD massively parallel machines to build large MOD servers. However, traditionally supercomputers have not been optimized for networked applications and therefore do not support high performance network to disk I/O. Also, we believe that using supercomputers to build MOD servers is overkill.
Several other research efforts, such as the Tiger file server project at Microsoft Research [20] and the ServerArray project at Eurecom [15], represent efforts to build scalable cluster based servers. Research efforts such as Fellini [80], Symphony [102], and work at UCSD [118] and IBM TJ Watson [33] represent disk array or single disk based servers that are not scalable. In summary, three years ago when we undertook this research, cost-effective, efficient solutions to the problem of building high bandwidth, high quality multimedia-on-demand (MOD) services and servers did not exist. Devising such solutions has been the primary objective of the research reported in this dissertation.
3.3 Innovative Ideas

This section provides an overview of the innovative ideas that form the basis of our solutions to the problems outlined in Section 3.1.

1. MOD Services: Our primary ideas here are separation of the control and high bandwidth data paths and use of the web as a universal, easy-to-use interface for control operations. We use these ideas to design and prototype high performance recording and playback servers and the client GUI applications that access the services provided by these servers.

2. Server OS enhancements: Our innovative ideas that aim to enhance the UNIX OS to support high performance and QOS for multimedia are: (1) a novel zero copy MultiMedia Buffers (MMBUF) buffer system that unifies the file system buffer cache and the mbuf buffer system used by the network subsystem, and (2) a two level queuing scheme that employs DRR fair queuing for guaranteed storage access. In our prototype implementation, we combined these ideas with the well known idea of system call aggregation to minimize context switching overheads and with a co-operative CPU scheduling technique called Real-Time-Upcall (RTU). Our OS enhancements provide excellent QOS guarantees and high performance in our prototype MOD services.

3. Scalability of MOD server and services: We proposed an innovative, highly scalable distributed storage architecture called Massively-parallel And Real-time Storage (MARS). We also designed associated distributed data layouts and scheduling to support a large number of concurrent clients. Our architecture uses off-the-shelf components and supports high parallelism and concurrency, and thus minimizes the need for document replication and storage costs. We have also developed a prototype extension of this architecture that can be used to build Superservers.

4. Load balance and scalable servers: Our main ideas here are: (1) the concept of load-balanced operation of a cluster based storage architecture, and (2) the concept of Safe Skipping Distances (SSDs) to support load balance during interactive playback control operations. We undertook an analytical study to characterize the load-balance properties of various distributed data layouts. These properties are crucial to minimizing violations of QOS guarantees in the presence of a large number of connections in arbitrary states.
3.4 Our Research Approach

In this section, we briefly describe our approach to solving the research questions in the different areas described in Section 3.1.
3.4.1 Building MOD Services

Figure 3.1: Building MOD Services

Figure 3.1 illustrates the four tasks we undertook to build the example MOD services. These tasks are as follows:

Client control server: We first developed a client MMX multimedia device control server called mmxd which allows multiple application entities to control the MMX device by exchanging text commands.
Figure 3.2: Building MOD Services

Application level streaming protocol: The web framework consisting of web servers and web browsers (clients) employs the text message based Hyper Text Transfer Protocol (HTTP) for command and data exchange. The file transfer paradigm employed by HTTP, wherein the entire data must be transferred before it can be consumed, is unsuitable for multimedia applications. Therefore, we developed a simple application level streaming protocol which allows applications to request data streaming and playback control for multimedia streams. This protocol can be implemented as an HTTP protocol extension or as a stand-alone control protocol.

Recording service: We prototyped a recording service which consists of a record GUI application at the client end and a recording server called recordd at the server end. The recordd server allows multiple concurrent clients to independently create new content and publish it on the web using their local MMX device, the mmxd server and the record application.

Interactive playback service: We prototyped an interactive playback service that consists of playback servers and a simple GUI application with VCR like controls at the client end. Our server implementations support high quality playback with complete playout control and good QOS guarantees. Similar to the recording application, the playback application uses the mmxd control server to control the local MMX that supports the multimedia playback.
3.4.2 Building MOD Servers

Figure 3.3 illustrates the salient features of the two high performance MOD playback servers we designed and prototyped. The first server, called the Single Node MOD (SNMOD) server, runs on a single PC equipped with high bandwidth SCSI channels, an OC-3 ATM interface and a large amount of storage configured as software disk arrays (Figure 3.4). The server runs the NetBSD operating system
Figure 3.3: Two server prototypes

enhanced with the mmbuf based zero copy disk-to-network data path, DRR fair queuing in the storage system, and a new system call API, called the stream API, that allows aggregated, asynchronous I/O. We integrated these new OS mechanisms with a software disk array. We also prototyped a distributed MARS architecture (Figure 3.5) consisting of six
slave PCs controlled by a master PC and interconnected using an off-the-shelf 8-port 155 Mbps ATM switch. Each of the PCs in our prototype runs the enhanced NetBSD OS. We designed distributed data layouts which break the high bandwidth multimedia data into constant-time-length units called chunks and stripe them onto the slave PCs. Such striping increases parallelism in data access and increases aggregate throughput and concurrency. We implemented a striping service on our PC cluster to support such striping. We also designed a distributed scheduling scheme which uses the periodic timing information multicast by the master server and the knowledge of the distributed data layout to guarantee ordered data transmission from unsynchronized storage nodes. Our prototype distributed playback server implements this scheduling scheme.
Figure 3.4: Single node server (a 200 MHz Pentium Pro PC running the enhanced NetBSD UNIX kernel; StreamRead/StreamSend requests move data over the MMBUF based fast zero copy data path from the DRR scheduled SCSI subsystem, with real-time and non-real-time queues over a CCD disk array, to the ATM NIC and switch port)
3.4.3 Achieving Load Balance

An MOD server has to support independent connections in arbitrary playback states; i.e., when some connections are in normal play mode, other connections may be paused, in fast forward, or in slow-play. Typically, to minimize resource requirements, interactive operations such as fast forward or rewind are implemented by skipping stream frames. We showed that such frame skipping can lead to potential load imbalance in distributed cluster based storage servers that employ distributed layouts. We analytically proved several interesting load-balance properties of Generalized Staggered Distributed Cyclic (GSDCL) layouts. Specifically, we provided safe skipping distances which, when used to implement fast-forward and rewind operations, always result in load-balanced operation of the cluster. Our prototype implementation of the striping and distributed playback services makes use of these analytical properties.
Figure 3.5: Distributed MARS server prototype (N Pentium PC storage nodes and a manager node interconnected by an ATM switch)
3.5 Contributions

The key contributions of this dissertation are as follows:
We designed and prototyped recording and fully interactive web based playback services, building both the server and client components, to conclusively demonstrate that high quality, full rate video and audio-on-demand can be made a reality. In building these services, we separated the control and high bandwidth data paths. We use the web for control path operations and a hardware data path for high performance. We also uncovered several interesting effects of client device limitations on server design, and showed that OS enhancements are necessary for these services to provide high performance and QOS guarantees.

We demonstrated that existing operating systems such as 4.4 BSD UNIX can be enhanced to provide soft real-time guarantees to multimedia applications and servers. Specifically, we demonstrated that the Real-Time-Upcall (RTU) technique provides guaranteed access to the CPU resource. We developed a novel MultiMedia Buffer (mmbuf) system for a zero-copy data path between the disk and the network. We showed that
the problems of resource allocation in the SCSI driver and of minimizing disk seek and rotational latency can be easily decoupled using two level queuing. Also, we proved that DRR fair queuing over multiple priority queues in the SCSI driver can be implemented with minimal complexity and provides fair access to storage bandwidth. Our other notable contribution is a novel system call API that obviates the need for separate system calls for different streams by supporting request aggregation. In short, our enhancements to the NetBSD OS prove that 4.4 BSD UNIX can be made a strong candidate for a next generation multimedia operating system.
We proposed and prototyped a novel distributed storage server architecture that shows that a scalable MOD storage server and service can be built in a cost-effective way using commodity components such as PCs and off-the-shelf ATM interconnects. We also showed that, using the simple principle of data striping, a high level of concurrency and parallelism can be supported in such a distributed server and storage replication can be minimized. We also analyzed the load-balance properties of distributed data layouts that are crucial to load-balanced operation of a generic distributed cluster architecture. These results illustrate how a rich choice of interactive search speeds can be provided to the clients of an interactive playback service.
3.6 Dissertation Outline

In the remaining chapters, we present the individual components of our research. Each chapter includes a problem statement, a detailed description of our solutions and/or our prototype implementation, and experimental results characterizing its performance. We also discuss the limitations of our design and/or the conditions necessary for our solutions to yield high performance. At the end of each chapter, we compare related work with our solutions. Chapter 4 describes the design and prototyping of web based recording and interactive playback services using the existing 4.4 BSD UNIX operating system and web server software. Chapter 5 describes in detail several new OS mechanisms designed and implemented in the NetBSD UNIX operating system to support guaranteed high performance access to the CPU, disk and network subsystems. Chapter 6 describes the design, implementation and performance of a single node MOD server that uses these OS enhancements. In Chapter 7, we address the problem of scalable server and service design. Specifically, we present the Massively-parallel And Real-time Storage (MARS) architecture and the associated
data layouts and scheduling techniques. We also describe the prototype implementation and performance of the striping and distributed playback services. Chapter 8 analyzes the load-balance properties to derive the safe skipping distances that guarantee that the QOS guarantees of connections are not violated. Finally, Chapter 9 presents our conclusions.
Chapter 4

Simple Web based Multimedia-On-Demand Services

The client/server model has become a popular way of architecting distributed systems wherein servers provide services which are accessed by clients using standardized interfaces or protocols. Several of the existing Internet services, such as email, telnet, archie, and the world-wide-web, use the client/server model. For example, in the world-wide-web, the web servers provide a file access service which is accessed by the clients using an application called a web browser, such as Netscape Navigator or Internet Explorer. The information exchange between the web server and the browser employs a standardized protocol – the Hyper Text Transfer Protocol (HTTP). The present day web servers provide access to only a few types of data, such as text, images, graphics and animations. Given the on-going push for integration of multimedia data such as video and audio with these traditional data types, multimedia-on-demand (MOD) services are becoming increasingly important. Two important classes of these MOD services are: content creation services that allow end-users to create and publish multimedia data, and content access services that allow end-users to access the multimedia data. Interactive recording and multimedia document composition are two simple examples of content creation services, whereas fully interactive movies-on-demand, orchestrated presentations, and personalized agent assisted news are a few examples of content access services. It is desirable that these MOD services be accessible using a universal and easy-to-use access interface. Since the web interface has become the preferred universal interface for information access, the new MOD services should be accessible from a web browser.
Several rudimentary MOD playback services that employ low bandwidth, low quality audio and video have already begun to appear on the Internet. However, with the ongoing improvements in compression and higher network speeds, the push for high quality, high bandwidth multimedia services is getting stronger. Therefore, in our work we focus on high bandwidth, high quality MOD services accessed using the web interface. In this chapter, we describe the prototype client and server system components of two simple web based services: an interactive recording service for content creation, and a fully interactive playback service for content access. We have architected our services to allow construction of more complex MOD services from these basic services. In the prototype systems we describe in this chapter, the MOD services are accessed using a client multimedia device called the MultiMedia Explorer (MMX). We first describe this device in some detail and highlight its strengths and limitations, which are crucial to understanding the issues in building such services. We then present the detailed design and implementation of each of the services.
4.1 MultiMedia Explorer (MMX) Client Device

Figure 4.1: Basic internal architecture of MMX
The MultiMedia Explorer (MMX) is a multimedia device designed and prototyped at Washington University's Applied Research Laboratory [93, 94, 95]. It consists of three basic subsystems (Figure 4.1): (1) an ATMizer, a 155 Mbps ATM interface which allows it to be connected to a standard OC-3 SONET port of an ATM switch; (2) a video subsystem capable of full motion, full rate duplex compression and decompression of MJPEG video; and (3) an audio subsystem that can encode and decode stereo audio. The MMX supports two standard NTSC analog video input ports and an SVIDEO port. These ports can be used to connect video sources such as a camera, laser disk, Digital Video Disk (DVD) player, or VCR. The two video output ports can be used to connect an NTSC or SVIDEO compatible display device. Stereo audio sources, such as microphones and CD or DVD players, can be connected to the MMX using standard RCA jacks, whereas stereo speakers can be connected to the AUDIO-OUT-A and AUDIO-OUT-B ports. The MMX supports a second ATM port, called the WORKSTATION port, which can be connected to an ATM host-network interface. It also provides an RS-232 serial port through which it can be connected to and controlled from any host with a serial port.
4.1.1 Characteristics of MMX

The capabilities and the limitations of the MMX multimedia device are described below:
Capabilities of MMX: The video subsystem in the MMX uses a JPEG compression engine to perform full rate (30 fps) compression of digitized video using MJPEG intra-frame compression. The video quality and the bit rate of the compressed stream can be controlled by the quantization factor used by this engine. The resulting bit stream is sent by the ATMizer over an ATM connection. The video subsystem also contains an MJPEG decompression engine capable of full rate decoding of a compressed video stream supplied by compression of a local source or received by the ATMizer over an ATM connection. It writes the decoded images asynchronously into a frame buffer, which can store up to two previously decoded frames. The decompression engine has a built-in speed advantage of 20%, which allows it to decode a video stream at up to 35 fps. The audio codec in the MMX supports up to 8 different audio sampling rates: 44.1, 29.4, 22.05, 17.64, 14.7, 11.025, 8.82 and 7.35 kHz. A digital audio bit stream received from the network (via the RXFIFO bus) is reconstructed to generate the analog audio signal. The DSP processor in the audio subsystem of the MMX provides volume control, left/right stereo channel mixing, amplification, and multi-source mixing functions. It also supports a simple selective sample discard to compensate for minor rate mismatches between sender and receiver.
Lack of playout buffers: The MMX was originally designed to be used in scenarios where the sending device on the ATM network would be another MMX. The ATMizer subsystem in the MMX drains the data produced by the audio and video subsystems on a regular basis and produces a smoothed cell stream. As long as the intermediate network connecting the sender and the receiver MMX does not introduce excessive jitter and/or loss, the received cell stream at the receiver MMX is faithfully reconstructed in hardware. This clearly suggests that the data flow between two MMXes connected by a loss-free ATM data pipe is clocked by a hardware clock and processed in a hardware pipeline. Any rate mismatches between the sender compression and receiver decompression/reconstruction processes are of the order of clock skews. Also, the speed advantage in the video engine keeps the RXFIFO nearly empty. For these reasons, the designers of the MMX provided very small buffers in the video/audio data paths. The RXFIFO and TXFIFO used in the audio and video channels are typically 1024 bytes in size. Such a small buffer size qualifies the MMX as an example of the bufferless client described in Chapter 2. This characteristic of the MMX requires that if a non-MMX device, such as a video server, attempts to send data to the MMX, it must pace the data very carefully to avoid buffer overflows or underflows. Also, the clock used at the sending end must be very accurate; any large transient mismatches between the sending rate and the expected rate at the receiving MMX cannot be compensated easily due to the lack of buffers.

Lack of capability to compress/decompress an integrated audio/video bit stream: The MMX generates audio and video streams completely independently of each other and transmits them over two separate ATM connections.
In fact, the audio bit stream is uncompressed and does not carry any timing information that can be used to correlate audio samples to the corresponding samples/frames in the video bit stream. So in the event that video and audio data are lost or not paced accurately, audio-video synchronization degrades dramatically.

Lack of capability to handle lossy video bit streams: The video decompression engine in the MMX expects a completely error-free data stream. In the event of loss of ATM cells that results in loss of video/audio data or of vital frame/field markers in the video bit stream, it cannot perform any error concealment or frame discard, and it quickly stalls, resulting in a "freeze" in the playback.

Lack of support for AAL5 segmentation/reassembly: The ATMizer block in the MMX at present supports only AAL0, a raw ATM cell stream. The lack of support for the AAL5
segmentation/reassembly protocol, which is widely used in ATM data networking applications, complicates integration of the MMX into systems in which the receiving or sending entities are non-MMX devices such as a workstation or a server. Most of the current ATM network interfaces used on desktop workstations or network servers do not provide efficient support for segmentation or reassembly of AAL0, and thus integration of the MMX into a high quality end-to-end system wherein one of the end points is a non-MMX device is quite challenging. In the remaining chapters, as we describe our system prototypes, we will present the implications of these characteristics on the performance of our system.
4.1.2 MMXD: An MMX control multiplexing daemon
Figure 4.2: MMXD control multiplexing daemon

The MMX device in its present form provides very low level commands for controlling the various subsystems described above. In common MMX application scenarios, a multimedia application running on the host exchanges such commands over a serial connection or using the SNMP protocol (the Simple Network Management Protocol, commonly used to manage and control network entities such as routers, bridges and switches) run over the ATM port. If multiple instances of the same or different applications want to control the MMX, the control commands must be serialized and consistent state maintained across all applications. In order to relieve the
application programmer of these low level tasks and to provide the abstraction of a device that understands a set of simple commands, we have developed a control multiplexing daemon called mmxd (Figure 4.2). This daemon, currently available for the UNIX and Solaris operating systems, runs on the host to which the MMX is connected via a serial RS-232 line or via the workstation port. Any application that runs on the host and wants to control the MMX opens a TCP/IP connection to the mmxd and exchanges simple text commands. In its present form, mmxd also allows connections from applications running on other hosts. At any given time, any number of applications on the host can multiplex their control commands to the MMX through mmxd. The high level commands supported by the mmxd are as follows:
INIT MMX: This command initializes all MMX subsystems.

OPEN CONN: This command informs the ATMizer to open ATM connections with the requested VCI and VPI numbers. It also sets up the data flow between the ATMizer and the video/audio subsystems. For example, in response to the command OPEN CONN TX VIDEO 1 0 400, mmxd sets up the compressed video data to flow from the local video source over the ATM connection with VPI=0 and VCI=400.

CLOSE CONN: This command closes the ATM connections with the specified VPI and VCI opened earlier using the OPEN CONN command. For example, the command CLOSE CONN TX VIDEO 0 400 closes the ATM connection with VPI=0, VCI=400 on which the compressed video from the local active video source is being transmitted.

SET VIDEO: This command sets the active video source among the three video ports – CIN1, CIN2 and SVIDEO. Also, based on the type of the device – laser disk, VCR or camera – it executes the low level MMX commands required to achieve proper synchronization at the analog level. For example, SET VIDEO VCR CIN1 initializes the video subsystem to synchronize to the NTSC compatible VCR on port CIN1 of the MMX.

SETQ [rxq] [txq]: This command sets the video compression and decompression quality by setting the quantization factor used by the JPEG encode and decode engines. The quantization factor used in the compression engine directly controls the network rate of the compressed video. It can range from 1 to 255, where smaller values result in better quality.

VOLUME CTRL [AA][AB][BA][BB][ALB][BLB]: This command is used to control the volume levels of stereo channels A and B. Since there are two output audio ports and four possible audio input sources (two channels received over the network and two channels from local audio sources), six different gain values need to be set to achieve audio mixing.

MUTE AUDIO, SELF MUTE: These commands allow applications to mute or un-mute the audio playback.

HELP, help: If the mmxd is used via a telnet session, this command provides online help on the command syntax. It is useful during a remote control session with the MMX.

CHANGE CONFIG: When the mmxd is run, it reads the default configuration of the MMX devices from a .marsrc file. This command allows the configuration to be changed on the fly.
We have successfully integrated the mmxd in our implementations of MOD services which are described in the rest of this chapter.
4.2 Design of MOD Services

The multimedia-on-demand (MOD) services that we describe in this chapter can be accessed from the web page of a web server. Figure 4.3 illustrates the service access web page of the currently deployed MOD server. The web page lists the basic services that the server provides, which are as follows:

Change of password: A user can access the services if a password protected account has been set up for him. The user can change the assigned password at any time using this service.

Start mmxd: By clicking on this link, the user can run the client MMX control daemon – mmxd – described earlier.

Kill mmxd: By clicking on this link, the user can terminate the active instance of the mmxd daemon.

Recording service: This service allows the user to record, play back and publish audio and video content to the MOD server.

Fully interactive playback service: Also called True-Video-On-Demand, this service allows the user to open a dedicated session with the MOD server. It provides full interactive control over the playback.
Figure 4.3: Service access web page

In order for a client to be able to access these services, a configuration file .marsrc must exist in the user's home directory. In the remaining sections we will describe each of these services in greater detail. The implementation of these services at the server assumes the abstract directory structure illustrated in Figure 4.4. As shown in this structure, the content of the MOD server is stored in the /MARS directory on the server. This directory contains three main subdirectories: accs, html and root. The html directory contains all the MARS web documents, whereas the root subdirectory contains the data, images and helper applications such as the password administration programs. The actual user accounts are maintained in the accs subdirectory, which contains storage mount-points such as s1, s2, and s3 on which file stores are mounted. These file stores can be created on single disks, on software disk arrays with multiple disks, or on RAIDs. In our prototype, we use BSD FFS file stores on disks and a software disk array. However, other file stores, such as the Log File System (LFS) [81] or a multimedia file system [102], can also be mounted on these points. The user accounts are created on the file stores. For example, in Figure 4.4, user account milind is assigned to the file store mounted on /MARS/accs/s1, whereas the file store on /MARS/accs/s2/ contains
the accounts for the users guru and turner. The multimedia content created by a user is entirely confined to his/her account directory.

Figure 4.4: Abstract directory structure
4.3 Web Based Multimedia Recording Service

Figure 4.5 illustrates the client and server components of the interactive content creation service. It consists of a server recordd running on the MOD server and a client side Graphical User Interface (GUI). The user can activate the record GUI from the web page illustrated in Figure 4.3 by clicking on the recording service web link. In our setup, the recordd server runs on a PC with a dual ultra-wide SCSI adapter controlling multiple large capacity storage disks, and a 155 Mbps OC-3 ENI ATM interface. The server runs the NetBSD UNIX operating system with support for software disk striping. Using
this striping support, multiple storage disks at the server can be configured into a larger software disk array.

Figure 4.5: Recording service: components
4.3.1 Client Application: Record GUI

Figure 4.6 illustrates the application that allows a client to access the content creation service. The invocation of this application by the browser relies on the MIME type system used to convey different data types between clients and servers on the Internet. Specifically, the browser (client) invokes the record application as a helper application to handle a new MIME type, text/record, defined by the MOD server. This requires a new MIME type entry for text/record in the .mime.types file and the registration of Record.pl as the text/record handler to set up the MIME type system. User interaction with the record server is modelled at both the server and the client by a session object. Typically, a user opens a session with the recording server, performs content creation, editing and playback tasks, and in the end terminates the session. The GUI illustrated in Figure 4.6 contains three distinct parts, described below:
Session control: A user controls a session with the server using the OPEN SESSION, CLOSE&Reopen, CLOSE SESSION, and QUIT buttons in the top half of the GUI. The current recording application is single threaded and thus allows the client to open only one session at a time. To open multiple sessions to the same or different servers, the client must run independent instances of the application.
Figure 4.6: Client application for accessing recording service
Document and account processing: The set of buttons and the list box on the left hand side of the GUI allow the user to navigate his/her account directory and process audio/video documents as needed.

Session properties: In the lower right part of the GUI, the user specifies the playback or recording properties of the session. For example, in order to record an audio/video document, the user specifies the device to record from, the sampling rate, the video quality, the document name and the network connection numbers. To minimize user input, the application can also use the predefined defaults in the .marsrc file in the user's home directory. To simplify the GUI, this portion of the GUI can be displayed only when a record or a play operation is requested. The typical user interaction in a session with the record application is as follows.
OpenSession: The user clicks on the OPEN SESSION button to open a session with the recordd daemon. A pop-up entry box prompts the user for the server name, port number, account name and password. The application then opens a TCP/IP connection to the requested recordd server and, if the account information is valid, displays the contents of the user's home directory on the recording machine. For example, in Figure 4.6, user milind has opened a session to the demand9.arl.wustl.edu server, and the directory content containing several movies, such as StarWars and ElvisComback, is displayed in the listbox. The user can use the CHDIR button to traverse the directory tree in his/her account directory and the DELETE button to remove any unwanted content.
Recording a video/audio document: This operation uses the Session Properties part of the GUI and requires the following steps:

– Specify multimedia device properties: The user selects the appropriate media - video, audio or both - to record. The radio buttons corresponding to the audio sampling rate, recording device type and port number, and the video quality on the sliding dial are assigned default values. The user can change these parameters if desired.

– Specify document name: The user enters a name in the file name entry box, say Sample. The recordd server uses this name to create a subdirectory Sample in the current working directory of the session. It also sets up the audio file Sample.mjpeg.adirty and/or the video file Sample.mjpeg.vdirty for recording audio and/or video.

– Specify network connection properties: The record application allows the user to request three different types of network connections: Manual, Permanent Virtual Circuit (PVC) Server and Switched Virtual Circuit (SVC). If the PVC server option is selected, the application connects to a well-known network resident PVC server to obtain VPI and VCI identifiers for audio and video ATM connections to the recording server. In the case of the SVC option, the application performs appropriate network signaling to obtain these identifiers. In the event neither of these two options is supported, the user must select the manual mode and provide valid VPI and VCI numbers.

– Start recording: Pressing the START button begins the recording process. In response to this event, the application sends a text message to request a record session with appropriate properties. If this command succeeds at the server, the application connects to the local mmxd daemon controlling the MMX and exchanges control commands to set up (1) the video device, (2) the sampling rate, (3) the video quality, and (4) the video and audio connections with the requested VPI and VCI in transmit mode. If these operations succeed, the MMX streams the audio/video data to the recording server over the network, where it is captured and recorded to the storage devices.
– Stop the recording: In response to the user pressing the STOP button, the application sends a text message to the recordd and also exchanges commands with the mmxd to close the ATM connections. If the above steps complete without errors, the user has successfully created a multimedia document with the requested number of streams.
Preprocessing of video/audio files: The recorded documents need to be pre-processed to generate meta information used during video/audio playback. The video pre-processing extracts video frames and sends messages containing the number of frames and frame size information to the application. The successful completion of preprocessing creates a video data file Sample.mjpeg and a corresponding video meta information file Sample.mjpeg.mdata at the server. A statistics file - Sample.mjpeg.stats - that records statistics such as the average frame size, maximum and minimum frame size, maximum network rate, and a histogram of frame sizes is also created. These statistics can be used by the playback server to make intelligent admission control and resource reservation decisions.

Extracting and viewing images: The user can extract and convert frames from the recorded video stream into JPEG and GIF images using the GETIMAGES button. In response, the application prompts the user to enter the range of frames for which images need to be created. If the operation succeeds, the server creates full size and thumbnail size images for the requested frames. All lossy frames are ignored. Using the SHOWIMAGES button, the user can view these extracted images. The GETIMAGES and SHOWIMAGES functionality is useful to decide if the recording was successful.

Playing back recorded video/audio: The record application also allows the content creator to play back the recorded multimedia document. The user can request selective playback of audio, video or both streams. The steps in playback are as follows:

– Select the playback type: Select one of the AUDIO, VIDEO or BOTH buttons. In order to play the document stored under the name Sample, the user enters the subdirectory corresponding to that document.

– Network connection: Specify the appropriate mode of network connection between the MMX and the recordd server, as in the case of the record operation.
– Start and stop the playback: The START and STOP buttons on the GUI are used to accomplish these tasks. The application also executes appropriate MMX commands, such as OPEN and CLOSE connections, to start and stop the playback.
Publish recorded documents: The recorded multimedia documents can be published by using the PUBLISH button on the GUI. In response to the PUBLISH command sent by the application, the recording server runs a recursive PUBLISH script on its file stores. If the script succeeds, the documents are made available on the web page of the MOD server, which can be accessed using the document playback service by anyone with a valid MOD account.

Close/Quit session: Clicking the CLOSE SESSION button closes the current active session. Clicking the QUIT button terminates the application.
Our recording application is representative of future multimedia content creation applications and can be extended easily to support multimedia devices other than MMX and additional service functions.
4.3.2 The Recording Server - recordd

Figure 4.7: Process structure of the recordd daemon

The recording server runs as an ordinary UNIX process on the recording machine (Figure 4.7) and waits for client requests on a well defined port. It accepts a new session connect request if the number of active sessions does not exceed a predefined limit. It spawns a handler process which is responsible for all command processing for the admitted session. This handler process implements the following command set.
OPEN SESSION ACC PASSWD: In response to this command, the server first validates the password-account pair and, upon success, sends the contents of the home directory of the corresponding account. It also creates a session object to keep track of state information, such as open files, control and data connections, playback/record state, and the work directory of this active session.
CHDIR sub-dir: In response to this command, the server changes the current work directory and allows the user to navigate the directory hierarchy.
CREATE DIR [ xyz ]: This command creates a directory in the current work directory and changes the work directory to this new subdirectory.
START RECORD [BOTH | VIDEO | AUDIO] [FILE=StarWars] [Proto=ATM | SVC | PVC] ASAMPLE QFACTOR VVPI VVCI AVPI AVCI [AAL=AAL0 | AAL5]: This command sets up the recording for audio and video streams. In this particular example, it creates a new subdirectory StarWars in the current work directory and opens files for recording video and audio. It uses the network protocol information field to decide the type of network connections to set up. If this field is specified as ATM, it uses the VPI and VCI values specified in the command to set up the standard BSD sockets corresponding to these connections. Standard read calls on these sockets retrieve data sent by the MMX. It also creates a hidden information file .StarWars info in which all the session properties, such as network connection numbers, video quality factor, audio sampling rate, etc., are recorded. The recording process then continuously reads the data from the network connections and writes them to the opened files.
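As an illustration, a handler might tokenize a START RECORD message into a session-property table roughly as follows. This is a minimal Python sketch only: the function name, the KEY=VALUE token form and the underscore in the command name are assumptions, not the exact recordd grammar.

```python
def parse_start_record(line):
    """Split a START_RECORD control message into session properties.

    Illustrative sketch: assumes the command name is followed by the
    media selector and space-separated KEY=VALUE tokens.
    """
    tokens = line.split()
    if tokens[0] != "START_RECORD":
        raise ValueError("not a START_RECORD command")
    props = {"media": tokens[1]}          # BOTH | VIDEO | AUDIO
    for tok in tokens[2:]:
        key, _, value = tok.partition("=")
        props[key.lower()] = value
    return props

# Example: record both streams of "StarWars" over pre-configured ATM PVCs
cmd = "START_RECORD BOTH FILE=StarWars PROTO=ATM VVPI=0 VVCI=100 AVPI=0 AVCI=101"
session = parse_start_record(cmd)
```

The handler would then use entries such as session["vvpi"] and session["vvci"] to open the BSD sockets for the data path.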
STOP RECORD: In response to this command, the server terminates the on-going recording by closing the video/audio files and the network connections for the data path.
START PLAY BOTH FILE ATM VVPI VVCI AVPI AVCI and STOP PLAY FILE VVPI VVCI AVPI AVCI: These commands perform the tasks complementary to the ones that set up and tear down a recording session. Once the playback is set up, the server attempts to send video and/or audio data periodically.
PROCESS FILE FILE: This command is used to process the recorded files, which involves two steps: (1) cleaning data files and (2) processing the "cleaned" files to extract the timing markers and save their location information in a separate meta-data file. Figure 4.8 illustrates the block diagram for these two steps.

Figure 4.8: Steps in preprocessing the recorded multimedia (the Cleaner converts StarWars.vdirty into StarWars.mjpeg, which the Video Parser processes into StarWars.mjpeg.mdata)

The cleaning step is necessary due to limitations imposed by the MMX client device and the ATM interface that we employ in our prototype. Specifically, the MMX supports only the AAL0 adaptation layer. The ENI ATM interface in our prototype does not support reassembly of multiple AAL0 cells into a single large data unit and causes an excessive overhead of one CPU interrupt per ATM cell. To minimize this overhead and avoid data loss, we modified the network interface driver to emulate cell reassembly. This, however, results in raw ATM cells being stored in the disk files and requires that the ATM cell headers be removed later by off-line processing. We call this step cleaning. Once cleaned, the data files containing the original cell stream are discarded. For example, as illustrated in Figure 4.8, a video file StarWars.vdirty containing the bunched cell stream, when cleaned, results in a file StarWars.mjpeg that contains only the payload portion of all cells. The StarWars.vdirty file created at the time of recording is removed after the cleaning step is completed.

In the second pre-processing step, the server employs a parser routine to scan the video or audio bit stream to extract timing information. The format of the MJPEG interlaced video generated by the MMX is illustrated in Figure 4.9. Each video frame in the stream consists of two fields: an even field and an odd field. The even field begins with an even-field marker SOF0 and ends with an end-of-even-field marker EOF0. A start-of-odd-field marker - SOF1 - immediately follows the EOF0 marker. The odd field ends with an end-of-odd-field marker EOF1. Table 4.1 shows the 16-bit bit-patterns for each of these markers.
Figure 4.9: Format of the MMX MJPEG video (a frame is an even field, SOF0 (0xffe0) ... EOF0 (0xffe2), immediately followed by an odd field, SOF1 (0xffe1) ... EOF1 (0xffe2))

Table 4.1: MMX video markers

  Field marker   Bit pattern
  SOF0           0xffe0
  SOF1           0xffe1
  EOF0           0xffe2
  EOF1           0xffe2
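The cleaning step described above is mechanical: a .vdirty file is a back-to-back sequence of raw ATM cells, and cleaning keeps only the payloads. A minimal sketch, assuming standard 53-byte cells with 5-byte headers and cell-aligned recordings; the function name is hypothetical.

```python
CELL_SIZE = 53     # standard ATM cell: 5-byte header + 48-byte payload
HEADER_SIZE = 5

def clean(dirty):
    """Strip ATM cell headers from a recorded *.vdirty byte stream.

    Sketch of the off-line 'cleaning' step: the modified driver stored
    raw AAL0 cells back-to-back, so the payload stream is recovered by
    dropping the 5-byte header of every 53-byte cell. Assumes the file
    starts on a cell boundary; a trailing partial cell is discarded.
    """
    payload = bytearray()
    for off in range(0, len(dirty) - CELL_SIZE + 1, CELL_SIZE):
        payload += dirty[off + HEADER_SIZE : off + CELL_SIZE]
    return bytes(payload)

# Two fake cells whose payloads carry 'A'*48 and 'B'*48
dirty = b"\x00" * 5 + b"A" * 48 + b"\x00" * 5 + b"B" * 48
cleaned = clean(dirty)
```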
The video parser extracts information on the even and odd fields of the video frames. Continuing our example above, the server parses the StarWars.mjpeg video file to find N video frames and 2N video fields, and creates a meta information file - StarWars.mjpeg.mdata - with N fixed size meta-data entries. Each meta-data entry (Figure 4.10) describes the start, end and size of the even and odd fields and the start, end and size of the entire frame. Since the parser routine depends upon the format of the media bit-stream, a separate routine is required for every compression standard, such as MJPEG, MPEG, and H.261. The audio bit stream generated by the MMX is an uncompressed stream and does not contain any timing information markers. So parsing this audio stream creates meta information based on the audio sampling rate and the sample size. The drawback of this is that the audio meta information can often be incorrect.
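Under the marker layout of Figure 4.9 and Table 4.1, the parsing step amounts to a linear scan for the SOF/EOF bit patterns. The sketch below assumes a well-formed stream (no lost cells) and uses the Figure 4.10 field names; the function name and the dictionary representation are illustrative, since the real parser emits fixed-size binary entries.

```python
SOF0, SOF1, EOF = b"\xff\xe0", b"\xff\xe1", b"\xff\xe2"  # Table 4.1 patterns

def parse_mjpeg(stream):
    """Build per-frame meta-data entries for an MMX MJPEG byte stream.

    Each frame is SOF0 ... EOF0, SOF1 ... EOF1, where both end markers
    share the pattern 0xffe2. Offsets index into the stream; field
    names follow Figure 4.10 (LossFlag is omitted in this sketch).
    """
    entries, pos = [], 0
    while True:
        even_start = stream.find(SOF0, pos)
        eof0 = stream.find(EOF, even_start)
        odd_start = stream.find(SOF1, eof0 + len(EOF))
        eof1 = stream.find(EOF, odd_start)
        if min(even_start, eof0, odd_start, eof1) < 0:
            break  # no further complete frame
        odd_end = eof1 + len(EOF)
        entries.append({
            "EvenStart": even_start, "EvenEnd": eof0 + len(EOF),
            "OddStart": odd_start, "OddEnd": odd_end,
            "FrameStart": even_start, "FrameEnd": odd_end,
            "FrameSize": odd_end - even_start,
        })
        pos = odd_end
    return entries

# A synthetic stream of two frames, each with an even and an odd field
frame = SOF0 + b"even" + EOF + SOF1 + b"odd" + EOF
meta = parse_mjpeg(frame * 2)
```

For a stream with N frames this yields the N entries that the parser would write to the .mdata file.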
META CREATE FILE: This command is similar to the PROCESS FILE command but executes only step (2).
CREATE IMAGES FILE from upto: This command instructs the server to use the video meta information created during the PROCESS FILE or META CREATE command to extract video fields for frames in the range from to upto. It interlaces the video fields and generates the corresponding GIF and JPEG images using standard compression/decompression functions available in the jpeg library.

Figure 4.10: Format of the meta data (each frame entry records EvenStart, EvenEnd, OddStart, OddEnd, FrameStart, FrameEnd, FrameSize and LossFlag)
SHOW IMAGE FILE: In response to this command, the server transfers the image file to the user application, where it is displayed.
DELETE FILE: This command removes the requested file or directory in the user account.
PUBLISH LOCAL | : This command requests the server to publish the content of the specified account or all accounts on a web page which is accessible through a web server. The server accomplishes this by executing a publish script which recursively scans the accs subtree of the abstract directory structure shown in Figure 4.4. The leaf nodes in this tree - i.e., the directories that do not contain any subdirectories - represent the multimedia content. The script processes the meta and other session information files to generate a session description file that is used during playback. For every leaf node, it records an HTML link entry in the web page created in the account directory to which it belongs. For example, after scanning the tree in Figure 4.4, the publish script generates the lecture1.mmjpg session description file in the lecture1 leaf node directory. It also creates a web page guru.html in the home directory - /MARS/accs/s1/guru - of the user Guru. This page, among other entries, contains an entry for lecture1.mmjpg with lecture1 small.gif as the thumbnail.
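The recursive publish step can be sketched as a directory walk in which leaf directories stand for documents. This is an illustrative Python sketch only; the link format and the function name are assumptions, not the actual MARS publish script.

```python
import os

def publish(account_dir):
    """Collect one HTML link entry per leaf directory under an account.

    Sketch of the recursive publish step: a directory with no
    subdirectories represents one multimedia document, and each yields
    one hypertext entry for the account's web page.
    """
    entries = []
    for dirpath, dirnames, filenames in os.walk(account_dir):
        if not dirnames:  # leaf node == one recorded document
            doc = os.path.basename(dirpath)
            entries.append('<a href="%s/%s.mmjpg">%s</a>' % (dirpath, doc, doc))
    return entries
```

Running this over an accs subtree like the one in Figure 4.4 would produce the link entries that the script records in each account's web page.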
The above text-based message protocol between the server and the client application is similar to the HTTP protocol used in the WWW and can be easily extended to support new functionality.

Summary: We described the design and implementation of the client and server components of the interactive recording service. Our prototype recordd server runs on a 200 MHz Pentium Pro PC with a 155 Mbps ENI ATM adaptor and 27 GB of storage in the form of three 9 GB 2-disk CCD software disk arrays. The 2 MB on-board memory on the ATM adapter was statically configured to be shared among the transmit and receive VCIs. The machine runs the NetBSD UNIX version 1.3 kernel with an ATM driver and an enhanced socket interface with support for native mode ATM connections. The kernel did not have any enhancements to support multimedia data. Our prototype server supports up to 60 Mbps aggregate recording throughput on a single CCD. Faster disk arrays with a larger number of disks provide higher write bandwidth and therefore a smaller data loss rate. However, currently, the maximum AAL0 throughput limits the maximum recording throughput to 60 Mbps.
4.4 Web Based MOD Interactive Playback Service

The WWW based interactive playback service allows a user to access multimedia documents recorded and published by any user of the recording service. It supports fully interactive playback, which allows the user to control the playback by performing operations such as pause/resume, fast-forward (ff), rewind, slow-play, random search or even content-based searches. The available MOD documents are presented to the client of this service via a web page such as the one in Figure 4.11. This web page is updated using an auto-generation mechanism that ensures that creation of new documents or deletion of old documents is immediately reflected in the web page. Some links in this web page have rightward arrows indicating that the link actually points to another web page describing additional MOD documents. Such links are non-terminal links. For example, in Figure 4.12, the link Movies by Milind leads to a web page that describes MOD documents made available by Milind. If the user clicks on such a link, the browser displays the web page which contains hypertext links that describe movies recorded by user milind. We call such links terminal hypertext links. In response to a click on such a link, the server sends a session description file of the mime type video/mmjpg to the client.

Figure 4.11: MoD Document Access: Main page

Figure 4.13 illustrates the client and the server components of this service. In response to a user click on a terminal link corresponding to a MOD document, the browser receives the session description file and consults its mime type system to invoke the playback GUI. Clearly, the mime type system must contain a new type definition - video/mmjpg mmjpg - and a handler definition - video/mmjpg mmxplay.pl - to process the documents of this new type. The client application and the MOD server are described below in greater detail.
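The browser-side dispatch that both services rely on amounts to a table lookup from MIME type to helper application. A minimal sketch, mirroring the .mime.types and handler entries named in the text; the lookup function itself is hypothetical.

```python
def pick_handler(mime_type, handlers):
    """Resolve the helper application for a document's MIME type.

    Minimal sketch of the browser's MIME dispatch: the mapping mirrors
    .mime.types/handler entries. Handler names are the ones mentioned
    in the text; the lookup logic is illustrative.
    """
    try:
        return handlers[mime_type]
    except KeyError:
        raise ValueError("no helper registered for %s" % mime_type)

handlers = {
    "text/record": "Record.pl",    # recording GUI (Section 4.3.1)
    "video/mmjpg": "mmxplay.pl",   # interactive playback GUI
}
```

A click on a terminal link delivers a video/mmjpg session description file, so pick_handler("video/mmjpg", handlers) launches the playback GUI.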
Figure 4.12: MoD Document Access: Milind’s video page
4.4.1 A Simple Application Level Streaming Protocol

As shown in Figure 4.14, the WWW consists of three entities: the web server, the client in the form of a browser such as Netscape, and the internet that connects them. The web server provides access to text, images and data by implementing a set of application protocols such as GOPHER, FTP and HTTP layered on the TCP/IP protocol stack. The Hypertext Transfer Protocol (HTTP) represents the most common protocol used for information exchange between web clients and servers. In essence, it consists of a set of methods - GET, FORM, LINK, POST, PUT - which are sent to the server by the client. The server executes these methods and returns the response, which is a status notification with or without additional data. The GET method is the most common method, used to retrieve a document - an ordinary text, hypertext, postscript data, image, or audio/video file - from the server.

Figure 4.13: Fully Interactive Playback Service: Components

However, this file transfer paradigm used in the HTTP framework is unsuitable for bandwidth and storage intensive streams such as video, audio, graphics, and high quality animations. Instead, a data streaming paradigm that allows data to be sent at a regular rate from the server to the client and allows control messages to be exchanged between them is necessary for such data types. To this end, we have proposed modifications to the WWW framework to achieve this. Specifically, we have added a new method called GETSTREAM to the available set of methods in the HTTP protocol. Our extension can be implemented in two ways: (1) as an integrated web server that supports this new method along with the standard HTTP methods, or (2) as a stand-alone MOD server completely independent of the web server [101] that uses the new method as a streaming protocol. The new getstream method is used to request the web or the MOD server to stream the data corresponding to video/audio files. The syntax of the getstream method allows the client to specify stream operations such as playback, interactive control, and content based searches. A few examples of getstream requests from the client to the server are shown below:

1] GETSTREAM COMMAND=OPEN?/mpeg data/alien.mpg+3000+TCP HTTP/1.0
Figure 4.14: Basics of WWW

In this request, the client requests that the server stream the MPEG movie alien.mpg in the /MPEG DATA/public html directory over a TCP connection to the client waiting on port number 3000.

2] GETSTREAM COMMAND=OPEN LOOPPLAY?/mpeg data/alien.mpg+3000+CMTP HTTP/1.1
In this request, on the other hand, the client requests that the server stream the same movie using a video transport protocol called the Continuous Media Transport Protocol (CMTP) [89] and that it be opened in the LoopPlay mode.
3] GETSTREAM COMMAND=RANDOM ACCESS?4500 HTTP/1.0
4] GETSTREAM COMMAND=PAUSE? HTTP/1.0
5] GETSTREAM COMMAND=FRMDADV?45 HTTP/1.0
6] GETSTREAM COMMAND=LOOPPLAY? HTTP/1.0
7] GETSTREAM COMMAND=CONTENTSEARCH?"SearchElvis" HTTP/1.0
The rest of these examples specify various playout control operations on an already open video connection. For instance, example 3 above requests a random access operation starting at the 4500th second from the start of the movie. An alternate way to specify all interactive commands, using a single PLAY command with three arguments - speed, slow factor and location - is illustrated below.

8] GETSTREAM COMMAND=PLAY?-1+-1+4500 HTTP/1.1
9] GETSTREAM COMMAND=PLAY?0+-1+-1 HTTP/1.1
10] GETSTREAM COMMAND=PLAY?2+-1+-1 HTTP/1.1
In this command syntax, if the speed is set to zero, the requested state is Pause, whereas any non-unit positive (negative) value of speed changes the state to fast-forward (fast-rewind). A non-unit positive integer value for the slow factor decides the temporal slowdown for the stream playback and puts the session in slow-play mode. The third variable - location - defines the temporal position at which the playback should begin. Thus, in the examples above, (8) is an example of Random Access from the 4500th second of the movie, whereas (9) is an example of the Pause operation. The last example requests fast forward at twice the playback rate.

The interactive playback service implementation shown in Figure 4.13 employs two connections - a control connection and a data connection - for communication between the server and the client. In our prototype, the control connection is implemented as a reliable TCP connection and is used for exchange of getstream method calls. The data connection can be one of three types: a completely reliable TCP connection or a completely unreliable UDP or native-mode ATM data stream, an RTP connection, or a partially reliable CMTP connection [89]. The flow and congestion control mechanisms of TCP, aimed at providing reliable data delivery, render it unsuitable for video transmission. On the other hand, the UDP and NATM protocols do not provide any flow and error control. The RTP protocol provides time-stamp information and employs UDP for data transmission. It has been widely used in the Internet for audio and video data transmissions. In our prototype, since the MMX is a native ATM device, we use an NATM data connection.
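The getstream syntax and the PLAY-argument rules above can be sketched as two small routines. This is an illustrative Python sketch, not the httpd+ parser; it assumes '+'-separated arguments and treats -1 (and a speed of 1) as the "unspecified" value, consistent with examples 8-10.

```python
def parse_getstream(request):
    """Split a GETSTREAM request line into its command and arguments.

    Sketch of the method syntax of Section 4.4.1, e.g.
    'GETSTREAM COMMAND=PLAY?0+-1+-1 HTTP/1.1'.
    """
    method, rest, _version = request.split(" ", 2)
    if method != "GETSTREAM":
        raise ValueError("not a getstream request")
    command, _, argstr = rest[len("COMMAND="):].partition("?")
    return command, argstr.split("+") if argstr else []

def play_state(speed, slow, location):
    """Map the three PLAY arguments to a playout state (sketch).

    Follows the rules in the text: speed 0 -> Pause; non-unit positive
    (negative) speed -> fast-forward (fast-rewind); non-unit positive
    slow factor -> slow-play; non-negative location -> random access.
    """
    if speed == 0:
        return "pause"
    if speed > 1:
        return "fast-forward"
    if speed < -1:
        return "fast-rewind"
    if slow > 1:
        return "slow-play"
    if location >= 0:
        return "random-access"
    return "play"  # all arguments at their 'unspecified' values

# Example 8: random access from the 4500th second
cmd, args = parse_getstream("GETSTREAM COMMAND=PLAY?-1+-1+4500 HTTP/1.1")
state = play_state(*(int(a) for a in args))
```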
4.4.2 Client Application - MMX Control Interface
Figure 4.15: Client application for fully interactive playback

Figure 4.15 illustrates the GUI front-end of the playback application. When the application initializes, it establishes TCP/IP connections with the local mmxd daemon and with the playback server on the server machine. Once it is connected to the playback server, the application receives the thumbnail image and displays it as the document icon in the left hand corner of the GUI. The text label and the image icon are useful when multiple documents are accessed concurrently. A typical user interaction with the playback application is as follows:

Select streams to play: The user selects the appropriate media - audio, video or both - to play by selecting the appropriate button. The application cross-checks the streams requested with the streams listed in the session description file and warns the user of incorrect choices.

Open a playback session: The user requests a playback session to be started by pressing the OPEN button. In response, the application first obtains the network connection identifiers based on the default option. If the user selects the "Manual" option and the .marsrc does not list any default PVCs for the server in the session file, the application prompts the user to enter valid PVCs. In the case of the PVC server or SVC option, the application communicates with a host or network resident signaling entity to obtain the connection identifiers. Finally, it composes and sends an OPEN command in the form of a text message to the server. If the OPEN command succeeds, the application communicates with the local mmxd to set up the video and/or audio connections, the video quality factor and the audio sampling rate described in the session description file. If all these operations complete without errors, the MMX receives the data over the requested ATM connection and resumes the stream playback.

Playout control: The user can click on the ff, fast-rewind, Slowplay, Slow-rewind, PAUSE and GOTO-START buttons to interactively control the playout, much like a standard VCR. The minute and second sliding dials and the Random Search button together allow the user to start playback in the current state from the specified random search location.

Using activate: The playback service allows the user to concurrently access multiple documents. However, the MMX can display only one document at any time. In this case, only one document is "active" and the rest are "inactive". The ACTIVATE button on the GUI makes the corresponding document the "active" document by reopening the appropriate ATM connections at the MMX.

Close session: The CLOSE SESSION button allows the user to send close commands to the playback server and the mmxd. A successful close session operation releases the network and storage resources held by the session at the server and the local MMX device.
4.4.3 Interactive Playback Server Implementation

We implemented the playback server that provides the interactive playback service as an integrated web and multimedia streaming server. We modified the public domain web server - httpd, available from the National Center for Supercomputing Applications (NCSA) [3] - to support the streaming protocol in the form of the new getstream method. The process architecture of the original httpd server, illustrated in Figure 4.16, consists of a parent process and a set of child processes. Each child maintains a bidirectional interprocess communication (IPC) channel (constructed out of bidirectional socket pairs) to the parent. The parent process works as a request distributor, whereas the child processes implement various application protocols such as FTP, GOPHER, and HTTP. The parent process receives new TCP connection requests and, if the number of back-logged connections is not exceeded, forwards the new request to an idle child process, which processes the HTTP method calls on that connection.

Figure 4.16: Organization of the NCSA httpd server

Figure 4.17: Modifications to httpd

We modified this web server to implement the simple streaming protocol described in Section 4.4.1. In this new server, called httpd+ (Figure 4.17), if the child finds a getstream method request, it opens a separate socket of the appropriate protocol type for streaming multimedia data. The TCP connection obtained during regular HTTP processing is used as a control connection for interactive playout control commands.
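The child's choice of data socket can be sketched as a table keyed by the transport named in the getstream request. The sketch below covers only the TCP and UDP cases expressible with portable BSD-style sockets; the NATM and CMTP connections used in the prototype have no stock equivalent, and the function name is hypothetical.

```python
import socket

def data_socket_type(proto):
    """Choose the data-connection socket type for a getstream OPEN (sketch).

    The control connection is always the TCP connection inherited from
    regular HTTP processing; only the data connection type varies with
    the transport named in the request.
    """
    table = {
        "TCP": socket.SOCK_STREAM,  # fully reliable byte stream
        "UDP": socket.SOCK_DGRAM,   # unreliable datagrams (also under RTP)
    }
    try:
        return table[proto]
    except KeyError:
        # NATM and CMTP sockets existed only in the modified NetBSD kernel
        raise ValueError("unsupported data transport: " + proto)
```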
Figure 4.18: Streaming architecture for httpd+ (a prefetch process read()s from the file system into a ping-pong buffer in shared memory, from which a send process transmits on the video and audio VCIs over NATM sockets)

Figure 4.18 illustrates the producer-consumer process architecture used for streaming the audio and video streams in a multimedia document. The child process that processes the getstream request first sets up shared memory buffers for the video and/or audio streams, opens the ATM network connections and forks a copy of itself. The main process serves as a data prefetcher and the other process serves as a data sender. The two processes share the video/audio buffers in the shared memory space. The data prefetcher reads frame data from the video and audio data files using the standard read() system call. It uses the stream frame number n and the meta-data file to find the size of the nth frame and its offset in the stream data file. The data sender checks the validity of the buffers in the shared memory space and attempts to periodically transmit the video/audio frames using the send() system call. It uses the frame size and frame period (33 msec) information to compute the rate at which the data must be paced by the network interface. This rate specification is passed on to the network driver using ioctl() calls. The data prefetcher process receives and processes the client commands for the interactive operations such as fast-forward/rewind, pause/resume, slow-play, and random-search. It appropriately modifies the prefetch schedule and the session state maintained in the shared memory.
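The sender's rate computation and the prefetcher's meta-data lookup reduce to a few lines. A sketch under the stated 33 msec frame period; the (offset, size) pair list is a simplified stand-in for the fixed-size .mdata entries, and both function names are assumptions.

```python
FRAME_PERIOD = 0.033  # seconds between frames for 30 fps video

def pacing_rate_bps(frame_size_bytes):
    """Pacing rate the send process would hand to the driver via ioctl().

    One frame of frame_size_bytes must leave the interface every 33 msec.
    """
    return frame_size_bytes * 8 / FRAME_PERIOD

def frame_extent(meta, n):
    """Offset and size of frame n in the stream data file.

    meta is a list of (offset, size) pairs, a simplified stand-in for
    the fixed-size entries of the .mdata file.
    """
    return meta[n]

# e.g. ~41 KB frames correspond to roughly 10 Mbps of video
rate = pacing_rate_bps(41250)
meta = [(0, 41250), (41250, 39000)]
```

The prefetcher's seek for an interactive operation (random search, fast-forward) is then just frame_extent() at a different n.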
4.5 Experiences

Our recording and playback services have been deployed, routinely used and demonstrated in our ATM testbed for well over 18 months. We have used the recording service to create multimedia documents from movies on video cassettes and laser disks, TV programs, CD players, and live lectures/teleconferences. We have observed the average aggregate bandwidth of the recorded multimedia documents with one video and one audio stream to be 10-12 Mbps. Our recording application has been found to be very easy to use and requires little knowledge of the system. Our playback service provides an intuitive interface with ordinary VCR functions and delivers stereo quality audio and VHS NTSC quality video. Activating and managing multiple independent sessions is easy in our prototype service. Up to 13 concurrent 10 Mbps sessions can be easily activated by each MMX client station to fully utilize the 155 Mbps OC-3 ATM link. Our services have formed the basis of more advanced services, such as the multimedia composition service which has been prototyped and deployed [114]. Such advanced services can be realized by modifying only the control path, without any changes to the high bandwidth data path. At present, due to the lack of ATM signaling support for the MMX and for NATM protocols in NetBSD, our services rely on pre-configured PVCs supplied in a configuration file or manually by the user. This limitation has often proved cumbersome when multiple sessions are to be activated from the same MOD server. Also, the lack of a playout buffer at the MMX, the lack of timing information and network induced losses often lead to irreparable loss of audio-video synchronization. Despite these limitations, our services conclusively demonstrate that high quality MOD services are feasible with existing technology.
4.6 Summary In this chapter we described the design of simple web based interactive MOD services. Specifically, we described an example network connected multimedia device called MMX and presented the design of web based interactive recording and fully interactive playback services that use this device. We also described the design of the client applications and the server components of these services.
Chapter 5 OS Extensions for QOS Guarantees and High Performance The multimedia data handled by Multimedia-on-Demand (MOD) services such as recording and playback require Quality-of-Service ( QOS) guarantees in the form of guaranteed bandwidth and bounded delay. The end-to-end nature of these services requires that the end systems, namely the storage server and the client device, and the network that connects them provide such guarantees. Within a server system, such services periodically transfer data between the storage and network subsystems; thus, both these subsystems must provide QOS guarantees. In the previous chapter, we described MOD services implemented on a server that runs 4.4 BSD UNIX OS. We conclusively demonstrated that the existing 4.4 BSD UNIX OS lacks mechanisms to support guaranteed CPU and storage access required by such services. In this chapter, we describe innovative OS enhancements that we have proposed and implemented to rectify these limitations. The rest of this chapter is organized as follows: In Sections 5.1, and 5.2, we describe the sources of inefficiencies and the lack of QOS on the existing control and data paths in 4.4 BSD— UNIX for network-destined data retrievals from the storage subsystem. Section 5.3 clearly demonstrates this using performance measurements on the playback server described in Chapter 4. Section 5.5 provides an overview of our OS extensions that rectify these limitations. Specifically, Section 5.6 describes a new CPU scheduling technique called Real-Time-Upcalls (RTU) and presents results on the guaranteed CPU access it provides to MOD recording and playback servers. Section 5.7 describes at length the design of the new mmbuf buffer management system that shortens the data path from the disk to the network interface. Section 5.8 describes our modifications to SCSI driver to support fair resource sharing among requests of different priority classes. In section 5.9, we present
details on the new system call API available to a user application to access these new OS services. Section 5.11 presents a detailed performance evaluation of these OS modifications and discusses performance benefits and limitations. Finally, we summarize this chapter.
5.1 Limitations of UNIX CPU Scheduling

Traditionally, the UNIX OS has been used in interactive computing environments where efficient sharing of CPU, storage and network resources is the primary design consideration. The existing CPU scheduler in UNIX systems aims to provide adequate response time to active tasks but does not provide any guarantees on periodic execution. A typical UNIX scheduler employs N priority queues and serves them in round-robin fashion. At any given time, a process in RUN state belongs to one of these N queues, and its priority changes over time based on the history of its CPU usage and I/O activity. Due to the periodic nature of multimedia data, multimedia applications such as MOD servers and multimedia compression/decompression require periodic access to the CPU resource. For example, an MPEG decoder decoding a video stream received over a network connection must be able to execute every 33 msec to periodically read data from the network connection, decode it and display the resulting frames. Similarly, a MOD playback server streaming 30 fps video must be able to periodically gain access to the CPU to request data retrieval and send operations. The existing dynamic-priority based UNIX scheduling does not provide such soft-real-time guarantees required by multimedia applications.
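A naive application obtains such periodicity with usleep(), which only bounds the delay from below: under CPU load the process may not be rescheduled for many milliseconds after the sleep expires. A userspace sketch of such a loop, with an illustrative deadline-miss counter (the function and constant names are ours, not from the systems described here), looks like this:

```c
#include <sys/time.h>
#include <unistd.h>

#define FRAME_USEC 33333L            /* one 30 fps frame period */

/* Returns how many of nframes periods overran by more than slack_usec.
 * usleep() sleeps *at least* one period; any extra scheduling delay
 * accumulates as deadline misses under load. */
int count_deadline_misses(int nframes, long slack_usec)
{
    struct timeval prev, now;
    int misses = 0;

    gettimeofday(&prev, 0);
    for (int i = 0; i < nframes; i++) {
        usleep(FRAME_USEC);
        gettimeofday(&now, 0);
        long delta = (now.tv_sec - prev.tv_sec) * 1000000L
                   + (now.tv_usec - prev.tv_usec);
        if (delta > FRAME_USEC + slack_usec)
            misses++;                /* rescheduled too late: deadline miss */
        prev = now;
    }
    return misses;
}
```

Because the scheduler owes the process nothing beyond the minimum sleep, the miss count grows with competing CPU load, which is exactly the behavior measured in Section 5.3.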
5.2 Limitations of Existing Storage and Network I/O in UNIX

In this section, we first provide an overview of the existing disk and network I/O systems and discuss the various software layers involved in data transfer to and from these systems. We then illustrate the control and data paths for disk-to-network data transfers using a function call trace in NetBSD 1.3 UNIX. Figure 5.1 illustrates the layered architecture used in the storage and network I/O systems of current UNIX operating systems. The common user-level accesses to storage devices are through the file system interface, which consists of several layers. The lowest level is the device driver that allows access only in terms of fixed sized blocks. The next layer, called the buffer cache layer, employs caching to minimize disk traffic and improves
effective disk throughput.

Figure 5.1: Existing file and network I/O

The remaining two layers, namely the vnode layer and the local file store layer, provide file system services such as data organization and naming. Note that all I/O through the file system uses the buffer cache system. When a user issues a read request to the file system, the control path moves through these layers. The data retrieved in response is first copied from the disk to the kernel buffer cache, and then from the buffer cache to the application buffer. The network I/O subsystem consists of three logical layers: the network interface driver at the lowest level that controls the interface hardware, the protocol layer that provides various communication semantics, and the application interface layer called the socket layer. The network I/O system presently uses an mbuf buffer system that is completely different from the buffer cache used by the file system layers. This system provides variable length buffers that can be linked together to form buffer chains and is thus ideally suited for present-day communication protocols that add and remove headers and data to memory buffers to form variable length packets. As shown in Figure 5.1, when a user application sends data to the network, it is first copied from the user level buffers into the kernel level mbufs and subsequently into buffers on the network interface card. In the following subsections, we discuss the common API and the disk and network I/O layers in greater detail.
5.2.1 Common Read/Write Application Interface

The file and network I/O system services are accessed through a uniform API consisting of read()/write() system calls which operate on file descriptors. For every process, the UNIX OS abstracts its I/O activity by file entry structures, each of which corresponds to an open file or a network socket. This structure provides generic interface functions such as fo_read, fo_write, and fo_close, and stores state information such as the byte offset at which I/O is currently in progress. As illustrated in Figure 5.2, a file descriptor is an index into a table of pointers that point to these file structures. Thus, for a process with proc structure pointed to by p, using descriptor i and the descriptor table p->p_fd, the OS can obtain the file structure fp = p->p_fd[i]. The file structure in turn points to a vnode structure in the case of a disk file or a socket structure in the case of network I/O. For disk file I/O, the open() system call instantiates the file structure, the vnode and the corresponding inode, whereas in the case of network I/O, the same task is performed by the socket() system call.
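The descriptor-to-object lookup can be sketched in userspace as follows. The field names loosely follow the 4.4 BSD ones (proc, p_fd, file, f_data), but the real structures carry many more fields; this is an illustration, not the kernel code:

```c
#include <stddef.h>

enum ftype { DTYPE_VNODE, DTYPE_SOCKET };

struct file {
    enum ftype  f_type;       /* what f_data points to             */
    void       *f_data;       /* struct vnode * or struct socket * */
};

struct filedesc {
    struct file **fd_ofiles;  /* table of pointers to open files   */
    int           fd_nfiles;  /* number of slots in that table     */
};

struct proc {
    struct filedesc *p_fd;    /* per-process descriptor table      */
};

/* fp = p->p_fd[i], with the validity checks the kernel performs
 * before trusting a user-supplied descriptor. */
struct file *getfp(struct proc *p, int fd)
{
    if (fd < 0 || fd >= p->p_fd->fd_nfiles)
        return NULL;          /* EBADF in the real kernel */
    return p->p_fd->fd_ofiles[fd];
}
```

The f_data pointer is then interpreted according to the descriptor type: a vnode for disk files, a socket for network I/O.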
Figure 5.2: proc, file table and other relevant data structures

We now provide a more detailed view of both file system I/O and network I/O.
5.2.2 File System I/O

The file system interface provides structured access to the underlying disk devices by providing an abstraction of a file as a succession of fixed sized data blocks. All accesses
through this interface use a special disk driver called the “block device driver,” which allows data accesses in terms of fixed sized blocks that are an integral multiple of the smallest block (called a disk sector).

Buffer cache layer: BSD UNIX avoids expensive disk accesses by providing a file-system independent layer called the buffer cache, which manages the memory used for transferring data to and from the disk and also caches recently used disk blocks. In a general purpose interactive computing environment, where small sized I/O operations are common, it has been shown that up to 85% of the implied disk operations are satisfied out of the buffer cache [81]. Figure 5.3 shows the format of a buffer. The buffer header contains information used to find the buffer and to describe the buffer’s contents, including the vnode whose data the buffer holds, the starting offset within the file, the number of bytes contained in the buffer, and a pointer to the data area of the buffer. Whenever a read is performed, the buffer cache layer attempts to find the requested buffer in the cache and, if successful, returns it without any actual disk access. If the buffer is not found, the buffer cache layer gets a new buffer, adjusts the data area of the buffer to the desired size, and passes it on to the disk driver. In the event of non-availability of free buffer blocks, the least-recently-used buffers are removed from the buffer cache and reused.

File system layer: The file system layer performs two independent tasks: first, it manages the local file store that organizes data for files on the underlying disks, and second, it organizes and maintains files and their attributes (such as locks, protection, etc.) in a hierarchical name space commonly called the directory tree structure. 4.4 BSD UNIX supports several local file stores, the most prominent of which are the Fast File System (FFS) and the Log-based File System (LFS). These local file stores employ a common data structure called an index node (inode) that maps a file offset to disk locations, and they also use common name space management code. All file or directory read/write activity is centered around inodes.

Vnode layer: In the recent past, Network File Systems [70, 81] that allow access to local file stores on a remote machine have become commonplace. These remote machines may be running different operating systems and may be connected to different networks. In order to allow transparent access to heterogeneous file systems without requiring complex changes to the internal workings of the OS, 4.4 BSD uses an extensible object oriented layer called the vnode layer. A vnode is a generic object that encapsulates an underlying file system specific object. For example, in the case of a local file system, a vnode encapsulates an inode, while for a network file, it encapsulates an nfs node, i.e., a protocol control block that describes the naming information and the network location of the file.
Figure 5.3: Buffer cache block structure. The header carries, among other fields: flags (B_INVAL, B_LOCKED, B_DIRTY, B_BUSY, B_DONE, B_READ, etc.), b_bufsize (allocated buffer size), b_bcount (valid byte count), b_resid (remaining I/O), b_dev (associated device), b_lblkno (logical block number), b_blkno (underlying physical block number), b_vp (device vnode), b_proc (associated process), b_actf/b_actb (device driver queue links), and b_data (pointer to the data area).

To summarize, the control and data path for file access moves through the vnode layer, the local file system layer, the buffer cache, and the block device driver.
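The cache lookup at the heart of this path can be sketched in userspace as follows. Names loosely follow 4.4 BSD's bread()/bio_doread(), but this is a minimal illustration, not the real implementation:

```c
#include <stdlib.h>

#define NHASH 64

struct buf {
    int         b_lblkno;   /* logical block number (hash key) */
    struct buf *b_hash;     /* hash-chain link                 */
};

static struct buf *bufhash[NHASH];

/* Return the buffer for lblkno; *hit reports whether disk I/O was
 * avoided, mirroring the cache-hit case described above. */
struct buf *bread_sketch(int lblkno, int *hit)
{
    unsigned h = (unsigned)lblkno % NHASH;

    for (struct buf *bp = bufhash[h]; bp; bp = bp->b_hash)
        if (bp->b_lblkno == lblkno) {
            *hit = 1;       /* satisfied from the cache: no disk access */
            return bp;
        }

    struct buf *bp = calloc(1, sizeof *bp);
    bp->b_lblkno = lblkno;
    /* ...the real code calls VOP_STRATEGY() here and sleeps in biowait()... */
    bp->b_hash = bufhash[h];
    bufhash[h] = bp;
    *hit = 0;
    return bp;
}
```

The real bread() also hashes on the vnode, resizes the buffer's data area, and recycles least-recently-used buffers when the cache is full.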
5.2.3 Network I/O

The network I/O subsystem consists of three logical layers: the socket layer API for interprocess data transport, the protocol layer for protocol specific processing, and the network interface layer consisting of the device driver interfacing to the network hardware. These three software layers use a dynamic memory allocation system called the memory buffer (mbuf) system. The mbuf system is designed to meet the need for variable size buffers for communication protocols that routinely prepend or remove headers from packetized data. Figure 5.4 shows the structure of an mbuf. It can contain three sets of header fields. The first part, m_hdr, is always present and describes the attributes of the mbuf, the address of the next mbuf on the chain, and the start address of the data section contained in
the mbuf. The second set of header fields is present if the mbuf contains a packet header. The third set of mbuf header fields is used in the event that an external mbuf cluster is associated with it. When an external page is attached to an mbuf, the M_EXT flag is set, and the mbuf’s data is stored in the external page. Such an mbuf is called a cluster mbuf and is commonly used for data corresponding to large packets exceeding 100 bytes in size. Typically, the size of the cluster page is configured to be 2 KB, large enough to hold protocol headers and data for the largest Ethernet frame, which is 1500 bytes in size. When the system is booted, 4 KB of physical memory is allocated for mbuf clusters. Further memory may be allocated on demand up to a compile-time configurable limit (256 KB by default). Once memory is allocated for mbuf clusters, it is kept in an mbuf free list and is never freed. The mbuf layer provides a set of routines that allow network protocols to allocate and free mbufs, and to manipulate the data in an mbuf chain. We will describe some of these routines when we discuss the design of the mmbuf system.

Figure 5.4: Mbuf structure

In the following, we take a closer look at the representative function flow in a 4.4 BSD UNIX system for a read() from a disk file followed by a send() on an NATM socket.
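Before following that trace, the mbuf layout just described can be sketched in userspace. The sizes here are approximations (the real 4.4 BSD mbuf packs everything into a 128-byte unit, and clusters carry reference-counting fields omitted here):

```c
#include <stdlib.h>

#define MLEN     100      /* approximate internal data space        */
#define MCLBYTES 2048     /* external cluster page size             */
#define M_EXT    0x01     /* data lives in an external cluster page */

struct mbuf_sketch {
    struct mbuf_sketch *m_next;      /* next mbuf in the chain      */
    int    m_len;                    /* valid data in this mbuf     */
    int    m_flags;
    char  *m_data;                   /* start of the data section   */
    char   m_dat[MLEN];              /* internal storage            */
    char  *m_ext;                    /* cluster page when M_EXT set */
};

/* Allocate an mbuf for len bytes of data, attaching a cluster page
 * when the payload will not fit internally, as done for large packets. */
struct mbuf_sketch *m_get_sketch(int len)
{
    struct mbuf_sketch *m = calloc(1, sizeof *m);

    if (len > MLEN) {                /* cluster mbuf  */
        m->m_ext    = malloc(MCLBYTES);
        m->m_flags |= M_EXT;
        m->m_data   = m->m_ext;
    } else {                         /* ordinary mbuf */
        m->m_data = m->m_dat;
    }
    m->m_len = len;
    return m;
}
```

Because m_data is a pointer into the storage rather than the storage itself, protocols can prepend or strip headers by moving m_data without copying the payload.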
Figure 5.5: Read function trace (flow from user space to the SCSI driver: read() → sys_read() → vn_read() → ffs_read() → bread()/bio_doread() → VOP_STRATEGY; uiomove() copies the data from the buffer block to the application buffer while the caller sleeps in biowait())
5.2.4 Control and Data Paths for Disk Read and Network Send

In the following discussion, we assume that the user process has obtained the file descriptors fd and sd corresponding to the file and socket, via the open and socket system calls. The file read system call takes three arguments: (1) the file descriptor on which the read is performed, (2) a pointer to the user buffer into which the data is read, and (3) the amount of data to be read. Figures 5.5 and 5.6 illustrate the function flow for this system call. The path traced by the light arrows represents the control flow, whereas the darker arrows indicate the data flow. The user level system call traps to the kernel and the kernel function sys_read is called with the system call arguments and the proc process structure describing the process that made the system call. Using the fd descriptor and the process’s file descriptor table, the sys_read function obtains the file entry and the corresponding vnode (Figure 5.2). It then creates a uniform I/O (uio) structure to describe the user buffer and invokes the generic vnode layer function vn_read, which in turn calls the FFS file store specific function ffs_read. This function converts the read into one or more reads of logical file blocks and calls the buffer cache layer bread function. The bio_doread function called by bread attempts to find the requested buffer block in the buffer cache using the
logical block number as a hash key. If the block is present in the cache, it is returned; otherwise, bio_doread gets a new buffer from the buffer cache (using allocbuf), adjusts its data area to the desired size by stealing or giving up memory space to other buffers in the cache, and then calls VOP_STRATEGY, which invokes the disk driver's routines. The sdstrategy SCSI driver routine that is called as a result enqueues the buffer in the job queue (buf_queue in sd_softc) and invokes the upper layer function sdstart in the driver. If the disk is idle, sdstart removes jobs from the job queue and passes them to the lower level driver for further processing. At this point, the process which made the system call is put to sleep in biowait, waiting for the disk I/O to complete. When the disk read is complete and the buffer is filled, the disk controller issues an interrupt which is fielded by the lower layer interrupt handler routine of the driver, bha_intr. This routine calls the scsi_done routine, which in turn calls the buffer cache layer routine biodone, which is responsible for waking up the process waiting for the I/O to complete. As shown in Figure 5.5, the kernel uses the standard uiomove() data movement function available in the kernel to copy data from the buffer cache block to the user buffer. Thus, the data gets copied from the disk into the kernel buffer and then into the user buffer. When the data has been read from the disk into the user level buffer, the user process sends this data on an ATM connection by invoking the write system call to write the user buffer to a socket with descriptor sd. Figure 5.7 illustrates the function flow for the send activity. As for the read, the write call traps to the kernel level sys_write function, which extracts the socket object and calls the generic socket level function soo_write.
This calls the sosend() function, which first copies the data from the application buffer into a kernel mbuf chain, and then calls the protocol specific send function through the pr_usrreq interface. The PRU_SEND request issued results in a call to the NATM protocol function atm_output. This function extracts the interface if corresponding to the socket and enqueues the mbuf chain in the send queue of the device driver. If the network interface is currently idle, the driver performs the DMA operations to copy data from the mbuf chain into a per-VCI queue on the interface card and frees the mbuf chain. Once the data is copied to the network interface, it is paced out to the network as per the pacing parameters. The discussion above clearly shows that in a MOD server the data transfer path from a disk to the network interface involves two memory copies: the first copy operation, within the file system layer, copies data from the kernel buffer to a user space buffer. The second copy, by the socket layer, copies the data from a user space buffer into the mbuf chain in the
kernel. Although this approach works well for general purpose I/O, it is clearly sub-optimal for high bandwidth MOD applications.

Figure 5.6: Read function trace in the SCSI driver (flow through the upper and lower halves of the driver: VOP_STRATEGY → sdstrategy → sdstart → scsi_scsi_cmd → scsi_execute_xs → the adapter's scsi_cmd entry; on completion, bha_intr → bha_done → scsi_done → biodone)
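The user-level relay loop implied by these traces is worth writing out, because it makes the overhead concrete: each iteration costs a buffer-cache-to-user copy inside read(), a user-to-mbuf copy inside write(), and two user-kernel crossings. This is an illustrative sketch, not the httpd+ code:

```c
#include <unistd.h>

/* Relay data from one descriptor to another through a user buffer,
 * the way a user-space MOD server must. Returns bytes relayed, or -1. */
ssize_t relay(int fromfd, int tofd, char *buf, size_t bufsz)
{
    ssize_t total = 0, n;

    while ((n = read(fromfd, buf, bufsz)) > 0) {   /* kernel -> user copy */
        if (write(tofd, buf, (size_t)n) != n)      /* user -> kernel copy */
            return -1;
        total += n;
    }
    return n < 0 ? -1 : total;
}
```

Eliminating both copies, while keeping the two halves of this loop schedulable in real time, is precisely what the mmbuf system and stream API of Section 5.5 provide.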
5.3 Demonstration of the Limitations of UNIX as a Multimedia OS

In this section, we present results that conclusively demonstrate the limitations of the existing OS in supporting guaranteed CPU access and high performance guaranteed disk-to-network I/O.
Figure 5.7: Function trace for data send (write() → sys_write() → soo_write() → sosend() → pr_usrreq(PRU_SEND) → atm_output() → IF_ENQUEUE → en_start(); sosend() copies the application buffer into an mbuf chain via uiomove())
5.3.1 Effect of CPU and storage loads on MOD playback

In this experiment we used the MOD playback server and service described in Section 4.4 to activate a single 10 Mbps video session playback. The playback server processes, namely the data prefetcher and the data sender, competed with other UNIX processes that create background CPU and storage load. In an MOD server, examples of background CPU and storage load are ordinary HTTP traffic and non-critical processing performed by other services such as recording. We generated the background CPU load using multiple copies of a CPU intensive PRIMES program that continuously computes prime numbers and performs no I/O. The background storage load was generated using a program which continuously reads a large video file and writes it to a new file. The read and write activity together eliminate buffer caching effects and lead to constant disk activity.
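The core of the PRIMES load generator is trivial: pure CPU work with no I/O, so multiple copies compete with the playback processes only for the processor. A sketch (the function names are ours; the real program simply runs this forever):

```c
/* Trial-division primality test: deliberately unoptimized CPU work. */
int is_prime(long n)
{
    if (n < 2)
        return 0;
    for (long d = 2; d * d <= n; d++)
        if (n % d == 0)
            return 0;
    return 1;
}

/* Count primes below limit; one "unit" of background CPU load. */
long count_primes(long limit)
{
    long count = 0;
    for (long n = 2; n < limit; n++)
        count += is_prime(n);
    return count;
}
```

The storage-load generator is the same idea applied to the disk: a continuous read()/write() file-copy loop instead of arithmetic.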
Figure 5.8: Effect of CPU load on MOD playback (deadline miss probability vs. number of background load processes, with and without storage load)

Our httpd+ server implementation uses the standard usleep() call to achieve periodic data sends and pre-fetches every 33 msec. However, increased CPU load reduces the CPU share for the playback processes and increases the variance of usleep based scheduling. This in turn increases the probability of deadline misses for the prefetch and send activities. Figure 5.8 plots the deadline miss probability for different amounts of CPU load and illustrates this point.
Figure 5.9: httpd+ performance in the presence of CPU and storage load ((a) inter-send time, (b) variance of frame period)
Similarly, the presence of the background storage load reduces the storage bandwidth available to the data prefetcher and increases the read time for video frames and the prefetch deadline misses. This in turn results in data buffers being unavailable to the data sender and increased deadline misses. Figure 5.8 shows that the combined storage and CPU load leads to much higher deadline misses. For example, with high CPU background load (16 PRIMES programs) combined with one storage load process, the deadline misses are as high as 1 in 100 frames, i.e., one every 3 seconds, which results in jerky playback. Figure 5.9 illustrates plots of the average and standard deviation of the fetch and send times observed for two cases: (1) with CPU load and no storage load, and (2) with CPU and storage load (STL). We can see that with no load on the system, the send and fetch operations do occur every 33 msec as required for a 30 fps playback. With the maximum storage and CPU load, the average frame send time increased to 37 msec, resulting in a frame rate of 27 fps, which may still seem to be an acceptable frame rate. However, from the plot of the variance of the inter-frame time we can see that the inter-frame time varies from 37 msec to 67 msec, which results in an unacceptable video playback. Also, note that the send process suffers fewer deadline misses than the prefetch process, as it is not affected by the storage load.
5.4 Summary of Limitations of UNIX for Networked Multimedia

We now summarize the limitations of existing UNIX in supporting networked multimedia and thus motivate the need for our work.
Unnecessary data copying is a performance penalty: From our earlier discussion, we can see that in an MOD server implemented in user space, the data transfer path from a disk to the network interface involves two memory copies: one from the kernel buffer cache to the user buffer, and the other from the user buffer to the mbufs. This approach works well for the small sized accesses observed in general purpose I/O, such as traditional text and binary file accesses. However, multimedia data such as audio, video, and animations do not possess any caching properties: first, they have a ravenous appetite for memory space, and second, they are relevant only for a very small duration from the time of their retrieval. That is, data is often replaced before it can be reused, rendering the extra copy a performance penalty. Consider a 128 MB machine with typically 6.5 MB configured as buffer cache. This cache is only large enough to store 3 seconds of an average MJPEG file, so the kernel has to replace everything in the buffer cache every 3 seconds. Therefore, the buffer cache blocks can be reused only if several processes reading the same video file are phase locked to each other within a 3-second time interval. Such behavior among interactive clients will be rare. Therefore, retrieving multimedia data through a buffer cache does not provide any performance benefits. Also, any application initiated data transfer from a disk file to the network requires the use of two different buffer systems, which were designed with different objectives, and leads to excessive data copying and system call overheads. Large amounts of memory-to-memory data copying consume processor time and precious memory and system bus bandwidth.
Lack of guaranteed storage access: The storage subsystems in the current 4.4 BSD UNIX do not differentiate between real-time and non-real-time applications for disk I/O. The non-real-time file reads and real-time multimedia retrievals at the disk driver are queued into a single job queue and ordered using the elevator algorithm to achieve efficient disk head movement. In the presence of several sessions/processes performing disk I/O, this lack of service differentiation results in large variations in job completion time. We observed this phenomenon in the experiments described in Section 4.3 and Section 4.4. Such behavior is undesirable for real-time requests, which need to be guaranteed a fixed share of disk bandwidth and need to be completed by a certain deadline for the retrieved data to be useful. Also, the non-real-time requests should not face service starvation and must gain their fair share of storage bandwidth in the presence of real-time requests.

Lack of guaranteed CPU access: The existing CPU scheduling does not provide mechanisms to allow user applications to request guaranteed periodic and low-latency access to the CPU resource.

Lack of asynchronous disk I/O: The blocking semantics of the read() and write() system calls in existing BSD UNIX force the design of httpd+ to use two separate processes per session operating as a producer-consumer pair synchronized via System V shared memory. In the presence of multiple active sessions, this results in a large number of active processes and forces expensive context switches as each session is served. Also, each prefetch and send process executes its own read() and send() system calls, resulting in 3N user-kernel boundary crossings for every video/audio frame sent for N sessions. Thus, over a playback duration of T, 6NT system calls are required for standard sessions with an audio and a video stream. Such superfluous system calls reduce CPU availability dramatically.
Lack of aggregation semantics in I/O calls: The read and send calls in 4.4 BSD do not allow multiple I/O operations to be combined or aggregated into a single system call. The readv/writev calls do allow such aggregation for I/O on a given file descriptor. However, 4.4 BSD lacks system calls that allow aggregated I/O over multiple file and/or socket descriptors.

Static memory allocation in shared memory: In our design of the httpd+ server in Chapter 4, we employed user space shared memory between the data prefetcher and the data sender processes to co-ordinate data transfer between the disk and the network. However, the limited amount of per process shared memory available in 4.4 BSD forces a fixed amount of memory to be allocated statically to every session. This in turn makes arbitrary and dynamic allocation of memory for video/audio frames difficult and leads to memory wastage.
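The partial aggregation that writev() does offer, several buffers but only one descriptor per call, looks like this (the frame header/payload split is a hypothetical example):

```c
#include <sys/uio.h>
#include <unistd.h>

/* Gather a frame header and its payload into one system call. */
ssize_t send_frame(int fd, const char *hdr, size_t hlen,
                   const char *payload, size_t plen)
{
    struct iovec iov[2];

    iov[0].iov_base = (void *)hdr;       /* frame header  */
    iov[0].iov_len  = hlen;
    iov[1].iov_base = (void *)payload;   /* frame payload */
    iov[1].iov_len  = plen;
    return writev(fd, iov, 2);           /* one crossing, two buffers */
}
```

There is no way to name a second descriptor in the call, so N sessions still require N such calls per period; the stream API of Section 5.5.4 removes this restriction.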
A simpler, more efficient and optimal design of the MOD playback server would use a single process and aggregate the read and send requests for the streams of N active MOD sessions into a single periodic system call. In such a design, playback of N sessions for duration T will require only T system calls, unlike the 6NT calls required for the earlier design in Chapter 4. However, this requires OS enhancements to support CPU access guarantees, asynchronous aggregated I/O, system calls with aggregation semantics, and zero-copy data paths between the disk and network subsystems. In the rest of this chapter, we describe our proposed OS enhancements that achieve these objectives.
5.5 Overview of OS Enhancements

Figure 5.10 illustrates our new control and data path architecture for disk-to-network I/O. The salient features of this architecture are discussed below.
5.5.1 Periodic data transfer guarantees

In order to achieve constant-rate data transfer between the disks and the network, a user application requires periodic access to the CPU to be able to issue periodic I/O requests. To this end, we employ a novel scheduling mechanism called Real-Time-Upcall (RTU) [52].
Figure 5.10: New enhancements (prefetch and send RTUs drive the stream_read()/stream_send() API; a zero-copy mmbuf data path connects the UNIX FFS to the network protocols and the DRR SCSI driver (sd+) with separate real-time and non-real-time queues, coexisting with the old buffer cache based path)
5.5.2 Multimedia Mbufs: A new buffer management system

In current BSD UNIX operating systems, the file system software uses a sophisticated buffer cache to transfer data between user space and storage, whereas the network protocol stacks use a different buffering system called mbufs to transfer data between user space and the network. Due to this mismatch, any application initiated data transfer between a file system and the network needs excessive data copying. Such data copying for every active client reduces throughput and limits the total number of clients (revenue). Therefore, a fast data path that can provide zero-copy data transfer between storage and network is desirable. We have designed and implemented a new buffer management system called Multimedia mbufs (mmbufs) that provides such a zero-copy data path. By manipulating the mmbuf header, an mmbuf can be transformed either into a traditional buffer that a disk driver can handle or into a (cluster) mbuf which the network protocols and drivers can understand. As shown in Figure 5.10, the old buffer cache based data path, accessed via the well established read() and write() interface, coexists with the new mmbuf based data path. Continuous Media (CM) applications access the mmbuf data path using a new system call API.
5.5.3 Priority Queuing within the SCSI driver

Existing device drivers for storage systems do not differentiate between real-time and non-real-time requests and therefore provide no guarantees for storage accesses. The storage driver must prioritize real-time requests over non-real-time requests and also support resource sharing among different types of requests. To achieve these objectives, we modified the SCSI disk driver to support multiple priority classes and implemented a fair queuing algorithm called Deficit Round Robin [104] to support fair sharing of disk bandwidth among them.
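The core of Deficit Round Robin (after Shreedhar and Varghese [104]) can be sketched in userspace: each backlogged queue earns a quantum per round and may dispatch a request only when its accumulated deficit covers the request's cost, so bandwidth is shared in proportion to the quanta. The queue layout below is invented for illustration; the real code lives inside the SCSI driver:

```c
#define NQUEUES 2   /* e.g. one real-time and one non-real-time class */
#define QLEN    16

struct rq {
    int cost[QLEN];   /* cost (e.g. sectors) of each queued request */
    int head, tail;
    int quantum;      /* service share added each round             */
    int deficit;      /* unused allowance carried over              */
};

static int rq_empty(const struct rq *q) { return q->head == q->tail; }

/* Run one DRR round over all queues; returns requests dispatched. */
int drr_round(struct rq q[NQUEUES])
{
    int dispatched = 0;

    for (int i = 0; i < NQUEUES; i++) {
        if (rq_empty(&q[i])) {
            q[i].deficit = 0;        /* idle queues keep no credit */
            continue;
        }
        q[i].deficit += q[i].quantum;
        while (!rq_empty(&q[i]) && q[i].cost[q[i].head] <= q[i].deficit) {
            q[i].deficit -= q[i].cost[q[i].head];
            q[i].head++;             /* dispatch this request to the disk */
            dispatched++;
        }
    }
    return dispatched;
}
```

Because unused allowance carries over as deficit, a queue with expensive requests is never starved; it simply waits a round or two until its credit suffices, which is how non-real-time requests retain their fair share alongside real-time traffic.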
5.5.4 Stream Application Programming Interface (API)

We designed a new system call API that allows applications to access both the mmbuf based zero-copy data path and the real-time guarantees provided by the SCSI driver. Specifically, we provide stream_open and stream_close system calls to set up and tear down a stream flow from disk to network. We also provide stream_read, stream_send, and stream_poll system calls to allow read, send and poll operations on the mmbuf chains of the stream. These calls are designed to allow requests for multiple active streams to be combined together in one single call, thus eliminating the need for multiple system calls. In the rest of this chapter, we systematically illustrate each of these OS enhancements.
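As a concrete sketch of these aggregation semantics, consider a userspace emulation. The real system calls operate in-kernel on mmbuf chains with no copies; this model only captures the interface shape, using ordinary descriptors and a bounce buffer, and every name and field below is invented for illustration:

```c
#include <unistd.h>

struct stream_req {
    int    fd;    /* source (disk) descriptor         */
    int    sd;    /* destination (network) descriptor */
    size_t len;   /* bytes to move this period        */
};

/* Service every request in one "call"; returns requests completed.
 * One such call per period replaces 2N read()/send() calls for N
 * sessions, which is the point of the aggregation semantics. */
int stream_read_send_emul(struct stream_req *reqs, int n)
{
    char buf[4096];
    int done = 0;

    for (int i = 0; i < n; i++) {
        size_t want = reqs[i].len < sizeof buf ? reqs[i].len : sizeof buf;
        ssize_t r = read(reqs[i].fd, buf, want);
        if (r > 0 && write(reqs[i].sd, buf, (size_t)r) == r)
            done++;
    }
    return done;
}
```

The kernel implementation additionally avoids the bounce-buffer copy entirely by handing the same mmbuf chain to the disk driver and then to the network protocol.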
5.6 Providing Guaranteed CPU Access

In the following, we describe a new scheduling mechanism that provides guaranteed CPU access.
5.6.1 Real Time Upcall (RTU)
Figure 5.11: The model for user level Real Time Upcalls (a user process registers a function fn() and period T via rtu_create(); the kernel RTU scheduler, layered above the CPU scheduler, upcalls fn() periodically)
The QOS guarantees required by multimedia applications are of two types: first, an application may require guaranteed periodic access to schedule a periodic activity such as protocol processing for TCP and UDP packets, multimedia and bulk data processing, and periodic data retrievals from storage systems. Second, the application may require near-instantaneous access to the CPU to process certain critical events with very low latency. For example, the arrival of certain network packets in a distributed simulation or a virtual reality game may require low latency processing. We use a novel CPU scheduling mechanism called Real Time Upcalls (RTUs), designed and implemented within our research group at Washington University [52], to provide such guaranteed CPU access to user and kernel level tasks. RTUs are an alternative to real-time periodic threads and have advantages such as low implementation complexity, portability, and efficiency. RTUs come in two flavors: periodic RTUs that guarantee periodic access to the CPU, and reactive RTUs which allow low latency access to the CPU. Figure 5.11 illustrates the basic concept behind periodic RTUs. A periodic RTU is a function in a user program that is periodically invoked by the kernel in real-time to perform a certain activity [52]. Unlike a traditional system call, where the user level program calls a kernel level function (a downward flow of control), in the case of RTUs the kernel calls a user specified function, hence the term upcall. The user process that needs guaranteed CPU access employs a new system call, rtu_create(), to create an RTU by specifying a function and the period with which it needs to be executed. Two other system calls, namely rtu_run() and rtu_suspend(), allow the user process to start and suspend an RTU. A reactive RTU is created in much the same way as a periodic RTU, using the rtu_create() call with a pre-specified RTU handler and zero period.
When a binding between an event-source and the RTU descriptor is established (using an ioctl() call), the RTU handler is activated every time an event occurs at the event-source. The current implementation of reactive RTUs supports binding between a TCP/UDP socket and an RTU descriptor and thus allows handlers to be activated in response to network packet arrivals. The RTU mechanism is implemented in a manner that does not require any changes to the existing UNIX scheduler implementation. The RTU scheduler is a layer above the UNIX scheduler that decides which RTU (and as a result which process) to run. It uses a variant of the Rate Monotonic (RM) scheduling policy. The main feature of this policy is that there is no asynchronous preemption. The resulting benefits are fewer expensive context switches, efficient concurrency control, efficient dispatching of upcalls, and elimination of the need for expensive locking for concurrency control between RTUs [52]. Several examples of the effectiveness of RTUs in providing excellent QOS guarantees for
media processing and user-level protocol processing have been reported in [26, 52]. The RTU facility has been demonstrated to be useful for high performance user level protocol implementations. A more detailed discussion of the implementation and of related work in this area, such as Scheduler Activations [13], Processor Capacity Reserves [83], and Q-threads [69], can be found in [52, 54].
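To make the dispatching idea concrete, the following is a hypothetical user-space sketch of the RTU scheduler layer described above: a non-preemptive, rate-monotonic loop that, at each point in virtual time, runs the handler of the released RTU with the shortest period to completion. The names (rtu_desc, rtu_dispatch) and the use of integer virtual time are illustrative, not the actual kernel implementation.

```c
#include <assert.h>
#include <stddef.h>

struct rtu_desc {
    void (*handler)(void *arg);
    void *arg;
    int   period_ms;       /* invocation period requested at rtu_create() */
    int   next_release_ms; /* next time the handler becomes runnable      */
    int   invocations;
};

/* Advance virtual time to `now` and dispatch every RTU whose release
 * time has passed, shortest period (rate monotonic) first.  There is
 * no asynchronous preemption: each upcall runs to completion. */
static void rtu_dispatch(struct rtu_desc *rtus, int n, int now)
{
    for (;;) {
        struct rtu_desc *pick = NULL;
        for (int i = 0; i < n; i++)
            if (rtus[i].next_release_ms <= now &&
                (!pick || rtus[i].period_ms < pick->period_ms))
                pick = &rtus[i];
        if (!pick)
            return;                    /* nothing runnable */
        pick->handler(pick->arg);      /* the upcall */
        pick->invocations++;
        pick->next_release_ms += pick->period_ms;
    }
}

static void noop(void *arg) { (void)arg; }
```

Because selection is by period rather than by arrival order, a 10 ms RTU is always served before a 33 ms RTU released at the same instant, mirroring the RM policy used by the real scheduler.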
5.6.2 A MOD Playback Server with RTUs
We enhanced the MOD server described in Section 4.4 (see Figure 4.18) using RTUs. Specifically, the pre-fetching process and the send process in the enhanced web server each used an RTU to perform the prefetch and send operations. Since the MOD server streams full-rate video at 30 frames per second, the RTU period is set to 33 msec. The video and audio quality improves dramatically and is very close to that of VHS-quality playback from a VCR. We present experimental results that demonstrate the performance improvement resulting from the use of RTUs.
Performance under CPU and disk load
Figure 5.12: RTU deadline misses for different disk loads
We measured the performance of this MOD server in the presence of background load. We activated a multimedia session with a video and an audio stream and increased the load on the CPU by running a CPU-intensive background workload. Figure 5.12 illustrates the
percentage of deadlines missed in the prefetch as well as the send process as the background load is increased. The prefetch process blocks on every read and can potentially suffer deadline misses if disk reads take longer than 33 msec. However, in this particular test the load on the disk is light, and we see that neither the prefetch nor the send process misses any RTU invocations. As before, we then simulated background disk load by running a process that continuously copies a large file to another file. Multiple copies of this same process, copying different files, simulate higher disk loads. Figure 5.12 illustrates the deadline misses measured in the prefetch as well as the send process. We can see that as the disk load increases, the fraction of deadlines missed also increases proportionately. This happens because the prefetch process of the MOD server makes blocking disk read calls, which take unpredictable amounts of time to complete as the disk load increases. This in turn causes the send process to find incomplete buffers on which send operations cannot be performed. Thus, even if RTUs provide guaranteed CPU access to the prefetch process, the lack of guarantees from the storage system renders the CPU guarantees of limited use.
5.7 Design of the mmbuf buffering system
Figure 5.13: New Multimedia Memory Buffer (mmbuf)
Figure 5.13 shows the data structure of an mmbuf, which is a superset of an mbuf and the BSD buffer cache block. Each mmbuf consists of a header and a data buffer. The mmbuf header consists of the following four parts:
1. Mbuf header: The struct mbuf field in the mmbuf header represents the mbuf header. It is used to store the information required to send the data stored in the mmbuf to the network.
2. Buffer cache header: The struct buf field in the mmbuf header represents a buffer cache block header. It is used by the file system to read data into the buffer.
3. Pointer to a buffer manager: The mmbuf header maintains a pointer, bmptr, through which a buffer manager in the kernel can be accessed. The mmbuf can be in four different states: empty, full, read in progress, and send in progress. This manager manages the mmbuf's state and operations, and provides a handler, bm_iodone(), used by the file system to update the mmbuf's status.
4. Padding: A padding of 4 bytes is used to make the header size 256 bytes to avoid memory fragmentation.
Each mmbuf has a data cluster of one or more virtually contiguous pages associated with it (Figure 5.13 (a)). The maximum size of a cluster is a configurable parameter. Both the mbuf and buffer headers in the mmbuf header maintain a pointer to the data cluster. When data are read from the disk, the cluster is accessed through the pointer in the buffer header. When data are sent to the network interface, the same cluster is accessed through the mbuf header. Note that disk drivers can perform scatter-gather I/O on virtually contiguous clusters greater than a page in size. However, in systems that do not support Direct Virtual Memory Access (DVMA)1, the maximum size of an mbuf cluster is limited to a page. As a result, network drivers traditionally do not handle DMA of buffers greater than a page in size. This means that though an mmbuf with a 16 KB cluster can be passed as a single buffer block to the disk driver, it must be passed as a chain of 4 mmbufs to the network protocol stack. This is illustrated in Figure 5.13 (b). Since we unified the buffering structure for file I/O and network I/O, we have to support operations from both domains.
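The layout above can be sketched as a C declaration. The struct mbuf and struct buf sub-headers below are small stand-ins (the real BSD definitions are much larger), and the field sizes are illustrative; only the overall shape, the four-state buffer manager, and the 256-byte header are taken from the text.

```c
#include <assert.h>

struct mbuf_hdr { void *m_data; void *m_next; };  /* stand-in for struct mbuf */
struct buf_hdr  { void *b_data; int   b_flags; }; /* stand-in for struct buf  */

/* The four mmbuf states managed by the buffer manager. */
enum mmbuf_state { MMB_EMPTY, MMB_READING, MMB_FULL, MMB_SENDING };

struct mmbuf {
    union {
        struct {
            struct mbuf_hdr mb;    /* used when sending the cluster to the net */
            struct buf_hdr  bp;    /* used when reading the cluster from disk  */
            void           *bmptr; /* buffer manager tracking the mmbuf state  */
            int             bmid;
            char            info[96];
        } h;
        char pad[256];             /* padding keeps the header at 256 bytes    */
    } u;
    void *cluster;                 /* one or more virtually contiguous pages   */
};
```

The union guarantees the fixed 256-byte header size regardless of how the sub-headers grow, which is the anti-fragmentation property item 4 describes.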
The interface to the mmbuf system is described below (Figure 5.15).
Mmbuf system initialization: mmbinit(): At system boot time, the initialization routine in the kernel sets up a separate submap (a pagemap, mmb_map, in the kernel virtual address space) for mmbuf data clusters. It also invokes the mmbuf initialization function, mmbinit(), to allocate wired-down
1. Sun machines support DVMA, while PCI-based Intel machines do not.
Figure 5.14: Mmbuf system free lists
Figure 5.15: Invocation of mmbuf interface functions
(non-pageable) pages in the mmb_map. The mmbuf system maintains two lists of cluster descriptors (Figure 5.14, (1) and (2)). The first list, called the cluster list, is accessed through the clpool pointer and contains descriptors that have a cluster of MMCLBYTES size associated with them. The second list, called the descriptor list, is accessed through the noclpool pointer and contains empty descriptors with no associated cluster. Initially, 16 mmbuf data clusters are put on the cluster list.
Allocating and deallocating an mmbuf (mmget(struct mmbuf *mp, int flag) and mm_free(struct mmbuf *m)): The MMGET routine allocates an mmbuf with an associated data cluster. It removes the cluster from the descriptor at the head of the cluster list and inserts the free descriptor at the head of the descriptor list for reuse. The relevant attribute fields in the mbuf and buffer header portions of the mmbuf header are initialized before it is returned. A similar
function, mm_getchain(), allocates a chain of (MMCLBYTES / NBPG) mmbufs, each pointing to a page in the cluster. If a cluster allocation is attempted when no data clusters are available on the cluster free list, more wired-down data clusters are allocated from the mmb_map virtual memory map and put onto the list. An mmbuf is deallocated using the mm_free() function, which removes a free descriptor from the descriptor list, initializes it with the cluster to be freed, and inserts the cluster at the head of the cluster list.
Using mmbufs in file I/O: Since file I/O uses the mmbuf as a buffer cache block, a macro MTOB(void *m) is provided which, given the mmbuf pointer, returns the pointer to the file system buffer header. Before an mmbuf is passed on to the disk driver, in addition to the standard attributes that are set in the b_flags field of the buffer header for normal reads, two new flags, B_CALL and B_MMBUF, are also set. The B_CALL flag indicates that when the data has been read from the disk into the data cluster, a custom handler, bm_iodone(), should be invoked before the standard b_iodone() processing. This new handler performs the buffer manager function and changes the state of the mmbuf from empty to full.
Using mmbufs in network operations: It is desirable that network protocol routines and interface drivers be able to transparently use mmbufs as regular mbufs without requiring significant code changes. Typically, network protocols compose packets by adding protocol headers or trailers, in the form of mbufs, to the head or tail of an existing mbuf chain containing data. Since the mbuf header in an mmbuf describing the associated data cluster is identical to a stand-alone cluster mbuf header, the same operations can be carried out even if the data is in an mmbuf chain. However, one crucial difference is that the data cluster of an mmbuf is allocated from a page pool different from the one used for cluster mbufs.
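The two free lists can be sketched as follows: this is a simplified, user-space model of the cluster list (clpool) and descriptor list (noclpool) described above, in which allocation detaches a cluster from a descriptor and deallocation reattaches one. malloc() stands in for wired-down kernel page allocation, and the lower-case names are illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MMCLBYTES 16384            /* cluster size, as in the text */

struct cldesc { struct cldesc *next; void *clp; };

static struct cldesc *clpool;      /* descriptors holding a free cluster   */
static struct cldesc *noclpool;    /* descriptors with no cluster attached */

static void push(struct cldesc **l, struct cldesc *d) { d->next = *l; *l = d; }
static struct cldesc *pop(struct cldesc **l)
{ struct cldesc *d = *l; if (d) *l = d->next; return d; }
static int count(struct cldesc *l) { int n = 0; for (; l; l = l->next) n++; return n; }

/* mmbinit(): seed the cluster list with n clusters. */
static void mmbinit(int n)
{
    for (int i = 0; i < n; i++) {
        struct cldesc *d = malloc(sizeof *d);
        d->clp = malloc(MMCLBYTES);   /* stands in for wired-down pages */
        push(&clpool, d);
    }
}

/* MMGET: detach the cluster at the head of the cluster list and park
 * the now-empty descriptor on the descriptor list for reuse. */
static void *mmget(void)
{
    struct cldesc *d = pop(&clpool);
    if (!d) { d = malloc(sizeof *d); d->clp = malloc(MMCLBYTES); } /* grow pool */
    void *cl = d->clp;
    d->clp = NULL;
    push(&noclpool, d);
    return cl;
}

/* mm_free(): take an empty descriptor, attach the cluster, and put it
 * back at the head of the cluster list. */
static void mm_free(void *cl)
{
    struct cldesc *d = pop(&noclpool);
    if (!d) d = malloc(sizeof *d);
    d->clp = cl;
    push(&clpool, d);
}
```

Because both lists recycle descriptors, a steady-state workload allocates no new memory: clusters and descriptors simply shuttle between the two heads.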
The mmbuf chains are allocated using the memory allocation routines of the mmbuf system when data fetch requests are generated. The network I/O routines can add mbufs (allocated by the standard mbuf allocation routines) as protocol headers or trailers to such chains. In a typical packet send operation, the network interface driver calls the mbuf memory deallocation function m_freem() after the packet is copied to the network interface card. It is desirable that the driver retain the same call to free a mixed mbuf and mmbuf chain. However, the deallocation of an mmbuf must return the associated cluster pages to the mmbuf page pool. To achieve these two objectives, we make use of an interesting feature in the original mbuf design which allows a pointer to an external function to be called when the associated
cluster page is to be freed. The current mbuf implementation does not use this pointer; instead, it uses a stand-alone function. Our design exploits this unused feature: we set this pointer to our own mm_free() handler routine when we initialize an mmbuf. The mbuf routines such as m_copy() and m_copym() reference global variables of the mbuf's cluster page pool. Therefore, these routines need to be modified to ensure that they differentiate between an mmbuf and an mbuf and update the global variables for the different page pools consistently. These routines are commonly used in transport protocols, such as TCP, that support retransmissions. In our current MOD testbed, we use the AAL5 and AAL0 native-mode ATM protocols, which have UDP semantics and hence do not require these functions. Therefore, our current mmbuf implementation does not yet support these enhanced functions, which would be required for the mmbuf system to be used with TCP. The necessary change is simple: if the MM_MMBUF flag is set in the mbuf, the reference counter for mmbuf clusters is updated; otherwise, the reference counter for mbuf clusters is modified. We believe it is straightforward to incorporate this change in our current design.
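The external-free hook can be illustrated with a small sketch. The struct below is a cut-down stand-in for an mbuf (the real BSD external-storage header differs); the point it demonstrates is that a single m_freem() walk can route each cluster back to the correct pool through a per-buffer function pointer, which is how the driver keeps its unmodified free call.

```c
#include <assert.h>
#include <stddef.h>

struct xmbuf {
    struct xmbuf *m_next;
    void  *m_ext_buf;                /* attached cluster                  */
    void (*m_ext_free)(void *buf);   /* NULL means ordinary mbuf cluster  */
};

static int mbuf_pool_frees, mmbuf_pool_frees;
static void mclfree(void *buf) { (void)buf; mbuf_pool_frees++;  } /* default     */
static void mm_free(void *buf) { (void)buf; mmbuf_pool_frees++; } /* mmbuf hook  */

/* m_freem(): walk the chain, releasing each cluster through its hook.
 * The network driver calls this one function for mbuf and mmbuf
 * chains alike; no driver code changes are needed. */
static void m_freem(struct xmbuf *m)
{
    for (; m; m = m->m_next) {
        if (m->m_ext_free)
            m->m_ext_free(m->m_ext_buf); /* mmbuf cluster -> mmbuf pool */
        else
            mclfree(m->m_ext_buf);       /* ordinary cluster -> mbuf pool */
    }
}
```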
5.8 Periodic QoS Guarantees from the Storage System In this section, we describe fair queuing of disk requests over multiple priority classes within the SCSI driver. We also discuss the CCD software disk array, which can be combined with the mmbuf system and the SCSI QOS mechanisms to achieve much higher disk-to-network throughput.
5.8.1 Priority Queuing within the SCSI Disk Driver A typical SCSI disk driver in 4.4 BSD UNIX uses the three-layered software organization illustrated in Figure 5.16. The first layer is the generic SCSI driver, sd, which provides the abstraction of a job queue to handle disk requests from the file system and other kernel components. The second layer is the intermediate SCSI driver, which converts disk read/write requests from the generic driver into appropriate SCSI commands and forwards them to the last layer, the host-bus-adapter (HBA) SCSI controller specific driver. The lowermost layer controls the HBA hardware and handles the interrupts generated in response to SCSI commands. The HBA performs transactions over the SCSI bus to request read (write) operations from the SCSI disk. It also receives (sends) data from the disks and DMAs it to (from) the appropriate kernel
Figure 5.16: The new priority queuing SCSI system
buffer passed down in the disk request. Note that the generic and intermediate drivers are independent of the HBA-specific driver. In our server prototype, we use the AHC-3940 dual-channel SCSI HBA from Adaptec. The existing queuing mechanism (Figure 5.16 (1)) in the SCSI driver consists of a single request queue maintained by the generic SCSI driver and sorted using a disk scheduling algorithm such as the elevator algorithm. This queue is serviced by an event-driven service function, which is invoked when a new request is received for an idle disk or when an ongoing disk read/write request completes. This function drains requests from the request queue until the HBA request queue is full. Since multimedia retrieval requests compete with ordinary delay-tolerant non-real-time requests to the disk, the lack of request differentiation results in a lack of service guarantees from the storage system. The multi-priority queuing structure that rectifies this problem is illustrated in Figure 5.16 (2). This enhanced queuing mechanism supports multiple job queues with different priorities, each representing a single service class. Each job queue may be ordered using a simple FCFS policy or a more sophisticated class-specific policy such as Rate Monotonic (RM), Earliest-Deadline-First (EDF), or SCAN-EDF. The job queue with the lowest priority, called the NRTQ, is used for regular non-real-time requests, such as those generated by the existing read/write() system calls. The jobs for the other queues are generated by continuous media applications that need QOS guarantees. The service class specified in the disk request decides the priority queue to which the job is assigned. A QOS-aware
application can dynamically change its service class or be statically assigned to a fixed service class when it is initialized. Every time an ongoing disk request completes, the driver invokes a job selector, which consults a resource allocation policy to extract the next job from an appropriate queue. The job selector must satisfy two requirements:
Fair resource allocation: First, it must ensure that none of the service classes is starved, i.e., denied access to storage bandwidth for an unbounded amount of time. In other words, it must use a resource allocation policy that guarantees fair sharing of storage bandwidth among all priority classes. This policy should be work conserving, i.e., if at any given moment only jobs of a particular class are present, they must get the full storage bandwidth.
Efficient disk access: Second, the requests to be processed must be selected in such a way that the seek and rotational latencies of disk accesses are minimized and disk utilization is maximized.
(In the figure, the intermediate work queue IWQ and the work queue WQ are physically implemented as one queue.)
Figure 5.17: Service rounds and two-level disk scheduling
In our design, we decouple these two objectives by employing a job selector that proceeds in service rounds and uses the two-level queuing scheme illustrated in Figure 5.17. In each service round, the selector extracts jobs from the multiple Level-1 job queues, as per a fair resource allocation policy, to form a Level-2 work queue for the round. The disk always drains requests from the work queue in FIFO order, much like the request queue in the present disk driver. The work queue is ordered using an efficient disk scheduling algorithm, such as the Grouped Sweep Scheduling (GSS) [125], SCAN-EDF [96], or Symphony [102] algorithm, aimed at satisfying the real-time constraints of multimedia data accesses and optimizing disk utilization. In each round, we achieve fair allocation of storage bandwidth to the multiple queues by employing fair queuing algorithms originally devised in the context of fair sharing of a communication link among several data flows, each with its own packet queue. Specifically, we propose using a simple fair queuing algorithm called Deficit Round Robin (DRR) [104].
Figure 5.18: DRR fair queuing for a communication link
The basic idea of DRR is illustrated in Figure 5.18. Under DRR service, the queues are serviced in round robin fashion, and in each round each queue is provided a fixed quantum of service. Consider the example of a transmission link that uses DRR service. The service quantum Qi in this case is defined in terms of the number of bytes of data that may be transmitted in the round from the ith queue. If the current value of the quantum counter Qi is at least the size k of the packet at the head of the queue, the packet is transmitted and the counter is decremented by k. Packets are drained from the ith queue until Qi is less than the size of the packet at the head of the queue. Any remaining quantum, i.e., the deficit of service that could not be used in the current round, is carried over to the next round. The quantum size must be set to at least the maximum packet size over all flows
to minimize delay. Also, the per-flow/queue quantum values need not be identical; if chosen to be different, they result in weighted DRR fair queuing. The advantages of DRR are that it is fully work-conserving, requires only O(1) time to process each packet, and is simple and inexpensive to implement.
Figure 5.19: DRR fair queuing in a SCSI driver
In order to adapt the DRR algorithm for a disk driver with multiple queues, we note that disk read/write requests are always issued in multiples of the smallest block size, typically a 512-byte disk sector, and the maximum size of each request is limited to 64 KB. Each disk request carries the size of the read/write request, in terms of the number of sectors, in its header fields. Therefore, we can define the quantum of service offered to a request queue in terms of the number of sectors read in a round. As shown in Figure 5.19, each queue i is assigned a quantum Qi, which can be set statically or changed over time to achieve adaptive resource allocation. The work queue is formed at the start of each service round by removing jobs from the queues until the quantum associated with each queue is exhausted or until no disk requests are left. If a queue does not have any jobs at the moment the work queue is formed, its quantum is not carried over to the next round.
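The round structure above can be sketched in C. This is a simplified, user-space model of DRR work-queue formation with sector quanta: each queue i receives Qi sectors of credit per round, drains requests while the credit covers the request at the head, carries unused credit over while it stays backlogged, and forfeits it when it goes idle. The name form_workqueue and the fixed-size queue representation are illustrative, not the driver's actual code.

```c
#include <assert.h>

#define MAXJOBS 16

struct jobq {
    int sizes[MAXJOBS];   /* request sizes in sectors, FIFO order */
    int head, tail;
    int quantum;          /* Q_i: sectors granted per round        */
    int deficit;          /* carried-over sectors                  */
};

static int q_empty(const struct jobq *q) { return q->head == q->tail; }

/* One service round: visit each priority queue once, draining it
 * while its deficit covers the request at the head.  A queue that is
 * (or goes) empty forfeits its deficit, as described in the text. */
static int form_workqueue(struct jobq *qs, int nq, int *work, int maxwork)
{
    int n = 0;
    for (int i = 0; i < nq; i++) {
        struct jobq *q = &qs[i];
        if (q_empty(q)) { q->deficit = 0; continue; }
        q->deficit += q->quantum;
        while (!q_empty(q) && q->sizes[q->head] <= q->deficit && n < maxwork) {
            q->deficit -= q->sizes[q->head];
            work[n++] = q->sizes[q->head];
            q->head++;
        }
        if (q_empty(q))
            q->deficit = 0;       /* no carry-over for an idle queue */
    }
    return n;  /* the work queue would next be ordered by disksort() */
}
```

Note how a large request parked behind a small quantum is not starved: the deficit accumulates across rounds until it covers the request, which is exactly the property that makes DRR fair for mixed request sizes.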
5.8.2 Implementation of DRR with two priority queues in NetBSD
Figure 5.20: Old sd_softc data structure
We describe a prototype implementation of DRR fair queuing with two priority classes, a non-real-time class and a real-time class, in the current BSD UNIX. The requests generated by the standard read/write or readv/writev system calls are enqueued in the non-real-time request queue, whereas the requests generated by the new stream_read() API are queued in the real-time queue. The following modifications were made to the SCSI driver:
Modifications to the sd_softc data structure: Figure 5.20 illustrates the sd_softc structure in the existing generic SCSI driver, which captures the software state of a SCSI disk. Figure 5.21 illustrates the changes made to this structure to support DRR fair queuing. Specifically, we added two new queues, the rtqueue and the workqueue. Two new state variables, rtqnt and nrtqnt, keep track of the quantum values for the real-time and non-real-time queues. Two other state variables, max_rtqnt and max_nrtqnt, record the maximum amount of data, in terms of the number of sectors, read for each queue in a work round. These variables control the resource allocation and the delay experienced by requests in each queue. They can be dynamically altered by a user level application, such as a MOD server or a web server, or by a resource allocation entity (process) in the system, using an ioctl() call.
Modified enqueue procedure: When a buffer is passed to the driver's enqueue routine sdstrategy(), the driver checks the buffer flags to see if the buffer's B_MMBUF flag is set. If the flag is not set, the buffer is enqueued in the non-real-time queue; otherwise, it is inserted into the appropriate real-time request queue.
Figure 5.21: Modified sd_softc data structure
In the presence of multiple priority queues, the buffer carries information about its priority class, which is used to queue it to the appropriate queue. The class information can be stored in the file entry structure when the stream is created and can be dynamically changed by the application via an ioctl() call.
Modified dequeue procedure: In the current SCSI driver, the sdstart() routine dequeues buffers from the buf_queue job queue and issues commands to the lower level disk driver. This routine is invoked when a new request is received or when a request in progress completes. In our new design, when there is spare capacity in the SCSI adaptor queue, sdstart() drains the jobs in the work queue and sends SCSI commands to the lower level driver. However, if the work queue is empty, it invokes the sd_form_workqueue() routine, which uses the rtqnt and nrtqnt counters to extract jobs from the rtqueue and nrtqueue. These jobs are sorted into the work queue using the standard disksort() function. If there are no jobs in the real-time queue and max_nrtqnt is set to a small value compared to max_rtqnt, the work queue gets formed more often. To minimize this overhead, the driver can monitor the rtqueue occupancy and adapt the quantum allocation.
5.9 Streams API We have designed an API consisting of a new set of system calls that allow applications to access mmbufs and real-time guarantees from the SCSI driver for network-destined disk retrievals. A novel feature of these calls is that they allow aggregation of multiple read/send requests for the same or different active streams into a single system call, much like a super-call [38]. Such aggregation significantly reduces system call overheads, especially under heavy loads. The streams API supports the following four main functions:
Figure 5.22: State created by the stream_open() call
1. stream_open(filename, nochains): The stream_open() call, like a traditional open() system call, opens a file, initializes the file entry structure, installs a pointer to it in the process file descriptor table, and returns the index of its location. Figure 5.22 illustrates the state created upon successful completion of the stream_open() call. The stream_open() function allocates a buffer manager object and initializes it using the initialization function bm_init(). This function allocates an array of chain_manager structures with nochains entries and initializes each chain to the BUF_EMPTY state. The f_data field of the file structure is initialized to point to the buffer manager, and the f_ops field is initialized to the streamops array listing the stream functions.
The nochains parameter of the stream_open() call effectively controls the amount of pre-fetching and thus the number of outstanding disk requests that a process can issue. The size of these chains can be dynamically changed. At present, stream_open() is supported only for files on local storage. However, extending it to files on a network-resident remote file store is straightforward.
2. stream_read() and stream_send(): The stream_read() call takes a set of descriptors opened by stream_open() and a streamstate array as arguments. For each descriptor, it changes the state of the associated empty mmbuf chains to reading and initiates data reads. The stream_send() call also takes a set of descriptors and a streamstate array as arguments and, for each descriptor, initiates send operations on mmbuf chains that hold valid data read by stream_read(). It also changes the state of each such mmbuf chain from full to sending. Note that a separate stream_rdsnd() call that combines these two calls is also available to achieve further aggregation. All these system calls support blocking (synchronous) and non-blocking (asynchronous, polled) semantics. Two other system calls, stream_recv() and stream_write(), are counterparts of the stream_read() and stream_send() calls and provide zero-copy data transfers from the network to the disk. Our current prototype, however, does not support these calls. Also, note that if no application-level flow control is desired on the data flow between the disk and the network, stream_open() can be instructed to set up the data flow in spliced mode, in which a binding is established between the disk file descriptor and the network socket descriptor. In this mode, when a disk read completes, in addition to the appropriate state change operations on the mmbuf chain, the kernel invokes the send functions to immediately queue the chain to the corresponding network socket. Such splicing reduces system call overhead even further. 3.
stream_poll(): Using this system call, a user level application can poll the state of the mmbuf chains associated with multiple open stream descriptors on which a stream_read() or a stream_send() has been issued.
4. stream_close(fd): This system call closes a stream set up by a previous stream_open() call. It ensures that any ongoing disk-to-network I/O is completed before the descriptor and the associated mmbuf chains are released.
We used these system calls in the experiments described in Section 5.11 and in building the high performance MOD servers described in Chapters 6 and 7.
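The per-chain state machine driven by these calls can be sketched as a user-space mock. The real calls are system calls operating on kernel mmbuf chains; the stubs below only model the state transitions (empty, reading, full, sending) so that the aggregation pattern, one call covering every eligible chain of a stream, is visible. All names that are not in the text (complete_reads, the struct stream layout) are illustrative.

```c
#include <assert.h>

enum chain_state { BUF_EMPTY, BUF_READING, BUF_FULL, BUF_SENDING };

#define MAXCHAINS 8
struct stream { enum chain_state chains[MAXCHAINS]; int nochains; };

/* stream_open(): allocate nochains chain slots, all initially empty. */
static void stream_open(struct stream *s, int nochains)
{
    s->nochains = nochains;
    for (int i = 0; i < nochains; i++)
        s->chains[i] = BUF_EMPTY;
}

/* stream_read(): one "system call" starts disk reads on every empty
 * chain; returns how many reads were aggregated into the call. */
static int stream_read(struct stream *s)
{
    int started = 0;
    for (int i = 0; i < s->nochains; i++)
        if (s->chains[i] == BUF_EMPTY) { s->chains[i] = BUF_READING; started++; }
    return started;
}

/* Stand-in for bm_iodone(): disk I/O completions mark chains full. */
static void complete_reads(struct stream *s)
{
    for (int i = 0; i < s->nochains; i++)
        if (s->chains[i] == BUF_READING) s->chains[i] = BUF_FULL;
}

/* stream_send(): queue every full chain to the network in one call. */
static int stream_send(struct stream *s)
{
    int sent = 0;
    for (int i = 0; i < s->nochains; i++)
        if (s->chains[i] == BUF_FULL) { s->chains[i] = BUF_SENDING; sent++; }
    return sent;
}
```

With nochains = 4, one stream_read() covers four outstanding prefetches that would otherwise cost four read() system calls, which is the overhead reduction the super-call design targets.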
5.10 Concatenated disk driver (ccd)
Figure 5.23: Parallel disk I/O with ccd
The Concatenated Disk Driver (CCD) is disk striping software developed by Jason Thorpe [108]. It allows one or more 4.4 BSD disks or disk partitions, of the same or different sizes, to be combined into a single virtual disk, or software disk array. As shown in Figure 5.23, the data stored on this virtual disk is striped across the component disks; therefore, retrievals that are multiple striping units in size result in parallel I/O from multiple disks. The CCD provides a near-linear increase in write throughput and a sub-linear increase in read throughput as the number of disks in the software disk array is increased. In the following section, we show that this increased throughput, combined with the new mmbuf system, provides a significant improvement in disk-to-network data throughput.
Figure 5.24: Interaction of mmbuf and ccd
Figure 5.24 shows the interaction between the read layers of the mmbuf system and the CCD. When an mmbuf is passed to the CCD layer, the B_CALL flag in the mmbuf header is set to ensure that the custom bm_iodone() routine is called when the I/O on the virtual disk is complete. The CCD layer splits this buffer into multiple buffers, copies the original buffer header to each buffer, and, using a striping information table, enqueues the requests on the job queues of the component disks. The handler routine for these buffers is set to the function ccdiodone() from the CCD code. When all these buffers are filled, the CCD concatenates them back into one buffer and copies the header of the original buffer to its header. Finally, it calls the custom bm_iodone() function from the original buffer to update the status of the mmbuf. Note that the CCD differs from commercial RAIDs in many ways. Unlike RAIDs, which perform striping using a hardware controller, the CCD is a software disk array. Other than simple mirroring, the CCD does not support any of the data redundancy techniques supported by the various RAID levels (3, 4, and 5). Also, at present, a CCD can be composed only from component disks or partitions with BSD file systems. We believe that though the CCD is sub-optimal, it represents a very simple and cost-effective way to build small disk arrays.
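The split/join path described above can be modelled in a few lines. This user-space sketch splits one virtual-disk read into stripe-unit component buffers assigned round-robin across the disks, counts down completions in ccdiodone(), and fires the original buffer's bm_iodone() hook only when the last component finishes; the names come from the text, but everything else (two disks, an 8 KB stripe, synchronous "completions") is a simplifying assumption.

```c
#include <assert.h>

#define NDISKS 2
#define STRIPE 8192            /* striping unit / component buffer size */

static int reads_per_disk[NDISKS];
static int bm_iodone_calls;

static void bm_iodone(void) { bm_iodone_calls++; } /* B_CALL hook: mmbuf -> full */

struct ccdbuf { int remaining; void (*orig_iodone)(void); };

/* Per-component completion handler: the last one joins the split
 * buffers and invokes the original buffer's handler. */
static void ccdiodone(struct ccdbuf *cb)
{
    if (--cb->remaining == 0)
        cb->orig_iodone();
}

/* Split a read of `bytes` (starting at virtual offset 0) into stripe
 * units striped round-robin across the disks, then simulate the
 * component I/O completions. */
static void ccd_read(int bytes)
{
    int units = bytes / STRIPE;
    struct ccdbuf cb = { units, bm_iodone };
    for (int u = 0; u < units; u++)
        reads_per_disk[u % NDISKS]++;  /* striping table lookup, simplified */
    for (int u = 0; u < units; u++)
        ccdiodone(&cb);                /* each component read completes */
}
```

A 32 KB request thus lands as two 8 KB reads on each of the two disks, which is the parallelism behind the throughput gains measured in Section 5.11.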
5.11 Performance evaluation In this section, we describe the experiments carried out to characterize the performance benefits of our solutions. We have successfully implemented the mmbuf system, the fair priority queuing in the SCSI driver, and the new stream API (system calls) in the latest release of NetBSD. These enhancements have also been integrated with the CCD driver, the RTU mechanism, and a locally developed driver for the ATM interface from Efficient Networks [37]. The CCD striping driver is an inexpensive way to concatenate multiple disks and operate them as a software disk array. It provides a sub-optimal, yet almost linear, increase in disk throughput. We have also prototyped experimental single-node and distributed multi-node MARS video servers using these enhancements [28]. In all the experiments described here, we used a 200 MHz Pentium PC with 128 MB RAM, an ENI ATM interface, and an Adaptec dual-channel AHC-3940 SCSI adaptor, running the enhanced NetBSD 1.3 kernel. We used two 9 GB Seagate Barracuda SCSI disks, each with a rotational speed of 7200 RPM and an internal transfer rate of 80-124 Mbps. The FFS file system used in our measurements had a block size of 8 KB and a fragment size of
1 KB. However, the results reported hold equally well for file systems with different values of these parameters.
Figure 5.25: MMBUF data path from a fast disk
5.11.1 Performance benefits of mmbufs and stream API The purpose of this experiment is to demonstrate that use of the mmbuf system and the stream API provides a significant performance gain over the standard read/send() data path. To this end, we created two test programs, NRT and RT. The NRT program sequentially reads a large video file and sends it over a native-mode ATM (NATM) connection using the standard read/send() data path. The second program, RT, performs the same task but uses the mmbuf based data path by employing the stream API. It uses the nochains parameter of the stream_open() call to control the number of outstanding disk requests in the pipeline. We call this parameter the Fetch Level (F). Note that in this experiment, the stream_read() calls in RT are completely blocking and thus have the same semantics as the reads in NRT. Also, the stream_read() calls in RT always result in disk requests. This is in contrast to the regular reads in NRT, which, due to the look-ahead pre-fetching performed by the kernel, may be satisfied out of the buffer cache. For both test programs, we measured the total time taken to read and send data from files of different sizes. We first performed these measurements on an 8 MB FFS memory file system using an 8 MB video file. Since the size of the memory file system is limited, we simulated large file reads by repeatedly reading the same file. Figure 5.25 illustrates the results of total time vs. file size for the mmbuf data path with different levels of pipelining and for
the ordinary data path. We can clearly see that for a 1072 MB file, the mmbuf based data path with F = 3 results in a dramatic 32 % improvement in throughput over the ordinary data path. With F = 32, the performance improves by 41 %.

Table 5.1: Read time for normal read/send and streamread/send

  File Size (MB)   read/send on sd (sec)   read/send on CCD (sec)   streamread/send on CCD (sec)
  40               4.0                     3.0                      2.38
  120              11.13                   8.03                     7.01
  200              19.01                   13.00                    11.77
  280              26.0                    18.50                    16.39
  360              33.0                    22.41                    20.38
  Throughput       10.90 MBps              16.064 MBps              17.65 MBps
Figure 5.26: MMBUF data path from a regular disk

We repeated this experiment on a file system created on a 2-disk CCD. The CCD breaks each 32 KB read request into four 8 KB buffers and alternates reads between the two disks. This parallelism in the data reads reduces the disk I/O time for the same amount of data. In this case, the data throughput for streamread/send is 17.65 MBps, which is 10 % faster than normal read/send on the CCD, and 60 % faster than read/send on sd. The results are shown in Table 5.1 and Figure 5.26. We observed consistently lower completion times for streamread/send on CCDs than for read/send; the improvements varied from 5 to 15 %.
From these experiments, we can see that the I/O pipelining combined with the minimization of data copies on the mmbuf based data path leads to a definite throughput improvement. With slow disks, the disk I/O time is much larger than the time spent in data copies and therefore the throughput improvements are small. On the other hand, for very high bandwidth disks, the I/O time is comparable to the time spent in data copies and therefore the throughput improvements are dramatic. Clearly, continuing improvements in the throughput of disks and disk arrays suggest that the mmbuf based zero-copy data path will be necessary to build high performance MOD servers and services.
5.11.2 QOS guarantees in SCSI
Figure 5.27: Setup for Experiment 2

The purpose of this experiment is to demonstrate that the enhanced SCSI driver with DRR fair queuing provides QOS guarantees in the form of guaranteed bandwidth and delay. The measurements in this experiment were done on a file system created on a 2-disk CCD, as in Experiment 1.
We created two test programs (Figure 5.27): the RT program uses the mmbuf based data path to read a large video file and send it over an ATM connection. It can be run in continuous mode or set to read a fixed amount of data. The NRT program performs similar tasks but uses the ordinary read/send based data path. In the first experiment, we measured the time it took for NRT to read a fixed amount of data in the presence of multiple copies of RT issuing streamread/send requests. We allocated a fraction p of the disk bandwidth to the real-time queue and the remaining fraction q to the non-real-time queue. We measured the total read time for various loads and repeated the measurements for different values of p and q.
Figure 5.28: NRT read time vs. real-time load

Specifically, we controlled the ratio of the quantums Q_R and Q_NR allocated to the two queues. Note that if X = Q_R / Q_NR, the fraction p = X/(X+1) of the disk bandwidth B_D is allocated to the real-time queue, whereas the fraction q = 1/(X+1) is allocated to the non-real-time queue, and p + q = 1. Figure 5.28 illustrates the read time for NRT to read 81 MB of data as the load on the real-time queue is increased from 1 to 6 processes in steps of 1. We can clearly see that when q = 0.95, almost the entire disk bandwidth is allocated to the NRT queue and hence the read time is fairly independent of the real-time load. The jump in read time when real-time load is present, relative to the no-load case, is attributed to the sharing of CPU and disk bandwidth. We can also see that when q = 0.05, the NRT read time increases almost linearly as its bandwidth share drops with the increase in the real-time load, which consumes more and more of the 95 % of the bandwidth allocated to it. Figure 5.29 shows the read time for NRT as the bandwidth allocated to the NRT queue is changed. The real-time load in this measurement was fixed to one video stream at
Figure 5.29: NRT read time vs. RT fraction allocated

30 fps and one full rate stereo audio stream. The read time for an 81 MB file was measured for different values of p (and q). We can see that as the share of the real-time bandwidth is increased, the share of the non-real-time bandwidth decreases proportionately and the read time for NRT increases (i.e., its throughput decreases). We performed similar experiments to study the effect of the non-real-time load on the real-time reads. Specifically, we measured the total time taken to read a 270 MB file using the stream API in the presence of disk read/writes using the standard API. We simulated the background storage load by running multiple copies of a program that continuously reads a large disk file and writes it to a new file using the read/write() API. Figure 5.30 illustrates the results of our experiment. We can see that the total read time for the stream based file read remains insensitive to the disk load. Also, as the disk bandwidth allocation of the RT queue is decreased by a factor of 4, from 80 % to 20 %, the total read time increases by approximately the same factor (from 167 sec to 711 sec). The read time is also sensitive to the minimum quantum size Qmin and the size of the reads. With a 20 % RT allocation, a Qmin of 16 sectors and 32 KB reads yield a higher read time (711 sec) than a Qmin of 128 sectors and 128 KB read requests. A larger Qmin combined with large reads results in more data being admitted for each read in a service round but increases the service round length. Longer service rounds optimize disk utilization but increase request latency for real-time streams. This suggests that with small RT requests, Qmin should be kept small to keep the request delay low and to prevent NRT requests from swamping out RT requests. On the other hand, with large RT requests, Qmin should be larger.
Figure 5.30: RT read time vs. non-real-time load

These experiments illustrate that by dynamically controlling p and q, the disk bandwidth allocated to the real-time and non-real-time streams can be controlled.
5.11.3 Periodicity of User Level Data Transfers

The purpose of this experiment is to demonstrate that by using RTUs and the enhanced file system, a user level process (such as a web server) can obtain QOS guarantees in the form of predictable storage and CPU access. To this end, we created a test program RT that schedules periodic streamrd/send() calls using an RTU to stream data from a file on a 2-disk CCD array to an ATM connection. The program is capable of managing multiple streams of the same or different periods. The DRR allocation in all these experiments was set to 80 % RT and 20 % NRT.
Figure 5.31: Buffering scheme used by the test programs

The RT program co-ordinates the data prefetch and data send activities for each stream using the per-stream buffer structure illustrated in Figure 5.31. Specifically, it maintains a circular ring of F buffers and two pointers – a qfetch read pointer and a qsnd
send pointer. Each of the ring buffers is an mmbuf chain which can be in one of four states – empty, reading, full, sending. If the ring buffer pointed to by the qfetch pointer is in the empty state, a streamread operation is scheduled on it and the pointer is advanced to the next buffer. Similarly, if the buffer pointed to by the qsnd pointer is in the full state, a streamsend operation is scheduled on it and the pointer is advanced (mod F). The RT program uses an input file that defines the number of streams and, for each stream, the disk file and the VCI identifier for the ATM connection. It sets up each stream using the streamopen and socket calls and primes the ring buffer by reading the first N stream blocks. In our first set of experiments, we created multiple identical disk-to-network streams. Each of these streams requests 32 KB data transfers every 30 msec and thus requires a constant disk-to-network bandwidth of 8.73 Mbps. We measured the fraction of deadlines missed and the amount of bandwidth obtained by the streams for different levels of pre-fetching (F = 2, 4, 8, 12) over a 4 minute period.
Figure 5.32: Deadline miss probability for requests on the same storage

Figure 5.32 illustrates the maximum deadline miss probability for different values of F as the number of connections is increased. Also, Figure 5.33 shows the deltaBW, computed as the difference between the requested bandwidth and the actual bandwidth obtained. We can see that as the amount of prefetch buffering is increased, the deadline miss probability decreases dramatically and the bandwidth obtained closely matches the requested bandwidth. The 2-disk CCD we used can provide a sustained 35 Mbps throughput, or four 8.73 Mbps connections, with F = 8. With F = 12, the same CCD supports 43 Mbps, or five
8.73 Mbps connections, with negligible deadline misses. With more than 5 such active connections, even a large prefetch buffer results in a high maximum deadline miss probability.
Figure 5.33: Loss of bandwidth for connections on the same storage

We also ran a standard benchmark called iozone to measure the application level sequential read/write throughput of a file system. This benchmark creates a large file using the standard open/write calls and reads it back using the open/read() calls. It reports read and write throughput based on the time it takes to complete the read and write tasks. Clearly, when performing such sequential reads, the seek and rotational latencies incurred are minimal and therefore the throughput readings represent the best case. In an iozone benchmark test on our test CCD, we measured 45 Mbps read throughput with a 512 MB file. Comparing this to our periodic stream measurements above, we can see that with a reasonable amount of pre-fetch buffering we are able to extract close to 85 % of the best case file system throughput. We can also see that accessing files from different storage improves the deadline miss performance and allows connections to obtain almost exactly the requested bandwidth. This demonstrates that higher parallelism in the storage system improves throughput and delay performance.
5.12 Related Work

In the recent past, the design of high performance multimedia servers, operating systems, file systems, and specialized disk scheduling techniques for QOS guaranteed multimedia retrieval has been widely researched. Due to space limitations, we do not exhaustively cover all of the related work in these areas. Instead, in the following, we try to strike a balance between recent active projects and research widely cited in the literature. The idea of minimizing data copies to achieve higher performance is well known and was explored in early operating systems such as Tenex [18] and Accent [97]. The Container Shipping system [90], the DASH IPC [115], and fbufs [47] have addressed the problem of minimizing physical data movement across protection domains in an OS by employing virtual memory re-mapping techniques. However, none of these projects reports a design for zero-copy I/O between the disk and network subsystems. A more recent paper by Brustoloni et al. [19] proposes new copy avoidance techniques, called emulated share and emulated copy, which do not require any changes to the I/O API, as required by some of the above mentioned techniques (including ours). It conclusively demonstrates the advantages of copy, data passing, and scheduling avoidance, using a NetBSD UNIX OS enhanced with an implementation of the Genie I/O system. In this system, an application can specify, in a single Genie call, invocations of single or multiple I/O modules (such as drivers, protocol stacks, or file systems). Also, an application can request multiple invocations in a call to be processed in one of many ways: sequential, parallel, periodic, or selective. We believe that the Genie framework can support a zero-copy data path between a file system and a network protocol stack; however, no such design has been reported to date.
The work of Fall et al. [48, 49] on providing in-kernel data paths has goals very similar to ours, namely minimizing data copies and supporting asynchronous and concurrent I/O operations to improve I/O throughput. They have designed and implemented a mechanism called splice in the Ultrix 4.2 operating system to meet these goals. Implemented as a system call, the splice() mechanism arranges within the kernel for pre-specified amounts of data to be moved from a source descriptor to a sink descriptor without user program intervention. However, this mechanism has several drawbacks. First, the current implementation supports splices between two file descriptors, two socket descriptors, or a socket descriptor and a frame buffer. It does not support a splice between a socket and a file descriptor, which would be essential for the majority of networked multimedia applications. In fact, such splices cannot be supported due to the lack of an mmbuf-like buffering system
that can support a zero-copy data path between the storage and network subsystems. Also, unlike our stream API, the splice mechanism does not provide fine grain control over the data flow between spliced descriptors and thus makes efficient application level flow control difficult. On the other hand, the splice semantics can be easily emulated with our stream API. Moreover, the present splice implementation is not available for the NetBSD operating system and hence cannot be easily adapted to our needs. A more recent and on-going research effort at the Distributed Multimedia Lab at UT Austin aims to build an integrated file system called Symphony [102]. Unlike approaches which use a software integration layer to create the abstraction of a single homogeneous file system out of multiple file systems geared to different data types, Symphony handles multiple data types in a physically integrated file system. The Symphony system supports a QOS aware disk scheduler, a storage manager with data type specific placement policies, a fault tolerance layer, and a two level meta information structure. It also supports admission control, server-push and client-pull service models, and data type specific caching. The current implementation runs as a single multi-threaded process in user space and accesses the disks as raw devices. Some of the key similarities and differences between our work and this project are as follows: like Symphony, our work also differentiates disk retrievals into multiple priority classes and provides hooks for implementing suitable disk scheduling policies, such as SCAN-EDF, CSCAN, the Symphony disk scheduler, or Grouped Sweep Scheduling [96, 102, 125], and associated admission control algorithms. We also store all multimedia information in a single file system – the FFS file system – and thus follow the integrated file server approach advocated in Symphony.
We concede that the FFS data placement policies are non-optimal for multimedia data, and we can leverage the placement policies investigated in Symphony. Like Symphony, our meta information is two level: the frame level meta information and the traditional UNIX inode information. However, in our design, the frame level information is completely independent of the inode information for the data file and can potentially be stored on a different storage system and/or file system [28]. Such separation of meta-data and data has been advocated in the literature to decouple high bandwidth data accesses from low bandwidth yet time-critical accesses to meta information. Also, the frame level information is specific to the data compression standards and hence, if kept in user space, can be changed much more easily. Unlike Symphony, we follow the design principle advocated in [122] and keep our file system in the kernel. We support an efficient zero-copy data path for network destined storage retrievals; we believe that service models such as client pull and server push are best implemented in user space as a part of the MOD server implementation. Our work hopes to leverage standard
fault tolerance techniques, as well as specialized techniques such as the ones used in Symphony, to support fault coverage. Also, note that Symphony has been developed for the Solaris operating system and exploits the multi-threaded nature of that kernel. Our work is entirely based on the BSD class of operating systems (NetBSD, FreeBSD, OpenBSD, 4.4 BSD), which do not support kernel level threads. Thadani et al. [107] report a zero-copy framework developed and implemented for the Solaris UNIX operating system. Their work extends the idea of fast buffers (fbufs) in [47] and provides a new UNIX API for the explicit exchange of fbufs when performing I/O. Implemented as a loadable kernel module, these Solaris enhancements do support zero-copy data transfers between disk and network. However, this work does not provide guaranteed access to storage devices. The work by Yau et al. carried out in the context of the Solaris OS [123] has objectives similar to ours. The two relevant ideas they report are user managed I/O efficient buffers and the Direct Media Streaming framework. The I/O efficient buffers are buffers that are co-mapped into the user and kernel address spaces and are managed by the user application. Such co-mapping minimizes data copying and is used by the Direct Media Streaming framework to achieve fast data transfers from media devices (such as a video board) to the network interface. However, this work does not support a zero-copy data path between the disk and the network, and also does not concern itself with QOS guarantees from the storage system.
5.13 Summary

In this chapter, we analyzed the limitations of the existing 4.4 BSD UNIX operating system in supporting MOD applications. We presented the design of a new mmbuf buffering system and an enhanced SCSI driver with support for fair queuing. We also presented experimental results for our enhanced system. Specifically, we showed that (1) the mmbuf system and the stream API result in up to a 40 % improvement in data throughput from the disk device to the network interface, (2) the DRR fair queuing in the SCSI system provides good bandwidth guarantees, and (3) user level applications can obtain guaranteed access to CPU and storage resources by employing RTUs to access the new OS enhancements. Clearly, these measurements indicate that our OS enhancements provide QOS guarantees and significant improvements in throughput on the data path between the disks and the network interface. To summarize, the research contributions described in this chapter, combined with new CPU scheduling mechanisms such as RTUs [52], make 4.4 BSD UNIX a strong candidate for a true multimedia operating system.
Chapter 6

Design of a High Performance MOD Server

OS mechanisms that support an efficient zero-copy disk-to-network data path and application level guaranteed access to CPU and storage resources are critical to building high performance MOD servers. In the previous chapter, we described our extensions to the 4.4 BSD UNIX OS that provide such mechanisms. In this chapter, we describe the design and performance evaluation of a single node MOD server, snmod, that employs these OS enhancements. We first describe the object oriented design of the playback server and then present performance results comparing it with the server described in Chapter 4. We also discuss the extensibility and limitations of this prototype. For the rest of this dissertation, we refer to this playback server as the Single Node MOD (SNMOD) server.
6.1 Design of the MOD Server

The SNMOD server we describe in this section provides a fully interactive service which clients access using the GUI application (Figure 4.15) described in Chapter 4. This application, when activated by the browser as a helper application, processes the session description file supplied by the web server and contacts the SNMOD server described therein. It uses the streaming protocol described earlier in Chapter 4 to exchange session setup and control commands. Also, this service prototype implements the MOD server and the web server as two separate entities that receive client requests on two different TCP ports. This approach differs from the integrated web-mod server in Chapter 4, wherein the web server and the streaming server run as one entity. In the new stand-alone
server approach, the web server acts as a session directory and provides the session description files in response to standard HTTP GET requests. This design is similar to the one advocated in the Real Time Streaming Protocol (RTSP) [101]. Figure 6.1 illustrates the design of our single node SNMOD server. We followed an object oriented approach to make our design modular, flexible, and extensible. When the server initializes, it reads a configuration file and sets up three processes: a MOD server, a network signaling server (NSS), and a stream open server.
Figure 6.1: Design of the single node MOD server

The main software objects in our design are described below.
Client interface: The clientInterface object implements the streaming protocol described in Section 4.4.1. It periodically polls the well advertised server port for new playback connection requests from clients and invokes the appropriate session manager functions to create a session. It also polls the active connections to
receive playout control commands and forwards them to the session manager object for the appropriate state changes.
Network Signaling Server (NSS): The Network Signaling Server (NSS) object abstracts and implements network signaling operations such as connection setup, tear down, and bandwidth renegotiation. It uses a signaling stack, such as ATM UNI, on the server to accomplish these signaling operations.

Session manager: The session manager is the central object in our server. It manages the session and streamClass objects, which are described below.

– Session objects: Each active client session at the server holds several server resources, such as network connections, mmbuf buffers, and various state table slots. Each session consists of multiple multimedia streams. For example, a movie or lecture can contain a video and an audio stream. A session object captures the resource usage and other state variables, such as the playback location of the component streams and the status of the interactive operations associated with a session. At initialization time, the session manager creates a session table with a pre-specified number of session objects. For every new session request, a session object is allocated from this table and marked in use. A session consists of multiple streams that belong to the same or different stream classes. Each session object stores a sessionId, a session name, a classId table, a streamId table, and playback state such as the playback speed. For the ith stream in a session, j = classId[i] records the id of the class to which the stream belongs, whereas the entry k = streamId[j] records an index into the stream state table maintained by the jth stream class. Thus, the 4-tuple (sessionId, i, j, k) completely characterizes an active session. The state variables speed, loop flag, and slowFactor characterize the current playback state of the session. The session object also stores information about the session description file, the client that owns the session, and per stream information such as the network connection identifier and the meta and data file descriptors.
– Streamclass object: In our design, a stream is characterized by its periodic QOS requirement. For example, audio/video streams that require 30 fps playback are grouped in one stream class, whereas streams that require 24 fps playback are grouped into another class. We abstract each stream class using a streamclass
object. When the server is initialized, the configuration file describes the number of stream classes that need to be configured. Each stream class is defined by a CLASS directive with three parameters – an identifier (id), a period, and the maximum number of streams that can be handled by the server for this class. For example, a stream class that manages 30 fps video/audio streams is defined by a class entry CLASS 0 33 256, which instructs the server to initialize a stream class object with id 0, a period of 33 msec, and a maximum of 256 streams. Each stream in a streamclass object has state associated with it, such as the type (streamType), a meta data object (meta), a table describing the state of kernel level buffers (streamStateTable), the last fetched frame, and the read (qfetch) and send (qsnd) pointers used for the prefetch and send activity. Each stream also stores a back-pointer (sessionId) to the session to which it belongs. The meta object encapsulated by a stream defines the functions and the format of the meta information used by the server to read the stream data. It stores the file descriptor corresponding to the stream data file. Using the current state of the session to which the stream belongs, the getMeta() method of this object extracts the meta information of a requested frame of the stream.
Stream Open Server (SOS): This object, run in a process separate from the main MOD server, opens all stream files in a session to ensure that the corresponding operations in the main MOD server process are completely non-blocking. In BSD UNIX, the file open operation is a blocking, or synchronous, operation, which means that until the open() system call completes, the process invoking the operation sleeps and thus blocks any other activity in its context. This limitation of BSD UNIX necessitates the Stream Open Server, which ensures that the stream class RTUs in the main MOD process do not block when a blocking open() or streamopen() call is performed.
6.1.1 Control Flow to Set Up a New Session

The MOD server performs three control path tasks in an endless loop: (1) check for new client connection requests (ProcessNew), (2) process control commands for the active sessions (ProcessActive), and (3) process messages from the Stream Open Server
for sessions that are created but have not completed initialization (ProcessStreamOpen). In the following, we illustrate the control operations to set up a new session. The clientInterface object detects a new session request in the ProcessNew step and in response requests the session manager to set up a new session object. If all the session objects are in use, an error message is sent to the client and the newly accepted connection is closed. If a new session is successfully allocated, it is assigned a unique session identifier which the server functions use to access the session object for the required processing. At this stage, the session is created but uninitialized. The ProcessActive processing step detects an OPEN SESSION request on this newly created session and initiates a set of steps to initialize it. These steps, shown in Figure 6.2, are described below:
Figure 6.2: Function trace for session open
1. The clientInterface receives the OPEN SESSION text command from the client and parses it to extract the name of the session description file, the streams selected for playback, and the network connection information, such as the mode of the network connections and the VCI information, if any, to be used for streaming data to the client MMX.
2. The session manager invokes the local admission control procedure to decide if the new session, with the requested number of multimedia streams, can be admitted. If the admission control rejects the new session, the allocated session is freed and the client is notified. If the requested mode of network signaling is SVC or PVC, the session manager sends a message to the Network Signaling Server (NSS).

3. The NSS object performs the appropriate network signaling functions to request ATM connections with the required QOS and obtains the connection identifiers – VPI and VCI – for the requested number of streams. If this operation fails, the session is deallocated and the client is notified of the failure.

4. The session manager then initializes the session object.

5. It then contacts the Stream Open Server (SOS) to open the meta and data files for all requested data streams.

6. The ProcessStreamOpen step in the processing loop of the server detects a response from the SOS server and invokes the session manager to process this response and update the session. If the open operations succeed, the session manager parses the session description file and invokes the streamclass object functions to set up the streams using the stream API described in Chapter 5. It also sets up the NATM sockets for data transmission on the ATM connections obtained by the NSS object and performs the data prefetch for each stream. Upon successful completion of the prefetch, a session setup success message is sent to the client, indicating successful setup of data streaming.

In Section 6.3 we characterize the latency incurred in completing the above steps to set up a new session.
6.1.2 Data Path Architecture for the SNMOD Server

Figure 6.3 illustrates the data path architecture, based on the mmbuf zero-copy disk-to-network data path, used to stream multimedia data for active streams. Each streamclass object in this architecture manages the state of N streams. At setup time, each stream is allocated a ring of buffers, each of which is an mmbuf chain. The size of an mmbuf chain depends on the size of the multimedia frame it stores. For example, in the case of a variable bit rate stream such as MJPEG video, each video frame is of a different size and therefore the mmbuf chains will vary in size over time. On the other hand, for
a constant bit rate audio or video stream, every frame, and therefore every mmbuf chain, is of the same size. A buffer element (mmbuf chain) in the ring buffer can be in one of four states – Empty, Reading, Full, Sending. Each stream maintains two pointers – a read pointer (qfetch) and a send pointer (qsnd) – in the ring buffer to co-ordinate its data read and send activities.
S1
S2
S3
SN-1
STREAM N-1
STREAM 0 qfetch
qfetch
qsnd
F
SND
RD
E
0
1
2
3
MMBUF chains
qsnd
F
SND
RD
E
0
1
2
3
MMBUF chains
Figure 6.3: Streaming architecture Each stream class contains an active RTU which handles the data fetch and send for all streams in that class. Every periodic invocation of the RTU executes a handler function that performs following three tasks: Obtain state of active streams: Using a single stream poll() (super) system call, it first polls the state of buffer chains of all streams in the class. Schedule reads: For each stream, it follows the qfetch pointer to check if the eligible buffer chain is in the empty state. If so, it queries the meta object associated with the stream to obtain the size and the start offset the next media frame in the stream data file. The state of the connection defined by speed and last Fetched Frame are used
to compute the frame to be read. If these operations succeed, the chain is marked for the BUF_OP_READ operation, and the offset and size for the read are appropriately initialized. The qfetch pointer is advanced (modulo nochains) to point to the next chain on which the succeeding read is performed.
Schedule data sends: For every stream, if the buffer chain pointed to by the qsnd pointer is in the BUF_FULL state, and thus eligible for a send operation, the chain is marked with a BUF_OP_SEND operation. The rate at which the buffer needs to be paced by the ATM interface is computed from the stream slow factor, the stream class period, and the size of the buffer. This rate value is marked in the state passed to the kernel.
Once the required buffer chains for all streams are initialized for the appropriate read and/or send operations, the handler invokes a single stream_rdsnd() super call to perform these operations. Any deadline misses are recorded in the running stream and session statistics that the server collects. The three variables – speed, slow factor and location – are manipulated by the session manager in response to interactive commands such as rewind, fast-forward and random search, but are used only in the periodic RTU handler.
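The per-period scheduling logic just described can be sketched as follows. This is a simplified model, not the server's C implementation: chain states and pointer arithmetic follow the description above, while the single stream_poll()/stream_rdsnd() super calls are represented by the work lists the handler returns.

```python
EMPTY, READING, FULL, SENDING = range(4)  # mmbuf chain states

class Stream:
    def __init__(self, nchains=4):
        self.chains = [EMPTY] * nchains
        self.qfetch = 0   # next chain to schedule a read into
        self.qsnd = 0     # next chain to schedule a send from

def rtu_handler(streams):
    """One periodic RTU invocation: mark reads and sends for all streams,
    then (in the real server) issue one stream_rdsnd() super call."""
    reads, sends = [], []
    for s in streams:
        # Schedule a read if the chain at qfetch is empty.
        if s.chains[s.qfetch] == EMPTY:
            s.chains[s.qfetch] = READING
            reads.append((s, s.qfetch))
            s.qfetch = (s.qfetch + 1) % len(s.chains)
        # Schedule a send if the chain at qsnd has been filled by a read.
        if s.chains[s.qsnd] == FULL:
            s.chains[s.qsnd] = SENDING
            sends.append((s, s.qsnd))
            s.qsnd = (s.qsnd + 1) % len(s.chains)
    return reads, sends
```

Because reads and sends chase each other around the same ring, a send can only be scheduled on a chain whose read has already completed, which is exactly the condition that defines a deadline miss when it fails to hold.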
6.2 Discussion

In this section we discuss the extensibility of our MOD server prototype, as well as its limitations and their solutions.
6.2.1 Extensibility

In the following, we describe various dimensions along which the existing prototype MOD server can be extended.
Data types: The current MOD server prototype supports only playback of compressed MJPEG video in a proprietary data format. However, since the basic meta information needed for data streaming – namely the size of a frame and its start offset in the file – is independent of the video format, new video formats can be easily supported. Also, new meta data formats can be implemented by modifying the meta object.
Transport and application level streaming protocols: The present MOD server supports data streaming over the NATM transport layer. However, since the architecture of the
server and the data path operations are independent of the data transport, new transport protocols such as RTP and CMTP can be supported with minor modifications. Also, by modifying the clientInterface object, new application level streaming protocols, different from our protocol (Section 4.4.1), can be implemented.
Admission control: Currently, the playback server supports static admission control for every storage unit in the server. However, sophisticated admission control algorithms that use a priori knowledge of the statistics of multimedia documents have been proposed in the literature [30, 31, 32, 120]. In our server, the recording process compiles statistics when the multimedia documents are created. The AdmissionManager object in the playback server can use this statistics information, together with knowledge of system resource usage, to implement sophisticated, aggressive admission control algorithms.
Billing, access validation and logging: In a commercial MOD server, access validation, accounting and billing are crucial tasks. An object that implements these functions can easily be integrated with the clientInterface and AdmissionManager objects.
Service models: Our present prototype supports only a fully interactive playback service. However, it can be easily extended to support simpler as well as more complex MOD services. For example, a new service that supports interactive orchestrated presentations with composite document playback has been implemented using our MOD server as a base server [114]. Other web based MOD services, such as simple near-video-on-demand, periodic broadcasting with interval caching, and content based multimedia indexing, can be implemented by extending the control path of our prototype server while leaving the data path essentially unchanged.
6.2.2 Limitations of the Prototype Playback Server

The limitations of the client and server hardware and of the enhanced BSD UNIX OS result in deficiencies in our prototype playback server. In the following, we discuss these limitations and mention solutions that can rectify them.
Redundant I/O operations: In our prototype server, we employ the 4.4 BSD FFS file system to store video/audio data and meta data. The disk I/O operations generated by read/write calls on an FFS file system are always in multiples of the file system block,
typically 8 KB in size. Clearly, reads/writes that are less than 8 KB in size are often satisfied from blocks residing in the file system buffer cache. However, in the case of the new mmbuf data path, which bypasses the buffer cache, such small I/O operations result in redundant disk operations and wasted disk bandwidth. For example, sequential 4 KB stream API reads on an FFS file system with an 8 KB block size will generate twice as many disk I/O operations. However, such scenarios do not represent the common case, wherein the stream API is used for large reads for high bandwidth streams for which caching is wasteful. Nevertheless, a simple solution to this problem is to cache the most recently used (MRU) disk block for every file descriptor opened using the stream open interface. Stream read operations can consult this cache to eliminate redundant I/O; such caching can be best effort.
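The proposed per-descriptor MRU block cache can be sketched in a few lines. This is a sketch of the idea only, not the kernel implementation: the class and names are illustrative, the disk callback stands in for a real block read, and reads that cross a block boundary are omitted for brevity.

```python
FS_BLOCK = 8192  # FFS file system block size in bytes

class MRUBlockCache:
    """Per-descriptor cache of the most recently read file system block,
    absorbing sub-block stream reads that bypass the buffer cache."""
    def __init__(self):
        self.block_no = None
        self.data = None
        self.disk_reads = 0

    def read(self, offset, size, disk):
        assert size <= FS_BLOCK  # boundary-crossing reads not modeled
        blk = offset // FS_BLOCK
        if blk != self.block_no:      # miss: one real disk I/O
            self.block_no = blk
            self.data = disk(blk)
            self.disk_reads += 1
        lo = offset % FS_BLOCK
        return self.data[lo:lo + size]
```

With this cache, two sequential 4 KB reads in the same 8 KB block cost one disk operation instead of two, which is precisely the redundancy described above.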
Figure 6.4: Transmit portion of the ENI ATM interface (a pacer multiplexes 8 transmit channels, each carrying one or more VCIs, onto the link)
Inefficient ENI ATM interface: The ENI OC-3 ATM interface used in our prototype suffers from several limitations. The implementation of rate pacing in the ENI ATM interface, illustrated in Figure 6.4, uses 8 different transmit channels which are multiplexed at the cell level onto an OC-3 link. The device driver can statically or dynamically assign the active VCs to these transmit channels. Clearly, if there are more than 8 active ATM VCs, multiple VCs will be assigned to each transmit channel. However, such channel sharing results in packet level multiplexing of VCs instead of the desired cell level multiplexing, and makes it difficult to meet the delay and rate guarantees of the connections. Consider an example where two 8 Mbps video connections to an MMX share a transmit channel and require that their video frames be paced out every 33 msec. Since the pacing is performed at the packet level, if the channel rate is set to 8 Mbps, these frames can be transmitted only once every 66 msec. The lack of a playout buffer in the MMX requires that frames be sent out at a regular rate – once every 33.33 msec. This limitation eliminates the possibility
of setting the channel rate to twice the required rate (i.e., to 16 Mbps) to meet the delay guarantees of the connections. Therefore, in order to obtain delay and bandwidth guarantees from the ENI rate pacing hardware, only one connection must use a transmit channel at any given time. This drawback of the ENI pacing hardware limits the number of active video or audio connections in our prototype to 8. Another drawback of the ENI transmit hardware is the coarse granularity of available rates. The ENI rate pacing uses an 8-bit descriptor to characterize the rate of an ATM connection and therefore supports only 256 distinct rates. This lack of fine rate granularity results in data being sent faster or slower than the desired rate. Such rate variation can normally be smoothed using a client side playout buffer. However, the MMX lacks such a buffer, and therefore our playback server uses an adaptive mechanism implemented in software to dynamically adjust the connection rate to minimize rate overflows or underflows.
Audio-video synchronization: In our current prototype, the audio and video bit streams are sent over separate network connections and thus suffer separate network delay and jitter. The coarse granularity of rate pacing at the sending end can result in rate mismatches between these connections. Also, due to its lack of buffers, the MMX essentially plays the data as it comes off the network and performs no additional processing to ensure audio-video synchronization before the data are consumed. The lack of timing information in the audio streams makes it difficult to correlate them with the corresponding video and to detect losses. Thus, in the event of data loss the synchronization is lost irreparably. Clearly, these limitations make it difficult to ensure long term synchronization between the audio and video streams in this open loop system.
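The effect of the 8-bit rate descriptor can be illustrated with a small model. The ENI's actual rate table is not specified here, so this sketch makes the simplifying assumption of 256 evenly spaced rates up to the OC-3 line rate; the point is that a target rate such as 8 Mbps generally falls between two representable rates, producing the drift the adaptive mechanism must correct.

```python
LINK_MBPS = 155.52  # OC-3 line rate

def representable_rates(bits=8):
    # Assumed linear rate table: 2^bits evenly spaced rates up to line rate.
    n = 1 << bits
    return [LINK_MBPS * i / n for i in range(1, n + 1)]

def nearest_rate(target_mbps):
    # Best rate the pacing hardware can be programmed to for this target.
    return min(representable_rates(), key=lambda r: abs(r - target_mbps))
```

With this table the rate step is about 0.61 Mbps, so an 8 Mbps connection is paced roughly 0.1 Mbps slow, an error the server must absorb in software.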
6.3 Performance Evaluation

In this section we describe experiments that demonstrate the performance improvements in the new MOD server. Specifically, we show improved CPU availability, increased throughput, and better streaming performance. In this section, we refer to the MOD server prototype described in Chapter 4, Section 4.4 as httpd+ and the prototype discussed in Section 6.1 as the snmod server.
6.3.1 Improved CPU Availability

From the design described in Section 6.1, we can see that during every RTU invocation, the snmod MOD server uses 2 system calls – namely stream_state and stream_rdsnd – for all N active streams. Thus, it requires only 60T system calls for a stream playout of T seconds, in contrast to the 90NT calls required for the first generation prototype discussed in Chapter 5. This suggests a factor of 1.5N improvement in CPU availability. In the following experiment we demonstrate that this is indeed the case.

Figure 6.5: Improved CPU availability with the second generation server

In this experiment, we used a CPU intensive primes program that computes prime numbers as the candidate program for which we measure CPU utilization. We measured the amount of time it takes to compute the prime numbers in the range 1..100000 (shown as P10 in the graphs) and in the range 1..200000 (shown as P20 in the graphs). We first ran the httpd+ server and varied the number of active playback sessions from 0 to 8. In the absence of playback sessions and other CPU load, the primes program gets complete access to the CPU. However, as the number of active sessions is increased, the number of active RTUs that rob the CPU from the primes program increases, and we notice a drop in the CPU share of the primes program. Figure 6.5 (P10 (httpd+), P20 (httpd+)) illustrates the amount of time primes accessed the CPU. We can see that as the number of active RTUs is increased, the CPU availability for primes drops and the completion time increases dramatically. We then repeated this experiment with the snmod server. The CPU share observed for the primes program is illustrated in Figure 6.5 (P10 (snmod), P20 (snmod)). We can clearly see that with the snmod server, as the number of sessions is increased, the amount of time
taken by the test program increases far less dramatically than in the previous case and is quite low compared to the httpd+ case. This conclusively demonstrates improved CPU availability with the new SNMOD server. We also noticed excellent interactive performance for other routine tasks such as telnet sessions, web browsing, and web serving. In contrast, with the httpd+ server the interactive performance deteriorates rapidly as the number of active sessions increases.
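The system-call arithmetic in Section 6.3.1 can be checked with a short model. It assumes a 30 Hz RTU period (the 33 msec period used throughout), two super calls per period for snmod, and, as implied by the 90NT figure, the equivalent of three per-stream calls per period for the first generation server.

```python
RTU_HZ = 30  # RTU invocations per second (33.33 msec period)

def snmod_calls(T, N):
    # Two super calls (stream_state, stream_rdsnd) per RTU period,
    # independent of the number of active streams N.
    return 2 * RTU_HZ * T

def httpd_plus_calls(T, N):
    # Assumed: three per-stream calls per period in the first
    # generation prototype, giving the 90NT figure quoted above.
    return 3 * N * RTU_HZ * T
```

The ratio of the two counts is 1.5N, which is the improvement factor claimed in the text.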
6.3.2 Improved Streaming Performance

In this section, we measure a set of performance metrics – the aggregate throughput, deadline miss probability, average RTU handler processing time, and average session setup latency – as the load on the video server is increased by increasing the number of active connections. The setup for these measurements is illustrated in Figure 6.6.

Aggregate throughput and deadline miss probability

Figure 6.6: Experimental setup (a multi-CCD MOD server on a 200 MHz Pentium Pro with three 2-disk CCD storage units S0, S1, S2)

We measured the aggregate server throughput and the maximum fraction of deadline misses among all active sessions as the number of active connections is increased. A deadline miss occurs when the server attempts to send an audio/video frame and finds the required buffer unavailable due to an incomplete disk read. In the event of a deadline miss, the server can process the buffer in two ways: (1) it can advance to the next buffer and skip the delayed buffer; (2) it can send the delayed buffer. The first option results in a skip in the playback but keeps the frame rate constant. On the other hand, the second option results in a slower effective rate of playback. Note that with video playback, only excessive
deadline misses (of the order of 1 in 100) result in perceptible degradation in playback quality. However, with audio even a few deadline misses can cause audible artifacts. We first activated all the connections from the same CCD unit (such as S0 in Figure 6.6) with F = 4 and F = 8 video frames worth of prefetch buffer. We refer to this case as S-CCD. In the second set of measurements, we activated the video connections from different storage units (such as S0, S1, S2 in Figure 6.6) and set the prefetch buffer to F = 8 frames. We refer to this case as D-CCD. In both sets of measurements, each active connection was an MJPEG variable bit-rate video with an average bandwidth of about 8–10 Mbps.

Figure 6.7: Deadline miss probability vs. load

As the number of active connections is increased, the load on the storage and network subsystems increases, and the probability of a deadline miss increases. A larger prefetch buffer insulates the data sends from variations in disk bandwidth and minimizes the deadline miss probability, thus improving the per session and aggregate server throughput. The plot of deadline miss probability vs. number of connections shown in Figure 6.7 and the plot of aggregate server throughput vs. number of connections shown in Figure 6.8 illustrate these observations. Specifically, we can see that in the S-CCD case, with 4 active connections, increasing the prefetch buffer from F = 4 to F = 8 reduced the worst case frame loss rate from 1-in-300 frames to zero. Also, the maximum aggregate throughput of the server increased 20%, from 35 to 42 Mbps, and the maximum deadline miss probability decreased by 27%. This suggests that a single 2-disk CCD can comfortably support up to four 10 Mbps MJPEG video or 25 MPEG-1 connections with a
prefetch level of 8 frames. Note that 8 frames at 10 Mbps require approximately 340 KB of buffer.

Figure 6.8: Throughput performance of the new server

In the D-CCD case we observed far better performance as the number of connections increased. This is clearly not a surprise, as the increased parallelism in storage access provides higher storage bandwidth. With our example set of streams, we observed that the maximum aggregate throughput increased 50%, from 42 Mbps in the S-CCD case to 63 Mbps, and the deadline miss probability improved by 99%. Also, from the throughput plot we can see that when we activated the maximum number of connections, the storage systems at the server still had spare capacity left. This clearly indicates that we can easily get close to 120 Mbps with 3 CCDs. However, one drawback of using multiple smaller CCDs is that it limits concurrency and requires document replication. This observation, coupled with our performance measurements, suggests the use of faster hardware disk arrays such as RAID-3 [34]. In all our experiments we also observed that the use of RTUs renders the throughput and deadline miss performance insensitive to the background CPU load.

RTU stay time

In our prototype server we also collected statistics for the amount of time spent in the RTU handler as the load on the server is increased. This time, called the RTU stay time, includes the time spent executing the user level handler code and any kernel level
code executed to field the storage and network system interrupts. As the load on the system is increased, the storage and network load and interrupts increase proportionately, and therefore the stay time increases.

Figure 6.9: Average time spent in the RTU

Figure 6.9 plots the average and maximum stay time in the D-CCD scenario. We can see that when the system is loaded to 45% of capacity, the average stay time is 1.42 msec and the maximum stay time is 4.52 msec. Clearly, even a conservative factor-of-3 increase under full load conditions would keep the stay time well under the RTU period of 33 msec and maintain RTU schedulability.

Session setup and prefetch latency

We measured the amount of time spent by the server to set up a new session as the number of active sessions is increased. This latency consists of two components: (1) session setup latency – the time from the instant the session open request is received until a valid session entry is created in the session state tables and all required data and meta files are opened; (2) session prefetch latency – the amount of time spent pre-fetching data (audio/video frames) to set up the read-send pipeline. The setup latency should be independent of the load, whereas the prefetch latency will increase with increased load on the server. We measured these latencies for a fixed session activated from a server set up with F = 8 frames worth of pre-fetching per video session. Figure 6.10 plots the setup, prefetch and total latency for this representative session. We can see that the session setup latency (17 msec) is insensitive to the load and is a small fraction of
the total latency. Also, the prefetch latency increased by 35 msec per connection when the disks were operating in the non-saturated mode, but increased dramatically (70 msec per connection) when the disks became overloaded.

Figure 6.10: Average latency for session setup operations

Our performance results clearly indicate that even with an ordinary 200 MHz PC with as many as six disks, using our enhanced 4.4 BSD OS and MOD server software we can effectively build a MOD server capable of 120 Mbps storage and network throughput. We believe that with a faster 400 MHz PC with a 64-bit PCI I/O interconnect, larger hardware disk arrays, and two ATM interfaces, we can support in excess of 200 Mbps throughput without any modifications to our software systems.
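The prefetch buffer figure quoted in Section 6.3.2 (roughly 340 KB for 8 frames at 10 Mbps) can be checked with a small calculation. It assumes a 30 frames/sec stream, consistent with the 33 msec RTU period used throughout the chapter.

```python
def prefetch_buffer_bytes(rate_mbps, fps, frames):
    """Bytes buffered for `frames` frames of a stream at `rate_mbps`
    and `fps` frames per second."""
    bytes_per_frame = rate_mbps * 1e6 / fps / 8  # bits -> bytes
    return frames * bytes_per_frame
```

For a 10 Mbps stream at 30 fps, each frame averages about 42 KB, so an 8-frame prefetch ring needs roughly a third of a megabyte per connection, matching the figure in the text.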
6.4 Summary

In this chapter, we described the detailed object oriented design and prototyping of a single node MOD playback server that uses the innovative OS extensions described in Chapter 5. This playback server supports all interactive operations, such as fast-forward, rewind, slow-play, slow-rewind, pause/resume and random search, with sub-second latency. We also presented performance evaluations that conclusively demonstrate improved performance. Specifically, we showed higher CPU availability and better QOS guarantees compared to the prototype in Chapter 4. To summarize, this chapter conclusively demonstrated that high performance MOD playback servers can be built using commodity PC hardware running a suitably enhanced general purpose OS such as 4.4 BSD UNIX.
Chapter 7

Towards Highly Scalable Servers and Services

The recording and playback services and servers described in Chapters 4 and 6 can support a few tens of clients. The performance of these servers can be enhanced by using faster I/O interconnects, storage systems and network interfaces. However, even with such enhancements these servers cannot be classified as large scale servers, which must support 100s to 1000s of concurrent clients. For such large scale servers, it is desirable that all concurrent clients be able to independently access any document. The multimedia documents must be stored on such servers in a way that minimizes document replication and maximizes the number of clients that can access a document from a single copy. Minimizing document replication reduces storage cost, which is a dominant cost of a storage server. Also, the per connection (client) cost of the server must be kept constant to achieve a linear scaling in performance. Meeting these goals of minimum storage replication, high concurrency, and high parallelism is crucial to building large scale servers. This chapter describes the design and prototype of our distributed storage server and services architecture, which aims to satisfy these goals. The rest of this chapter is organized as follows: we first describe an innovative distributed storage cluster architecture for scalable MOD servers. We then describe distributed data layout techniques to support a large number of independent, guaranteed, concurrent accesses to any data. We also present a distributed scheduling scheme that guarantees load-balanced operation of the storage cluster and implements interactive operations with low latency.
7.1 Massively-parallel And Real-time Storage Architecture
Figure 7.1: Massively-parallel And Real-time Storage (MARS) system (storage nodes and a central manager on a high-speed ATM-based interconnect, attached to a high speed network)

This section proposes the idea of Massively-parallel And Real-time Storage (MARS).
7.1.1 Basic Idea

The MARS architecture, shown in Figure 7.1, consists of a set of independent storage nodes that are connected together by a fast packet based interconnect. The terms “massively-parallel” and “real-time” signify that the system allows real-time retrieval of data streams from a large storage system in a parallel fashion. The MARS server consists of the following four main modules:
Server interconnect: The server interconnect can be a packet switched bus, a ring, or even a multicast switch based on a general purpose networking technology. The MARS server interfaces directly to a high speed network such as an ATM network. The use of a fast interconnect that uses the same technology as the external network allows transparent interfacing of storage devices to the network. In our studies, we assume that the interconnect is based on ATM technology and that the server interfaces to one or more ports of an ATM network switch. In the near future such interconnects will continue to scale in bandwidth. For example, inexpensive 155 Mbps per port
OC-3 ATM switches are already commercially available, and faster 1.2 and 2.4 Gbps per port switches are being built. Other forms of interconnect, such as a desk area or system area network constructed out of general purpose ATM host-interface chips such as the APIC [45, 46], are also becoming increasingly attractive. Thus, as network speeds scale and faster interconnects become available, the server bandwidth can be scaled transparently. Note that interconnects based on alternate high-speed networking technologies, such as gigabit IP routers, can also be used as server interconnects. However, we do not consider such interconnects in our research.
Storage node: A storage node manages a large amount of storage and provides file system, scheduling, admission control and compute support. It can be constructed out of an embedded system or a general purpose PC using several off-the-shelf components such as CPUs, SCSI storage devices, and ATM network interfaces. Such an approach guarantees cost-effectiveness and ensures that the storage node and server will scale with improvements in CPU, memory, storage and networking technologies.
Central manager: A central manager serves as a front end for the distributed storage server constructed out of several storage nodes. In addition to critical functions such as admission control, resource management, and storage node control, it may provide additional functions such as access control, billing, accounting, and database query/search services. Depending on the range of functions supported, a central manager may be an ordinary PC or a multiprocessor workstation running a general purpose OS.
Distributed control protocol: The central manager and the storage nodes maintain a master-slave relationship using a distributed control protocol. The central manager acts as the master and uses this protocol to co-ordinate data path activities at the slave storage nodes.
Such co-ordination is necessary when the high bandwidth multimedia data is physically distributed among the storage nodes. The MARS server is a stateful server, as it stores state information for every active stream. It follows a connection oriented approach with resource reservations to provide the QOS guarantees required by the media streams. Before a client can access a multimedia stream, the MARS server checks if it can reserve the resources required at one or more of the component nodes to guarantee QOS for the new connection without affecting any of the existing active
connections, and accordingly accepts or rejects the connection. Typically, the central manager receives the new connection requests from the remote clients. It uses an admission control procedure, which runs either entirely locally or in consultation with a node level admission procedure, to admit or reject the new request. Once admitted, the MARS server reserves resources – network bandwidth, storage bandwidth, compute support, and CPU and I/O scheduling – at each node and provides statistical or deterministic guarantees for the stream retrieval. Each storage node provides a large amount of storage in one or more forms, such as large high-performance magnetic disks, large disk arrays, high capacity fast optical storage, or tape libraries. The storage system collectively formed by all storage nodes may be homogeneous or hierarchical. If all storage nodes provide identical storage, such as disk arrays, the system is homogeneous. On the other hand, a hierarchical storage system can be constructed by assigning storage of different types to subsets of storage nodes. A client can be served the required data either directly from the particular storage node (level) or by a staging mechanism, in which data is first moved to a node with faster storage (higher level) and then served from that device. For example, the nodes that use optical storage and robotically controlled tapes can be considered as off-line or near-line tertiary storage. When a client attempts to access the data stored on these devices, it is first cached on the magnetic disks at the other nodes and then served at the full stream rate. Thus, the collective storage in the system can exceed a few tens of terabytes and still allow a large number of concurrent clients to access documents at the standard stream rate. In order to ensure a large throughput from the entire system, the MARS server physically distributes the data for a stream among a subset of the storage nodes.
The meta-data associated with the data is also distributed among the various storage nodes and the central manager. Such data and meta-data distribution (striping) depends on various factors, such as the length of the document, the demand in terms of the number of concurrent clients, the degree of interactive behavior, and the design of the data layout scheme. Typically, the storage manager decides this distribution at the time the document is stored on the server. The range of functionality at the storage nodes decides the complexity of the tasks performed by the central manager. For example, admission control and scheduling can be performed either entirely by the central manager or in a distributed fashion in which the storage nodes perform local tasks and cooperate with the central manager. Thus, depending on how sophisticated the storage nodes are and on the richness of the ATM interconnect, the requirements of large scale servers mentioned earlier can be met to varying extents.
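The division of admission control between the central manager and the storage nodes can be sketched as follows. This is a minimal model, not the MARS implementation: bandwidth is the only resource tracked, the class and method names are illustrative, and a new stream is assumed to be striped across every node so that it needs a reservation at each one.

```python
class StorageNode:
    def __init__(self, bw_mbps):
        self.capacity = bw_mbps   # local storage/interconnect bandwidth
        self.reserved = 0.0

    def can_admit(self, bw):
        # Node level admission check on local resources.
        return self.reserved + bw <= self.capacity

    def reserve(self, bw):
        self.reserved += bw

class CentralManager:
    def __init__(self, nodes):
        self.nodes = nodes

    def admit(self, per_node_bw):
        # A striped stream needs bandwidth at every node: admit only
        # if each node level check passes, then reserve everywhere.
        if all(n.can_admit(per_node_bw) for n in self.nodes):
            for n in self.nodes:
                n.reserve(per_node_bw)
            return True
        return False
```

In this model the central manager acts as the master described above: it consults the node level checks and commits the reservation only when every node can honor it.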
7.2 Storage Node Architecture

The main objective in the design of the storage node is to provide a high bandwidth, efficient data path between the external network and the storage devices. In addition to providing a storage facility, each storage node supports one or more of the resource management functions outlined below.
File system support: Each storage node performs typical file system functions such as data and meta-data management, meta-data and data cache management, and data buffer management. In addition, it may support advanced database functions, such as the ones proposed by several research groups [76, 98, 79], for efficient browsing and content based searching of multimedia information.
Admission control: Each storage node keeps track of the usage of local resources, such as storage and interconnect bandwidth, and performs local admission control. The resource reservation and admission algorithms used to accomplish this are similar to the ones proposed in the context of ATM networks and end systems [50, 110, 31, 32, 120].
Scheduling support: Each node completely manages real-time scheduling of local data read and write functions for each active stream. In addition, the storage node participates in a distributed scheduling scheme that allows unsynchronized storage nodes to correctly stream a striped multimedia document. Section 7.4 describes one such scheme.
Compute support: A large amount of computing power can be provided by using sophisticated media processors such as the MVP [58] at each storage node. Such processors are embedded on the path between the storage and the network, and can be used to perform media processing functions such as transcoding, speech recognition, image processing, and character recognition required by future multimedia applications.
The architecture of the storage node described here assumes storage in the form of a disk array such as a RAID. However, it can easily be extended to accommodate other forms of storage, such as a set of independent high capacity optical and magnetic disks. In order to understand the design requirements for such a storage node, we examine the data and control paths within the storage node.
High bandwidth multimedia data such as video and audio is typically pre-fetched and stored in a buffer before it is transmitted onto the network. In high bandwidth MARS servers that provide in excess of Gbps throughput, the sustained throughput in and out of the memory system at each storage node must be of the order of Gbps. For example, a VOD server that supports 5000 MPEG-2 clients must support an aggregate bandwidth of the order of 2 Gbps, and a memory system bandwidth of the same magnitude, at each node. The memory systems commonly found in commercial network file servers are constructed out of commercial 60 ns DRAMs that provide a peak bandwidth of 500 Mbps. Newer memory technologies, such as Synchronous DRAM with 64-bit data paths operated at 100 MHz, offer a peak bandwidth of about 6 Gbps. These new memory technologies, combined with the new I/O and memory control chips [1], can improve the sustained throughput in and out of the buffering system at the node. However, since memory bandwidth will continue to be a precious resource, data copying operations must be avoided. A rough estimate of the size of such a memory system is also in order. If the average bandwidth requirement of an MPEG stream is assumed to be 5 Mbps, an interconnect bandwidth of 1.2 Gbps can accommodate roughly 240 clients. If the storage node pre-fetches 1 GOP of typically 9 frames, the per connection prefetch buffer is about 0.2 MB, and the aggregate buffer requirement for 240 connections would be 48 MB. Increasing this prefetch granularity increases the required buffer size. Also, as much as twice the size of the prefetch buffers may be required to ensure smooth playout and playout control operation. In short, the data path must provide high capacity of the order of a few hundred megabytes and must sustain throughput of the order of 2 Gbps. The control path at each node is responsible for admission control as well as scheduling of pre-fetches from the disk arrays and transmissions by the NIC.
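The rough memory sizing estimate above can be reproduced with a few lines (assuming 30 frames/sec MPEG video and decimal megabytes; the function name is illustrative).

```python
def node_buffer_mb(stream_mbps, link_gbps, gop_frames, fps=30):
    """Clients one interconnect link can carry, the per-connection
    prefetch buffer (MB) for one GOP, and the aggregate buffer (MB)."""
    clients = int(link_gbps * 1000 // stream_mbps)
    frame_mb = stream_mbps / fps / 8      # MB per frame (bits -> bytes)
    per_conn = gop_frames * frame_mb
    return clients, per_conn, clients * per_conn
```

For 5 Mbps streams on a 1.2 Gbps link with a 9-frame GOP, this gives 240 clients, about 0.19 MB per connection, and roughly 45 MB in aggregate, matching the 0.2 MB and 48 MB figures quoted in the text.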
These tasks involve allocating/deallocating the data buffers and scheduling-related data structures, periodically updating scheduling information, and retrieving and managing the meta-data. Such operations will typically be implemented in operating system software at each node and will require periodic, timely access to the memory, storage, and network systems to guarantee overall correctness of the scheduling operations. Clearly, in addition to the data path throughput, this added load must be sustained by the memory system. A storage node can be constructed out of off-the-shelf PCs or using specialized embedded systems. A representative architecture of current generation PCs is illustrated in Figure 7.2. Typically, it uses a proprietary CPU-memory interconnect which is not accessible to the I/O peripheral system designer. However, the I/O interconnect is an industry standard bus called
the Peripheral Component Interconnect (PCI) bus, with a 32-bit data path operated at 33 MHz, i.e., rated at about 1 Gbps bandwidth.

Figure 7.2: A PC based storage node (CPU with cache, CPU-memory bus, PCI bridge, and a PCI I/O bus connecting an Ethernet NIC, an ATM NIC, and an HBA with disks)

The standardized 64-bit extension of this bus, now being commercialized in high-end PCs, operates at 66 MHz and provides a peak 4 Gbps throughput. A bus adaptor, commonly called a PCI bridge, connects the PCI bus to the CPU-memory bus and manages the data transfers between the peripheral devices on the PCI bus and the main memory. It acts as a memory access multiplexor which multiplexes memory accesses from the CPU and the peripherals. The SCSI subsystem in a PC typically consists of a Host Bus Adaptor (HBA) which interfaces to the SCSI bus on one end and the PCI bus on the other. Any SCSI compatible storage system, such as a disk, RAID, optical storage, or tape library, can be connected to the PC via this interface. The HBA transfers data between the storage devices and the main memory by performing appropriate SCSI and PCI bus transactions. Similar to the HBA, the host-network interface, such as an ethernet or ATM interface, connects to the PCI bus and transfers data to and from the main memory. One serious drawback of this architecture is that any data transfer between the storage system and the network causes the data to traverse the PCI I/O bus twice: first for the transfer between disk and memory, and second for the transfer between memory and the network interface. This halves the useful PCI bandwidth available for storage-to-network data transfer. An alternate architecture, illustrated in Figure 7.3, that can rectify this limitation exploits new features that are becoming available in the PCI bridge chip-set. In addition to the traditional memory accesses from PCI I/O devices and the CPU, the new bridge
chips multiplex accesses from an additional data port commonly called the Advanced Graphics Port (AGP).

Figure 7.3: Storage Node Design (CPU with cache, memory bus, PCI bridge, 6 Gbps SDRAM, 4 Gbps PCI bus, and a control ASIC implementing a Network Storage Interface (NSI) that connects a NIC and a SCSI-2 array controller with a commercial RAID)

The AGP port was conceived to allow high speed memory accesses for graphics software and hardware. A new control ASIC or FPGA that interfaces to this port multiplexes the memory accesses from the HBA and the network interface which would otherwise take place over the PCI bus. This control ASIC can provide a standard PCI interface on each of its three ports and thus allow the use of off-the-shelf PCI compatible HBAs and NICs. Such an architecture can be realized using commercially available embedded system modules [2]. We contend that current PCs are quite effective; they are inexpensive and provide a large amount of compute power and a scalable internal I/O interconnect. Also, several general purpose as well as real-time OSs have been ported to the PC platform. We believe that a PC based storage node running a public domain OS such as 4.4 BSD UNIX, enhanced with the extensions described in Chapter 5, is an attractive candidate for the MARS architecture.
7.3 MARS Storage Server Examples

In the following, we discuss three examples of a MARS server.
Figure 7.4: A prototype implementation of a MARS server (a server central manager with CPUs, MMUs and caches, main memory, and a main system bus; a daisy chain of APICs connecting storage nodes; a link interface to the high speed network; clients attached over the ATM interconnect)
MARS server with an APIC Interconnect

Figure 7.4 shows a proposed architecture of a MARS server that employs a desk area network based interconnect constructed using a host-network interface ASIC chip called the APIC (ATM Port Interconnect Controller). The APIC chip is the basic building block for a high bandwidth networked-I/O subsystem that provides a direct interface to the network for the host (workstations as well as servers) and a variety of I/O devices. In its simplest form, the APIC behaves like a 3 x 3 switch, two of whose ports can be treated as ATM ports and the remaining one as a non-ATM port. Using one of the ATM ports and a line interface, the APIC can be directly connected to the input port of an ATM switch. As shown in Figure 7.4, using the ATM ports, multiple APIC chips can be connected in a bidirectional daisy chain. Since each port is designed to operate at the full 1.2 Gbps (SONET) rate, the aggregate data rate on the interconnect is 2.4 Gbps. The non-ATM port of an APIC can be viewed as a read/write port to a data source, such as a memory or an I/O device. The APIC performs incremental AAL5 segmentation and reassembly of ATM cells through this port. Each APIC manages a set of connections by maintaining state information for each active connection in a Virtual Circuit Translation Table (VCXT). The APIC also performs single parameter (such as average bandwidth) rate control or source pacing for each
connection, which is important in scheduling multimedia streams from a storage node to the network. The central manager shown in Figure 7.4 is the resource manager responsible for managing the storage nodes and the APICs in the ATM interconnect. For every document, it decides how to distribute the data over the storage nodes and manages the associated meta-data information. It receives connection requests from remote clients and, based on the availability of resources and the QOS required, admits or rejects the requests. For every active connection, it also schedules the data reads/writes from the storage nodes by exchanging appropriate control information with them. Note that the central manager only sets up the data flow to and from the storage devices and/or the network and does not participate in the actual data movement. This ensures a high bandwidth path between the storage and the network.
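The single parameter (average rate) pacing that the APIC performs per connection can be pictured with a small model: each virtual circuit gets a fixed cell spacing of 1/rate, and the pacer always emits the cell whose deadline is earliest. This is our illustrative reading of average-rate pacing, not the APIC's actual hardware mechanism or API; the names Pacer, add_vc, and next_cell are hypothetical.

```python
import heapq

class Pacer:
    """Toy model of per-VC average-rate pacing (earliest deadline first)."""

    def __init__(self):
        self.heap = []       # (next_send_time, vc_id)
        self.interval = {}   # vc_id -> seconds between cells

    def add_vc(self, vc_id, rate_cells_per_s, now=0.0):
        self.interval[vc_id] = 1.0 / rate_cells_per_s
        heapq.heappush(self.heap, (now, vc_id))

    def next_cell(self):
        """Return (send_time, vc_id) of the next paced cell and reschedule the VC."""
        t, vc = heapq.heappop(self.heap)
        heapq.heappush(self.heap, (t + self.interval[vc], vc))
        return t, vc

p = Pacer()
p.add_vc("video", rate_cells_per_s=4)   # one cell every 0.25 s
p.add_vc("audio", rate_cells_per_s=1)   # one cell every 1.0 s
sched = [p.next_cell() for _ in range(5)]
```

The emitted schedule interleaves the two circuits in proportion to their rates, which is the property the storage node relies on when handing paced streams to the interconnect.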
MARS Super-Server

The Cluster Based Storage (CBS) architecture, a scalable extension of the APIC based prototype implementation discussed above, is illustrated in Figure 7.5. It consists of a set of independent clusters interconnected through a fast multicast ATM switch, such as [111, 112]. Each of the clusters resembles our prototype implementation architecture. The APIC at the end of the daisy chain in a cluster transparently interfaces to one port of an ATM switch. Each cluster has a local cluster manager. The central manager in this architecture manages the switch, performs the network signaling operations, and co-operates with the cluster level storage managers to perform admission control. Two types of information flow between the CBS server and the external network: the first is control information, such as client requests for multimedia streams, stream manipulation commands such as fast forward, rewind, and pause, server responses, and network signaling information. The second is the actual multimedia data for all active connections. Depending upon the implementation, the CBS architecture may reserve certain switch ports for control information and the rest for data traffic. The scalability of this architecture accrues from increasing the number of ports and the bandwidth per port of the ATM switch, a trend supported by advances in the field of ATM switching. Note that the availability of multiple clusters provides a greater number of options for data distribution. For example, popular documents can be replicated in multiple clusters and incoming requests for them can be assigned to one of these clusters. If the
Figure 7.5: Cluster Based Storage (CBS) architecture for MARS (a central manager and a packet switch connect multiple storage clusters, each consisting of a storage manager and a daisy chain of APICs serving storage nodes, with link interfaces to the high speed network)

number of documents is very large, a subset of them can be assigned to and served from each cluster. Also, documents of very long duration can be conveniently split into smaller parts, each of which can be stored independently on a separate cluster. Note that the packet switch in this architecture allows a connection from any input port to be switched and remapped to a different network level connection at any output port. This allows multimedia data to be striped not only over the storage nodes in a cluster, but over multiple clusters as well. The increased parallelism that results from a greater degree of data distribution can, in turn, support a larger number of concurrent accesses to the same document. In short, the CBS architecture is truly scalable in terms of the number of clients, storage capacity, and storage and network throughput.
MARS server with ATM Switch Interconnect
Figure 7.6: A prototype MARS server using an ATM switch (N Pentium PC storage nodes and a Pentium PC storage manager interconnected through an ATM switch)

The MARS server architecture illustrated in Figure 7.6 employs an off-the-shelf ATM switch as a server interconnect and uses ordinary PCs as storage nodes. One of these PCs acts as the master or central manager of the server and controls and co-ordinates the activities of the other PCs, which serve as slaves. Each PC runs a general purpose OS such as 4.4 BSD UNIX or Windows NT. The server that we have prototyped consists of up to eight storage nodes in the form of 200 MHz Pentium PCs equipped with ENI OC-3 ATM network interfaces and interconnected using a Bay Networks 155 Mbps per port ATM switch. An additional ethernet hub that connects these PCs is used as a control network. Each slave node uses a dual channel ADAPTEC 3940 ultra-wide SCSI controller, capable of 80 MBps rated SCSI throughput, and supports local storage of approximately 30 GB. The aggregate storage capacity of 250 GB in the server can support 100 hours of MPEG-2 video. Each PC node runs a local NetBSD (Version 1.3) operating system enhanced to handle periodic multimedia streams with the extensions described in Chapter 5. The aggregate storage and network
throughput of this prototype is expected to be 1.1 Gbps. Upon the availability of the 2.4 Gbps APIC interface chip [45], the existing network interface will be replaced with an APIC card. Sixteen such PCs will then be interconnected into a “storage cluster” by a desk area network constructed by daisy chaining the APIC chips. Eventually, several such storage clusters will be connected to the next generation 2.4 Gbps per port ATM switch to realize a multi-gigabit capacity storage server [23, 112].
7.4 Basics of Distributed Data Layout and Scheduling

In this section, we provide an introduction to the distributed data layout and scheduling schemes that are crucial to the scalability of the distributed storage based MARS architecture. We first describe the motivation behind data distribution or striping and then define Generalized Staggered Distributed Cyclic Layouts (GSDCL). We then describe a simple distributed scheduling scheme that is a direct outcome of such layouts.
7.4.1 Distributed Data Layouts

The periodic nature of multimedia data is well suited to spatial distribution or striping. For example, a video stream can be viewed as a succession of logical units repeating periodically at a fixed rate. A logical unit for video can be a single frame or a collection of frames, and the period of repetition can be the frame period, say 33 msec/frame, or an integral multiple thereof. Each such logical unit, or parts of it, can be physically distributed on different storage devices and accessed in parallel. Consider an example system with five storage nodes (numbered 0 to 4 from left to right) illustrated in Figure 7.7. The frames f0, f1, f2, f3, f4 are assigned to nodes N0, N1, N2, N3, N4 respectively. Frame f5 is again assigned to node N0, thus following a frame layout topology that looks like a ring. Given that there are N storage nodes, the ring topology of data distribution has the property that at any given node, the time separation between successive frames of the stream is N times the inter-frame time. This facilitates prefetching of data to mask the high rotational and seek latencies of magnetic disk storage. In our architecture, we use this basic idea to stripe multimedia data over several autonomous storage nodes within the server in a hierarchical fashion. Figure 7.8 illustrates a distributed data layout called the Generalized Staggered Distributed Cyclic Layout (GSDCL). This layout uses a basic unit called a “chunk” consisting of k consecutive frames. All the chunks in a document are of the same size and thus have a constant time length in terms
of playout duration.

Figure 7.7: Layout example (frames f0 through f14 assigned round-robin to five APIC-attached storage nodes)

Typically, the data for bandwidth intensive streams such as video, graphics, and animation documents is physically striped over multiple storage nodes, whereas the data for less bandwidth intensive streams such as audio, text, and data is confined to a single storage node. In the case of Variable Bit Rate (VBR) video such as MPEG video, a chunk therefore represents a Constant Time Length (CTL) but variable data length unit. In the case of a Constant Bit Rate (CBR) source, it also has constant size [31, 68]. Different documents may have different chunk sizes, ranging from k = 1 to k = Fmax, where Fmax is the maximum number of frames in a multimedia document. For MPEG compressed streams, the group-of-pictures (GOP) is one possible choice of chunk size. A chunk is always confined to one storage node. Successive chunks are distributed over the storage nodes using a logical layout topology. For example, in Figure 7.8 the chunks have been laid out using a ring topology. Each such ring is called a distribution cycle. The layout can thus be thought of as a succession of such cycles, each containing D chunks for a layout on D storage nodes. The first chunk in a distribution cycle is called the anchor chunk, and the node to which it is assigned is called the anchor node for that distribution cycle. As shown in Figure 7.8, the anchor node for successive distribution cycles is staggered by a stagger factor ks in a mod D order. Clearly, changing k and ks results in a new layout; thus, GSDCL_ks(k) defines a family of data layouts. Note that in this scheme, two consecutive chunks at the same node are separated in time by at least (D - ks) k Tf time units and at most (2D - ks) k Tf time units. Thus, if the chunk
is fetched as a single data unit, the stream is slowed down by at least a factor of (D - ks) from the perspective of each storage node, or the throughput required per stream from each storage node is reduced by a factor of (D - ks).

Figure 7.8: A Generalized Staggered Distributed Data Layout with chunk size k and stagger factor ks = 2 (distribution cycles C0 through C23, each chunk holding k frames such as f0 through fk-1, laid out over six APIC-attached storage nodes, Node 0 through Node 5)

This in turn helps in masking the large prefetch latencies introduced by very slow storage devices at each node. We evaluate these distributed layouts using two performance metrics. The first, called parallelism (Pf), is defined as the number of storage nodes participating concurrently in supplying the data for a document f. The second, called concurrency (Cf), defines the number of active clients that can simultaneously access the same document f. The value of Pf ranges from 1 to D, where D represents the number of storage nodes. Pf is D when the data is distributed over all nodes, whereas it is one when the entire document is confined to a single storage node. A higher value of Pf implies that a larger number of nodes is involved in the transfer of data for each connection/request, which in turn improves node utilization and proportionately increases concurrency. If each storage node n (n in [1 ... D]) has an available sustained throughput of Bn, and the average storage/network throughput required for accessing the document f is Rf, then the concurrency supported by a layout with parallelism Pf is
Cf = min_n ( Bn Pf / Rf )
From the above expression, we can see that concurrency is a function of the parallelism supported by the data layout. Higher concurrency is desirable, as it allows a larger number of clients to simultaneously access the same document and thus minimizes the need to replicate the document to increase concurrent accesses. Note that the GSDCL_ks layouts only define the assignment of chunks to storage nodes. Each chunk assigned to a node is stored on its local storage device using a node specific layout policy, which may be tailored to provide reliability and guaranteed data retrieval. For example, if the storage device used at the node is a RAID, the blocks of a frame/chunk assigned to the node are further striped on the disks in the RAID at byte or block level granularity. If the storage device is just a set of high capacity disks, the entire frame is stored on a single disk. In both cases, the actual data layout on the surface of the disk may follow a constrained allocation policy, similar to the one discussed in [117], to ensure bounded seek and rotational latencies in retrieving consecutive blocks. We refer to the GSDCL_ks layouts as Level-1 layouts, and to the node specific layouts as Level-2 layouts.
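The Level-1 assignment and the concurrency formula above can be sketched in a few lines. This is our reading of the layout, not code from the MARS prototype: we assume chunk i of a document occupies position i mod D of distribution cycle i div D, whose anchor node is staggered by ks per cycle; the function names are illustrative.

```python
# Sketch of the Level-1 GSDCL_ks(k) assignment: frame f belongs to chunk
# f // k; chunks are laid out in distribution cycles of D chunks, and the
# anchor node of cycle j is (j * ks) mod D.

def chunk_node(chunk, D, ks):
    cycle, pos = divmod(chunk, D)
    return (cycle * ks + pos) % D

def frame_node(frame, k, D, ks):
    return chunk_node(frame // k, D, ks)

def concurrency(node_bw, parallelism, stream_bw):
    """Cf = min_n (Bn Pf / Rf), the concurrency formula from the text."""
    return min(bw * parallelism / stream_bw for bw in node_bw)

# D = 6 nodes, unit chunks, stagger ks = 2: the second distribution cycle
# is anchored at node 2 instead of node 0.
first_cycle = [chunk_node(c, D=6, ks=2) for c in range(6)]       # 0,1,2,3,4,5
second_cycle = [chunk_node(c, D=6, ks=2) for c in range(6, 12)]  # 2,3,4,5,0,1

# Six nodes of 40 Mbps sustained throughput each, full parallelism (Pf = 6),
# 5 Mbps streams: 48 concurrent clients for one striped copy.
cf = concurrency([40] * 6, parallelism=6, stream_bw=5)
```

Setting ks = 0 recovers the plain distributed cyclic layout; any other ks rotates the anchor node from cycle to cycle, as in Figure 7.8.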
7.4.2 Data Striping Service

Figure 7.9 illustrates our distributed implementation of the data striping service. The storage nodes in our prototype server (Figure 7.6) are classified into two clusters: a recording cluster and a playback cluster. The nodes in the recording cluster run the recording server, which, in addition to the functionality described earlier in Chapter 4, provides a striping service to its clients. The striping server on the playback cluster is implemented as a collection of a master striped server (henceforth called MSTRIPED) run on the central manager and several slave striped servers (henceforth called SSTRIPED) - one on every slave storage node. The DNS name, the IP address, and the request port of the MSTRIPED are well advertised, whereas the same information about the SSTRIPEDs is available only to the MSTRIPED. The mechanics of data striping consists of the following steps: 1. The content creator (the user of the recording service) requests a recorded document to be striped by clicking on the COMMIT-&-STRIPE button in the record GUI in Figure 4.6. The striping activity and the regular recording functionality at the recording server are completely detached, and once striping is requested, the client can periodically check the status of the striping session. 2. The recordd server opens a session with the MSTRIPED over a TCP/IP connection and exchanges parameters such as the names of the video and audio files, distributed
layout properties such as the chunk size and stagger factor (Section 7.4 and Chapter 8), and several other small miscellaneous files (such as images and session descriptions).

Figure 7.9: Striping service (the recordd server opens a TCP/IP session with the MSTRIPED on the master server, which coordinates the SSTRIPEDs on the N slave servers of the playback cluster)

3. The MSTRIPED communicates with the slaves to request a striping session with the required amount of storage space. As part of the initialization, each slave checks if the required amount of storage space is available on the local storage and sets up appropriate directory entries for the new document. Upon successful completion of striping, each slave creates the meta files and the session description files that are used for playback. The data writes at each slave use the standard write interface and thus constitute non-real-time load on the storage system. The total duration of a striping session therefore depends on the disk load generated by the real-time playback services. 4. After the striping completes, the user can remove the document on the recording cluster using the delete local operation. The striped copy of the document can be removed using the delete global command.
The implementation of the striping service makes adding new data types and slave nodes very simple.
7.4.3 Distributed Scheduling: A Simple Scheme

Distributed scheduling can be defined as the periodic retrieval and transmission of data for all active connections from unsynchronized storage nodes. It is required to provide QOS guarantees to the active clients during normal playout as well as playout control operations. Such scheduling needs to be performed at multiple levels in the path of data retrieval and transmission. In particular, data reads and writes from multiple storage nodes must be synchronized for each active connection. Also, at each storage node, the read operations for all connections must be scheduled over the multiple disks in the array. Last but not least, at each individual disk, the head movement must be scheduled to ensure predictable latencies for the read and write accesses of all requests. In order to satisfy overall delay guarantees, each storage node has to ensure that the data retrieved from the devices is transmitted to the network as per the rate specification of the connection. This requires that the network interface card (NIC) at the node perform rate control or pacing on a per connection basis. As shown in Figure 7.10, each storage node maintains Ca buffers, one for each active connection. In a retrieval environment, the data read from the disks at the storage node is placed in these buffers and read by the network interface, which transmits it onto the interconnect and subsequently to the network. At the time of connection admission, every stream experiences a playout delay required to fill the corresponding buffer, after which the data is guaranteed to be periodically read and transmitted as per a global schedule. The global schedule consists of periodic cycles of time length Tc. Each cycle consists of three phases: data transmit, hand-over, and data prefetch. During the data transmit phase (TTx), the storage node reads the buffer and transmits it over the interconnect to the network.
Once this phase is over, the storage node sends control information (a control cell) to the downstream node, so that it can start its data transmit phase. The last phase in the cycle, the data prefetch phase (Tpf), is used by the storage node to prefetch the data for each connection that will be consumed in the next cycle. The cycle length Tc determines the buffer requirement and the network and storage bandwidth requirements. Tc depends on the natural inter-frame period Tf of the active streams. In the case of heterogeneous connections, the smallest inter-frame period Tf among all connections decides the value of Tc.

Figure 7.10: A simple scheme for reads (each cycle of length Tc at a node consists of a transmit phase TTx, a hand-over phase TH, and a prefetch phase Tpf, staggered across Node 0 through Node D-1; Tc: cycle time; TTx: time to transmit data for all active clients; TH: time to hand over the transmission to the downstream APIC; Tpf: time to prefetch data; Ca: number of active connections; D: number of storage nodes)

The three cases that arise are as follows:
Tc = (Tf + TH) x N: Since each node is assigned an equal share of the cycle, the transmit phase length in this case is Tf. This implies that in every cycle only a frame worth of data is pre-fetched for each connection.
Tc = ((Tf + TH) x N) / k, where k is an integer: In this case, less than a frame worth of data is pre-fetched and transmitted for each active connection in every cycle. This can help minimize the buffer requirement at each storage node. However, there are two overheads associated with this scenario: first, as the cycle time is reduced, the aggregate hand-over overhead can become a sizable fraction of the cycle time, and second, a smaller prefetch buffer makes it difficult to mask the seek, rotational, and transfer latency overheads.
Tc = k x ((Tf + TH) x N), where k is an integer: This case implies that more than one frame worth of data is fetched and transmitted for each active connection in
every cycle. Such a case allows more time to prefetch and mask the seek and rotation latencies. As an example, consider a prototype with 15 storage nodes (N = 15) serving a fixed number of video connections which have a frame period of Tf = 33 msec (ms). If the APIC interconnect bandwidth is 1.2 Gbps, the effective data bandwidth available is 563 Mbps, excluding the bandwidth lost due to the ATM header overhead in each cell. Using these values, if the cycle time is set to Tc = 33 x 15 = 495 ms, the length of
the transmit phase can be at most TTx = 33 ms. Thus, each storage node has to prefetch a frame worth of data for each connection every ((495 - 33) - TH) = 462 - TH ms. Assuming that TH is of the order of 1 ms, the prefetch time is 461 ms. A simple RAID constructed using disks with maximum rotational and seek latencies of 10 ms can deliver 5 MBps. Thus, a storage node with such a RAID can prefetch 2.3 megabytes of data in 461 ms. A buffer of this size can store approximately 27 frames, each of 84 KB - the average frame size of an MPEG encoded 20 Mbps HDTV video stream. Since each frame belongs to an independent connection, 27 independent HDTV connections are possible. The total bandwidth requirement of these connections is 27 x 20 Mbps = 540 Mbps, which is less than the effective interconnect bandwidth of 563 Mbps. Thus, 27 compressed HDTV clients can be supported simultaneously. Similar calculations show that approximately 110 MPEG compressed NTSC quality clients can be supported in this setup. It must be noted that disk and array level scheduling policies [40, 96, 102, 125] are required to guarantee retrieval of data for all active connections during the pre-fetching period. To summarize, the two advantages that accrue from our data layout and scheduling schemes are as follows:
First, the successive frames of a multimedia stream stored at a storage node are separated in time by a duration much longer than the normal frame period, and thus the effective period of the stream seen at each node is longer. This allows masking of the disk rotational and seek latencies and in turn allows other connections that access different data to be served between two successive frame retrievals of any stream. Second, since the frames of a given multimedia stream are physically distributed over multiple storage nodes, multiple clients can independently access different frames in the same stream by accessing different nodes. This allows a large number of concurrent accesses to the same or different data.
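The HDTV capacity arithmetic in the example above can be checked with a short calculation. The parameter values come from the text; the function name and the use of decimal units (1 MB = 1000 KB) are our simplifications.

```python
# Worked capacity example: 15 nodes, 33 ms frames, 1 ms hand-over,
# 5 MBps RAID per node, 84 KB HDTV frames at 20 Mbps per stream.

def hdtv_capacity(N=15, Tf_ms=33, TH_ms=1, array_MBps=5,
                  frame_KB=84, stream_Mbps=20, effective_bw_Mbps=563):
    Tc = Tf_ms * N                          # 33 * 15 = 495 ms cycle
    prefetch_ms = Tc - Tf_ms - TH_ms        # (495 - 33) - 1 = 461 ms to prefetch
    prefetch_KB = array_MBps * prefetch_ms  # 5 MB/s for 461 ms = 2305 KB (~2.3 MB)
    frames = prefetch_KB // frame_KB        # 2305 // 84 = 27 frames, one per stream
    # The supported streams must also fit in the effective interconnect bandwidth.
    assert frames * stream_Mbps <= effective_bw_Mbps   # 540 Mbps <= 563 Mbps
    return Tc, prefetch_ms, frames

Tc, prefetch_ms, clients = hdtv_capacity()
```

Substituting NTSC quality parameters in the same way reproduces the roughly 110 client figure quoted in the text.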
The scheduling scheme we described supports a high degree of concurrency and parallelism for data accesses to a single striped copy of a multimedia document. However, this simple scheme breaks down in the presence of active connections performing interactive playout control operations such as fast-forward/rewind. In the following section, we will describe the limitations of this scheduling scheme and later present a scheme that rectifies them.
7.5 Issues in Design of a Distributed Scheduling Scheme

A distributed scheduling scheme must support playout control operations such as fast-forward/rewind, random access, slow-play, and pause/resume with minimal latency, and must support buffered or bufferless clients with identical or different display rates. Also, the scheduling scheme must ensure that the interactive operations performed by one client do not affect the QOS guarantees of other active clients. We will show that operations such as fast-forward/rewind indeed complicate satisfying these requirements. In the following, we discuss various issues that affect the design of a scheduling scheme that aims to satisfy these requirements. Later, in Section 7.6, we present our novel scheduling scheme called BEat Directed Scheduling (BEADS) that addresses these issues.
7.5.1 Implications of Interactive Operations

The simple scheduling scheme described earlier in Section 7.4.3 leads to load imbalance in the presence of interactive operations. To understand this, consider an example of a MARS prototype with 15 storage nodes (D = 15), each with a disk array capable of providing a sustained effective storage throughput of 5 MBps. In our simple scheduling scheme above, each storage node groups and services all active connections together. Also, the sequence of transmissions from the storage nodes is identical for all connections and is dictated by the layout scheme. However, such a scheme breaks down when some subset of clients is performing interactive playout control. In Chapter 2, we outlined two schemes, namely a Rate Variation Scheme (RVS) and a Sequence Variation Scheme (SVS), to implement interactive playout control operations such as fast-forward, rewind, and slow-play. We argued that fast-forward and rewind operations are best implemented using the sequence variation method, wherein the sequence of frames displayed is altered but the display rate is kept constant to keep the network and storage throughput requirements unaltered. For example, a fast-forward at twice the regular playback rate would be implemented by skipping every alternate frame. However,
such frame skipping leads to potential load imbalance situations in the distributed storage server. Consider a connection in an example system with D = 6 storage nodes. In this example, the set of nodes from which the frames are retrieved in normal playout is {0, 1, 2, 3, 4, 5, 0, 1, ...}. Upon fast-forward, this node set is altered to {0, 2, 4, 0, 2, 4, ...}. Clearly, for the connection under consideration, during fast-forward operations the load on the even numbered nodes doubles and the load on the odd numbered nodes drops to zero. Another serious implication is that, as the display rate is constant during fast-forward, node 2, for example, must retrieve and transmit data in a time position which is otherwise allocated to node 1 in normal playout. Thus, in addition to the creation of “hot-spots” or load imbalance, stream control alters the sequence of node visits from the normal linear (modulo D) sequence, and the transmission order is no longer the same for all connections when some of them are doing fast forward or rewind. Therefore, transmissions for all connections can no longer be grouped into a single transmission phase.
Figure 7.11: Revised schedule when C0 performs fast forward (two consecutive cycles, connections C0 through C3, nodes 0 through 5)

Figure 7.11 illustrates this with an example of a system with D = 6 nodes and 4 active connections, of which C0 is performing fast-forward. It shows two consecutive (ith and (i + 1)th) cycles. The transmission order in the ith cycle is represented by the ordered node set Splay = {0, 1, 2, 3, 4, 5}, which is identical for all connections. When the fast-forward request for connection C0, received in the ith cycle, becomes effective, the transmission order for it is altered to the ordered node set Sff = {0, 2, 4, 0, ...} in the (i + 1)th cycle. The transmission order for the other connections is unchanged.
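The hot-spot effect described above can be reproduced with a few lines. This is a toy model under the assumption of a unit chunk size, so frame i of a connection resides on node i mod D; the names are illustrative, not from the prototype.

```python
# Node-visit sequences under sequence-variation fast forward: a speedup
# of s displays every s-th frame, so the i-th displayed frame comes from
# node (i * s) mod D.

def node_sequence(D, speedup, count):
    return [(i * speedup) % D for i in range(count)]

normal = node_sequence(D=6, speedup=1, count=8)   # nodes 0,1,2,3,4,5,0,1
ff2x = node_sequence(D=6, speedup=2, count=6)     # nodes 0,2,4,0,2,4

# Per-node request count during 2x fast forward: even nodes carry the
# whole load, odd nodes are idle -- the load imbalance from the text.
load = {n: ff2x.count(n) for n in range(6)}
```

With D = 6 and a speedup of 3, only nodes 0 and 3 are ever visited, so the imbalance worsens as the speedup grows (whenever the speedup shares a factor with D).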
Figure 7.12: General case of M out of Ca connections doing fast forward

Figure 7.12 illustrates the transmission activity at node i and on the APIC interconnect when M out of Ca active connections are performing fast-forward. At a typical node i, the transmission occurs in multiple phases, one of which is for connections performing normal playout and the rest of which are for connections performing fast-forward. These phases cannot be combined into a single phase, as the transmission order of the M connections performing fast-forward is not identical. The sequence of frames appearing on the wire consists of two sub-sequences: a sequence of frames transmitted from a single APIC for connections in normal playout, followed by frames transmitted from possibly all APICs for connections performing fast-forward. It must, however, be noted that at any time only one APIC transmits frames for a given connection. The side effect of this revised schedule is that the prefetch and transmission phases at a storage node overlap. In the presence of a large number of connections doing fast-forward and rewind, this overlap makes it difficult for cyclic prefetch scheduling to guarantee that the data to be transmitted will be available in the buffers. Thus, a distributed scheduling scheme must decouple the prefetch and transmit operations at a node and allow only the data transmissions from different nodes to be independently synchronized. The above discussion of the implications of fast-forward/rewind on distributed scheduling implicitly assumed that the required frame skipping can be performed efficiently. This,
however, may not be true. When a storage node serves a connection in normal playout mode, it can minimize the seek and rotational latency by fetching large chunks in a single seek operation. However, during frame skipping for fast-forward/rewind, individual frames must be read, which requires repositioning the disk head after every frame retrieval. For the large skipping distances and small frame sizes common in compressed streams, each such read will suffer a seek and rotation penalty. Such penalties can be minimized under heavy load if the prefetch load of multiple connections is randomly distributed over each disk and efficient disk scheduling algorithms such as those reported in [120] are used. However, under low or moderate loads, frame skipping may lead to poor disk utilization and cycle overflows. Also, if the Level-1 layout uses a non-unit chunk size, frame skipping causes load imbalance. In other words, frame skipping is suitable only for GSDCL layouts that use a unit chunk size. An alternate approach to dealing with this problem is to always use a Level-1 layout with a chunk size of k frames and implement fast-forward/rewind by increasing the granularity of skipping to chunks. This kind of chunk skipping is analogous to the segment skipping discussed in [33]. It has the advantage that during normal playout as well as fast-forward/rewind, chunks are read from the disk in much the same way, without any additional seek/rotation penalties, and thus the average storage and network bandwidth requirement remains unchanged. However, the visual quality of such chunk skipping is likely to be unacceptable for large chunk sizes.
7.5.2 Implications of Granularity of Data Prefetch and Transmission at a Node

The prefetch and transmit options available to a node are described in terms of the prefetch granularity Fg, defined as the amount of data pre-fetched per connection as a single unit in a scheduling cycle, and the transmit granularity Tg, defined as the amount of data transmitted per connection in a single cycle. Both of these parameters, specified in terms of a number of frames, depend on the amount of per-connection buffer at the server, the type of network service used for data transport between the server and the client, the design of the data layout, and the buffer available at the client. Since prefetch and transmit follow a producer-consumer relation, the prefetch granularity (Fg) must always be greater than or equal to the transmit granularity (Tg) to ensure correct operation. If the server uses a Level-1 GSDCL_ks data layout with a non-unit chunk size of k frames, the storage nodes can prefetch an entire chunk as a single unit. Such chunks
can be stored contiguously on the storage devices at the node, and the seek/rotational latency overhead can be amortized over such large reads. Though a smaller chunk size reduces the buffer requirement, it results in burstier data retrieval and increased seek overheads. In the case of a buffered client, the server can transmit the pre-fetched chunk as a single burst at a high rate, such as the link rate. However, this requires that the network support a transport service that allows reliable and periodic transmission of such large bursts of data with minimal burst loss and/or blocking probability. Supporting a large number of such active connections is a non-trivial task for a network designer. The other option is to transmit the chunk frame-by-frame. However, with highly bursty variable bit rate (VBR) sources such as MPEG, frame-by-frame transmission leads to a high peak-to-average bandwidth variation. Such burstiness in bandwidth demand can be smoothed by computing the transmission rate over the duration of the entire chunk, or a fraction of the chunk, a technique commonly called lossless smoothing [74, 99]. Such smoothing, however, requires a small smoothing or playout buffer at the client side. In short, pre-fetching data in chunks and frame-by-frame transmission using lossless smoothing will result in optimum storage node performance.
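As a rough illustration of smoothing over a chunk (illustrative numbers only; this is not the algorithm of [74, 99]), the transmission rate can be computed as the chunk's average rate rather than its per-frame peak:

```python
def smoothed_rate_bps(frame_sizes_bits, frame_period_s):
    # Lossless smoothing over one chunk: transmit at the chunk's
    # average rate instead of the per-frame peak rate.
    chunk_duration = len(frame_sizes_bits) * frame_period_s
    return sum(frame_sizes_bits) / chunk_duration

# A bursty 4-frame chunk at 30 frames/s (frame sizes are made up):
sizes = [200_000, 40_000, 40_000, 120_000]       # bits per frame
peak = max(sizes) / (1 / 30)                     # rate if sent frame-by-frame
avg = smoothed_rate_bps(sizes, 1 / 30)           # rate when smoothed over the chunk
print(peak, avg)
```

Here smoothing halves the required peak bandwidth (3 Mbps average vs. 6 Mbps peak), at the cost of a small client-side playout buffer.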
7.6 BEat Directed Scheduling (BEADS) Scheme

In this section, we illustrate the basic scheme and the data structures used to schedule periodic data retrieval and transmission from the storage nodes. Note that in addition to this distributed scheduling, each storage node has to schedule reads from the disks in its disk array and optimize disk head movements. Also, the explanation of our scheduling scheme that follows uses a GSDCL_ks layout with unit chunk size and Fg = 1, Tg = 1, because it is conceptually the easiest to understand. In a typical scenario, a client sends a request to the server to access a multimedia document stored at the server. This request is received and processed by the central manager at the server, shown in Figure 7.1. Specifically, the central manager consults an admission control procedure which, based on current resource availability, admits or rejects the new request. If the request is admitted, a network connection to the client, with appropriate QOS, is established. The central manager informs the storage nodes of this new connection, which in response create or update appropriate data structures and allocate sufficient buffers. If an active client wants to effect a playout control operation, it sends a request to the server. The central manager receives it and, in response, instructs the storage nodes to change the
transmission and prefetch schedule. Such a change can add, in the worst case, a latency of one scheduling cycle [1].

The global schedule consists of two concurrent and independent cycles: the "prefetch cycle" and the "transmission cycle", each of length T_cycle. During the prefetch cycle, each storage node retrieves and buffers data for all active connections. In the overlapping transmission cycle, the node transmits the data retrieved in the previous cycle; that is, the data transmitted in the current ith cycle is pre-fetched during the previous (i-1)th cycle. A ping-pong buffering scheme facilitates such overlapped prefetch and transmission. Each storage node maintains a pair of ping-pong buffers, which are shared by the Ca active connections. The buffer that serves as the prefetch buffer in the current cycle is used as the transmission buffer in the next cycle, and vice-versa. The network interface at the storage node reads the data for each active connection from the transmit buffer and paces the cells, generated by AAL segmentation, onto the ATM interconnect and to the external network, as per a rate specification. Note that the cells for all active connections are interleaved together.

Table 7.1: Prefetch information at a node

VCI  | No. of Frames | Frame IDs | Frame Address    | Buffer Descriptor
1    | 1             | 8         | addr_8           | bufdescr_8
2    | 2             | 4, 5      | addr_4, addr_5   | bufdescr_4, bufdescr_5
3    | 1             | 1000      | addr_1000        | bufdescr_1000
...  | ...           | ...       | ...              | ...
100  | 1             | 8500      | addr_8500        | bufdescr_8500
Each storage node has its own independent prefetch cycle, in which it uses the prefetch information, illustrated in Table 7.1, for each active connection to retrieve the data. Specifically, the per-connection prefetch information consists of the following basic items, stored in a data structure called the Prefetch Information Table (PIT): 1) the number of frames to be pre-fetched in the current cycle, 2) the identification (ID) numbers of the frames to be fetched, 3) the meta-data required to locate the data on the storage devices at the storage node, and 4) the buffer descriptors that describe the buffers into which the data retrieved in the current cycle will be stored. Thus, in the example of Table 7.1, for VCI = 2, two frames f = 4, 5 need to be fetched using addresses addr_4 and addr_5 into the buffers described by bufdescr_4 and bufdescr_5. Typically, the buffer descriptors and the buffers will be allocated dynamically in each cycle.

[1] A cycle is typically a few hundred milliseconds in duration.
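The PIT rows of Table 7.1 might be modeled as follows (a sketch with illustrative field names; the actual in-kernel structures differ):

```python
from dataclasses import dataclass

@dataclass
class PITEntry:
    # One Prefetch Information Table row for an active connection.
    vci: int
    frame_ids: list        # frames to prefetch in the current cycle
    frame_addrs: dict      # frame id -> on-device address (meta-data)
    buf_descrs: dict       # frame id -> descriptor of the prefetch buffer

# The VCI = 2 row from Table 7.1:
pit = {
    2: PITEntry(vci=2, frame_ids=[4, 5],
                frame_addrs={4: "addr_4", 5: "addr_5"},
                buf_descrs={4: "bufdescr_4", 5: "bufdescr_5"}),
}
print(len(pit[2].frame_ids))  # number of frames to fetch for VCI 2
```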
Figures 7.13 and 7.14 illustrate the mechanics of transmit scheduling. As shown there, the global transmission cycle consists of D identical sub-cycles, each of time length T_cycle/D. The end of a sub-cycle is indicated by a special control cell, called a beat, sent periodically on the server interconnect. The central manager reserves a multicast control connection, with a unique VCI, that is programmed to generate these control cells at a constant period of T_cycle/D. Each of the storage nodes in the interconnect copies the cell to the storage node controller. Each storage node counts these cells to know the current sub-cycle number and the start/end of the cycle. Also, for each active connection, the storage nodes remember the startBeat at which the session started. The central manager decides this beat number and informs all the slave nodes using a reliable command connection. During each beat, using the currentBeat, the startBeat, and the properties of the data layout, each node can completely determine the media frames/chunks currently being pre-fetched and/or transmitted and the nodes performing those activities. Interactive operations such as rewind and pause/resume modify the startBeat. Note that in the event of a loss of a beat, the nodes may miss their turn in the distributed schedule and must free up data that may have been pre-fetched. This, however, is extremely undesirable, as the frames dropped at the server are not retransmitted, resulting in playback glitches. In the event of a loss, the local state variables that keep track of pre-fetching must be updated to re-synchronize the scheduling. Clearly, beat losses must be very infrequent, which requires that the server interconnect provide strict QOS: zero loss, low delay, and guaranteed bandwidth for the control connection.
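The beat-counting logic can be sketched as follows (a hypothetical helper, assuming unit chunk size and the linear modulo-D node sequence of normal playout):

```python
def schedule_position(current_beat, start_beat, D):
    # From the beat count, derive which cycle and sub-cycle a session
    # is in; with a linear layout, the transmitting node's id equals
    # the sub-cycle number (offset by the session's starting node).
    elapsed = current_beat - start_beat
    return elapsed // D, elapsed % D   # (cycle number, sub-cycle number)

# A session that started at beat 3, observed at beat 13, with D = 6 nodes:
print(schedule_position(13, 3, 6))     # (1, 4): cycle 1, sub-cycle 4
```

Because every node runs this same deterministic computation on the shared beat count, no per-beat coordination messages are needed.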
Even though networking technologies such as ATM are promising in this regard, research efforts similar to ours [20] have suggested the use of a separate low-latency, reliable control network to connect the storage nodes. Scheduling information such as beats sent over this network does not interfere with the high-bandwidth data.
Figure 7.13: Cycle and Sub-cycle

One of the main data structures used by each node for transmission scheduling is the Sub-cycle Use Table (SUT). The ith entry in this table lists the set of VCIs for which data will be transmitted in the ith sub-cycle. This table is computed by the storage node at the start of each cycle, using connection state information such as the playout state.
Figure 7.14: Distributed scheduling implementation

Table 7.2: Frame and node sets for all connections

VCI | Frame set S_f | Node set S_n
10  | 4, 5, 6, 7    | 1, 2, 3, 0
11  | 8, 9, 10, 11  | 2, 3, 0, 1
12  | 0, 3, 6, 9    | 0, 3, 2, 1
13  | 4, 5, 6, 7    | 0, 1, 2, 3
Figure 7.15 illustrates distributed scheduling with an example. This example shows two documents: Document A, stored using the GSDCL_ks layout with stagger distance ks = 1, and Document B, stored using the GSDCL_ks layout with stagger distance ks = 0. Of the four active connections indicated, the connections with VCI = 10, 11 are accessing Document A, and the connection with VCI = 13 is accessing Document B, in normal play mode. On the other hand, VCI = 12 is accessing Document B in fast-forward mode by displaying every third frame. Table 7.2 illustrates, for each connection, the transmission frame set S_f and the ordered set S_n of nodes from which these frames are transmitted during the current transmission cycle. The frame set defines the frames of a connection that are transmitted in a given cycle, whereas the node set defines the set of nodes that supply the frames in the
frame set.

Figure 7.15: An example of connections in different playout states

For example, in Table 7.2, the frame set S_f for connection VCI = 10, accessing Document A, contains frames 4, 5, 6, 7, and since these frames are supplied by nodes 1, 2, 3, 0 respectively, the node set contains the node ids of these nodes. Using such a table, the SUT at each node can be constructed. For example, node 0 transmits for connections 12 and 13 in slot 0, for connection 11 in slot 2, for connection 10 in slot 3, and remains idle during slot 1. The SUT at node 0 in Figure 7.15 records this information. Also, note that the SUTs at all the nodes contain exactly four VCI entries and one NIL entry per cycle, indicating that the load is balanced over all the nodes.
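The SUT construction from Table 7.2 can be sketched as follows (illustrative code, using the node sets from the table with D = 4):

```python
# Build each node's Sub-cycle Use Table (SUT) from the per-connection
# node sets of Table 7.2 (D = 4 nodes, 4 sub-cycles per cycle).
D = 4
node_sets = {10: [1, 2, 3, 0], 11: [2, 3, 0, 1],
             12: [0, 3, 2, 1], 13: [0, 1, 2, 3]}

# sut[node][subcycle] = list of VCIs that node serves in that sub-cycle
sut = {node: {s: [] for s in range(D)} for node in range(D)}
for vci, nodes in node_sets.items():
    for subcycle, node in enumerate(nodes):
        sut[node][subcycle].append(vci)

print(sut[0])  # node 0: VCIs 12, 13 in slot 0; idle in slot 1; 11 in slot 2; 10 in slot 3
```

The result for node 0 matches the SUT described in the text, and each node ends up with four VCI entries and one idle slot per cycle.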
7.7 A Prototype Distributed Playback Service

Figure 7.16 illustrates the distributed MOD playback service. The client GUI application used to access this service is the same as the one described in Chapter 4 and is activated in the same way (by clicking on a hypertext link in an auto-generated web page). Our distributed playback server consists of a master disthttpd (henceforth called MHTTPD) server run on the central manager and slave disthttpd (henceforth called SHTTPD) servers run on the slave nodes. The MHTTPD maintains the following control connections with the slaves (Figure 7.17):
Figure 7.16: Distributed multimedia playback

Beat channels: The MHTTPD maintains uni-directional multicast connections called beat channels to all the slaves. It sends timing information in the form of periodic marker or beat packets/cells on these channels. The period of the beat is decided by the chunk size used for striping the documents. For example, with a chunk size of k frames, the beat period is set to kTf, where Tf is the frame period. Several beat channels, each with a distinct frequency, may be maintained to support multiple chunk sizes. In our prototype, the beat channel is implemented using IP multicast and the beats are transmitted using the UDP transport protocol. Due to the limit on the number of VCs that can be activated from the ENI ATM card, we did not use ATM one-to-many multicast to implement the beat channel.

Command channels: The MHTTPD maintains a full-duplex point-to-point reliable connection (implemented as a TCP/IP connection) to each slave. The master and the slaves use these channels to exchange control commands for distributed scheduling and to process playout control requests received from the clients.

The client application communicates only with the master server MHTTPD over a TCP/IP connection and exchanges the standard commands described in Section 4.4.1. In response to an OPEN SESSION command received from the client, the MHTTPD uses the command channels to request the slaves to initialize a session. It also performs appropriate signaling operations to set up a many-to-one ATM connection from all the slaves to the output port to which the client device is connected. Each slave performs local admission control on the new request, and if the session is accepted, each slave opens the appropriate data and
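The beat-period relationship described above can be written out directly (illustrative values; any 30 fps stream with 8-frame chunks would behave the same way):

```python
# With a striping chunk size of k frames and frame period Tf seconds,
# the beat period is k * Tf (one beat per chunk switch-over).
k, Tf = 8, 1 / 30                   # e.g., 8-frame chunks of 30 fps video
beat_period = k * Tf                # seconds between beats
beats_per_second = 1 / beat_period  # beat frequency of this channel
print(beat_period, beats_per_second)
```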
meta files, initializes data prefetch, and sets up the network connections using information passed by the master. A separate control command from the master informs each slave when the data transmission should commence for the new session.

Figure 7.17: Two control connections in the playback server
[1] Beat B_i received at node i; reactive RTU activated after a small delay. [2] Periodic RTU runs for k invocations to perform data fetch and transmit as per the distributed schedule. [3] Periodic RTUs suspend a "guard time" before the next beat is received.
Figure 7.18: Implementation using RTUs

Figure 7.18 illustrates the activity at each slave node in response to a beat received on the beat channel. Each slave initializes two RTUs: a beatRTU and a serviceRTU. The beatRTU is a reactive RTU set up to respond with low latency to the arrival of periodic beats from the master. The beatRTU handler schedules the serviceRTU, which is a periodic RTU. If the beat period is kTf, where k is the chunk size and Tf is the inter-frame time, the period of the serviceRTU is set to Tf. Thus, there are k invocations of the serviceRTU for every beat. Clearly, a small chunk size k results in frequent beats and requires frequent switchover of fetch/transmission activity among nodes. Also, a small chunk size minimizes the per-connection prefetch buffer requirement at each node and allows the node to prefetch entire chunks for all connections before the node becomes eligible to transmit. However, unless the distributed scheduling overheads are minimal and the client has a playout buffer to mask such events, small chunk sizes can lead to poor playback quality. On the other hand, large chunk sizes reduce the distributed scheduling beat frequency, but complicate pre-fetching. If the fetch granularity is an entire chunk, the aggregate buffer requirement for all sessions will be high. On the flip side, if the fetch granularity is less than a chunk, the transient overload on the disks at a node will be higher and will reduce concurrency. For a given chunk size, during each serviceRTU invocation in the beat period, the slave consults the Sub-cycle Use Table to decide if a given session is eligible for transmission. For every eligible session, the slave performs data transmission over the many-to-one ATM connection. Recall that for every session, at any given time only one slave node transmits data to the client. However, if under overload conditions the data transmission for a session is not complete before the arrival of the next beat, two consecutive nodes in the data transmission schedule (Section 7.6) overstep or collide with each other. Due to the lack of support in our ATM switch for concurrent senders on a many-to-one multicast connection ([57, 109]), such collisions do not get serialized and thus result in data corruption. Such collisions can be avoided if the transmissions complete a guard time before the arrival of the next beat. However, this requires that the data transmissions occur at a rate higher than the minimum streaming rate.
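The guard-time constraint reduces to a simple rate calculation (a sketch with hypothetical numbers, not measured values from the prototype):

```python
def min_tx_rate_bps(chunk_bits, beat_period_s, guard_s):
    # Rate needed so a chunk's transmission finishes a guard time before
    # the next beat, avoiding sender collisions on the many-to-one
    # connection when the next node takes over.
    usable = beat_period_s - guard_s
    assert usable > 0, "guard time must be smaller than the beat period"
    return chunk_bits / usable

# Example: a 1 Mbit chunk, 100 ms beat period, 10 ms guard time.
# The node must transmit at ~11.1 Mbps even though the streaming
# rate over the full beat period would be only 10 Mbps.
print(min_tx_rate_bps(1_000_000, 0.100, 0.010))
```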
Also, the client end must contain a playout buffer to absorb the small data bursts and avoid data loss. The availability of such a buffer allows for variance in the "switch-over" of transmission activity between successive nodes and relaxes otherwise stringent requirements on the distributed scheduling.
7.8 Performance of Distributed Playback

In this section, we describe the experiments we conducted to characterize the performance of distributed playback.
Figure 7.19: Reduction in throughput requirements (per-node session throughput in Mbps, plotted per node id, for the Seinfeld stream striped over 1, 2, and 3 nodes and the IndianRail stream striped over 1 and 2 nodes)
7.8.1 Effect of Number of Nodes

In this experiment, we measured the effect of increasing the number of storage nodes. We activated a set of video connections from servers with N = 2, 3 storage nodes and measured the per-session throughput at each node. Figure 7.19 shows a bar chart of the throughput from each node for two different movies, IndianRail and Seinfeld. We can clearly see that as the number of nodes is increased from 1 to 3, the per-node session throughput drops by approximately the same factor. For example, the Seinfeld video in our measurements is an 8.63 Mbps MJPEG VBR stream. When it was striped on two nodes and played back, we observed 4.306007 Mbps from node 0 and 4.313195 Mbps from node 1. Similarly, when the same video was striped over three nodes, during regular playback we measured 2.869159 Mbps from node 0, 2.895161 Mbps from node 1, and 2.873090 Mbps from node 2. Similar observations hold true for the IndianRail video. Thus, increased striping does reduce the per-session throughput requirement at each node. However, this comes at an added cost in the form of synchronization overhead for distributed playback. In the following experiments, we characterize this overhead.
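The measured values are close to the ideal 1/N division of the stream rate, as a back-of-the-envelope check using the figures quoted above shows:

```python
# Ideal per-node session bandwidth when an 8.63 Mbps stream is striped
# over N nodes; the measured values (~4.31 Mbps for N=2, ~2.87-2.90 Mbps
# for N=3) are within a few percent of these.
stream_mbps = 8.63
for n in (1, 2, 3):
    print(n, stream_mbps / n)
```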
7.9 Related Work

High performance I/O has been a topic of significant research in the realms of distributed computing and supercomputing for quite some time now. In recent years, the interest in integrating multimedia data into communications and computing has led to a flurry of activity in supporting high performance I/O that satisfies the special requirements of this data. Here we summarize some notable efforts.
7.9.1 High Bandwidth Disk I/O for Supercomputers

The work of Salem et al. [100] represents some of the early ideas on using disk arrays and associated data striping schemes to improve effective storage throughput. Observing that large disk arrays have poor reliability and that small disks outperform expensive high-performance disks in price vs. performance, Patterson et al. [91] introduced the concept of RAID. A RAID is essentially an array of small disks with simple parity-based error detection and correction capabilities that guarantee continuous operation in the event of a single disk failure in a group of disks. The RAID was expected to perform well for two diverse types of workloads. One type, representative of supercomputer applications such as large simulations, requires infrequent transfers of very large data sets. The other type, commonly used to characterize distributed computing and transaction processing applications, requires very frequent but small data accesses [91]. However, measurements on the first RAID prototype at the University of California, Berkeley revealed poor performance and a less than expected linear speedup for large data transfers [35]. The excessive memory copying overhead due to the interaction of caching and DMA transfers, and the restricted I/O interconnect (VME bus) bandwidth, were cited as the primary reasons for the poor performance. Also, it is now recognized that large RAID disk arrays do not scale very well in terms of throughput. The recent work on RAID-II at the University of California, Berkeley has attempted to use the lessons learned from the RAID prototype implementation to develop high bandwidth storage servers by interconnecting several disk arrays through a high speed HIPPI network backplane [75]. Its architecture is based on a custom board design called the Xbus Card, which acts as a controller for multiple arrays and interfaces to HIPPI as well as FDDI networks.
Though the measurements on RAID-II have demonstrated good I/O performance for large transfers, the overall solution employs FDDI, Ethernet, and HIPPI interconnects and is ad hoc. Also, it has not been demonstrated to be suitable for real-time multimedia, where the application needs are different from those of supercomputer applications.
7.9.2 Multimedia Servers

A significant amount of research has attempted to integrate multimedia data into network-based storage servers. However, most of it has addressed different dimensions of the problem, such as operating system support, file system design, storage architecture, meta-data design, and disk scheduling, in an isolated fashion. Here we categorize and summarize some of the notable research efforts.

Multimedia File Systems

One of the early qualitative proposals for an on-demand video file system is reported in Sincoskie [103]. The work by Rangan et al. [117, 118] developed algorithms for constrained data allocation, multi-subscriber servicing, and admission control for multimedia and HDTV servers. However, this work assumes an unrealistic single-disk storage model for data layout. It is worth repeating that such a model is inappropriate, as the transfer speed of a single disk will be barely sufficient to support a single HDTV channel and is about three orders of magnitude lower than that required to support a thousand or more concurrent customers independently accessing the same data. Some of the recent work by Vin et al. [119, 120, 102] focuses on developing statistical admission control algorithms for a disk array based server capable of deterministic and/or statistical QOS guarantees. Keeton et al. discuss schemes for the placement of sub-band encoded video data in units of constant playout length on a two-dimensional disk array [68]. They report simulation results which conclude that storage of multi-resolution video permits service to more concurrent clients than storage of single-resolution video. Similarly, Zakhor et al. report the design of schemes for placing scalable sub-band encoded video data on a disk array. They focus only on the path from the disk devices to the memory and evaluate, using simulation, layouts that use the CDL and CTL units mentioned earlier in Section 7.4 [31, 32]. However, none of these papers addresses issues in the implementation of interactive operations.

Storage Server Architecture

Hsieh et al. evaluate a high-performance symmetric multiprocessor machine as a candidate architecture for a large scale multimedia server [64].
They report measurements carried out on a fully configured Silicon Graphics high-end symmetric multiprocessor workstation called the SGI ONYX. They present two data layout techniques, called Logical Volume Striping and Application Level Striping, that use multiple parallel RAID-3 disk arrays to increase concurrency and parallelism. The focus of their work thus far has been on using measurements to characterize the upper limit on the maximum number of concurrent users in different scenarios, such as various types of memory interleaving, various data striping schemes, and multiple processes accessing a single file or different files.
Similar to our work, they propose a three-level layout: the lowest level is the RAID level 3 byte striping. The second level, called Logical Volume Striping, allows data to be striped over multiple RAIDs belonging to a logical volume. The last level, called Application Level Striping, allows applications to stripe data over multiple logical volumes. The data layout reported in this paper uses data units of size 32 KB, based on the assumption that each video frame is of that size. The platform used in their measurements featured eight 200 MIPS processors, up to 768 Mbytes of multi-way (4-way or 8-way) interleaved RAM, and up to 30 RAID-3 disk arrays. The platform employs a very fast system bus (about 1.2 GBps capacity) and multiple I/O buses (each about 310 MBps capacity). Thus, the prototype architecture to be used as a VOD server is an expensive high-end parallel processing machine. All the measurements reported in [64] assume that the clients are homogeneous and perform only normal playout. In other words, the cases in which a subset of the active clients are performing interactive playout control or require different display rates have not been considered. Also, even though the authors recognize that they will require some real-time scheduling to optimize the disk head movements and retrievals from different disks, this particular paper does not report any details on this topic. Similarly, buffer management, scheduling, network interfacing, and admission control have not been investigated. Biersack et al. have recently proposed an architecture very similar to ours, called the Server Array [15, 16]. This architecture uses multiple geographically distributed storage nodes, interconnected by an external network such as a LAN, working together as a single video server. It also proposes the use of Forward Error Correction (FEC) codes to improve reliability.
However, the loosely-coupled distributed server approach of this work makes it difficult to support interactive operations with sub-second latency for a large number of concurrent clients. Lougher et al. have reported the design of a small scale Continuous Media Storage Server (CMSS) that employs append-only log storage, disk striping, and hard real-time disk scheduling [77]. Their transputer-based implementation handles very few simultaneous customers and supports small network bandwidth. This implementation is clearly not scalable to large numbers of users or to high bandwidth streams such as HDTV. Tobagi et al. [113] report a small scale PC-AT and RAID based video server. Similarly, Kandlur et al. [125] describe a disk array based storage server and present a disk scheduling policy called Grouped Sweeping Scheduling (GSS) to satisfy the periodic requirements of multimedia data. However, all these server proposals are PC-based and thus not scalable. Also, they do not support high parallelism and concurrency.
The work reported in this chapter has several similarities with the Tiger File System project at Microsoft [20]. Tiger is a distributed, fault-tolerant, real-time file server that uses a distributed storage architecture consisting of PCs, called cubs, controlled by a central controller. It stripes constant-bit-rate (CBR) data such as video and audio over the cubs in fixed-length data units. The striped data is played back to clients over a broadband network using a schedule distributed by the controller via a control network. Unlike our project MARS, the storage nodes (cubs) in Tiger employ the Windows NT operating system, enhanced to support a zero-copy data path between disk and network. However, this data path is different from our mmbuf system and does not use any priority-based disk queuing.
7.10 Summary

In this chapter, we addressed the challenging problem of designing large scale multimedia storage servers. We proposed the Massively-parallel And Real-time Storage (MARS) architecture to meet the requirements of large scale, concurrent accesses with minimal storage replication. We described prototype implementations of the MARS server that use an innovative ATM-based interconnect and provide a direct path for data from the storage devices to the network. In conjunction with the MARS architecture, we illustrated a distributed data layout scheme and associated scheduling schemes that support high concurrency and parallelism.
Chapter 8

Load Balance Properties of Distributed Layouts

In the previous chapter, we described distributed data layouts called Generalized Staggered Distributed Cyclic Layouts (GSDCL) designed in conjunction with our distributed storage server architecture. We also showed that the implementation of interactive operations such as fast-forward and rewind on documents stored using such layouts leads to potential load imbalance in the storage server. In this chapter we investigate the "load-balance properties" of these GSDCL layouts that are crucial to an efficient implementation of interactive operations. The remainder of this chapter concerns the various issues in the design of such data layouts and is organized as follows. Section 8.1 motivates and defines the load-balance properties of GSDCL layouts. It also defines a performance metric called the "safe skipping distance" (SSD) (df, dr) that guarantees load-balance during fast-forward/rewind. Section 8.2 defines the basic problem of computing the safe skipping distances and develops the basic equations that describe the GSDCL layouts. Section 8.3 presents the theorem that characterizes the SSDs for the simplest GSDCL layout and motivates the need for additional data layouts to provide a richer choice of fast-forward/rewind speeds. Section 8.4 presents the theorem that characterizes the SSDs for the GSDCL_1 data layout and shows that combining this layout with GSDCL_0 provides almost all skipping distances. Section 8.5 describes a conjecture that characterizes the load balance properties of the most general GSDCL_ks data layouts. Section 8.6 presents issues specific to fast-forward/rewind operations on MPEG documents. In Section 8.7 we present the related work. Finally, Section 8.8 summarizes this chapter.
8.1 Load Balance Properties of GSDCL_ks Layouts

[Figure 8.1: A Generalized Staggered Distributed Data Layout. Chunks of k frames each (f_0...f_{k-1}, f_k...f_{2k-1}, ...) are distributed in distribution cycles C0-C23 over six APIC-attached storage nodes (Node 0-Node 5) with a stagger factor ks = 2.]

We recall that the Generalized Staggered Distributed Cyclic Layouts (GSDCL), illustrated in Figure 8.1, are characterized by two parameters: the chunk size (k) and the stagger factor (ks). The chunk size, defined as a constant time length unit with k media frames, determines the size of the data distribution unit. The stagger factor decides how these data units are physically distributed among the storage nodes. Both these parameters have implications for the implementation of fast-forward/rewind operations and for the load-balanced operation of the storage cluster. In Section 2.1 we described the Rate Variation (RVS) and the Sequence Variation (SVS) schemes for implementing various interactive operations. In our work, we implement the fast-forward/rewind operation using the SVS schemes, which alter the frame sequence but keep the display rate constant. However, this choice can lead to potential load imbalance situations that adversely affect the QOS guarantees provided to active clients. Consider a simple GSDCL_ks layout with a chunk size k = 1 and stagger distance ks = 1. When a document laid out using such a layout is accessed in the normal playout mode, the frames are retrieved and transmitted in a linear (mod D) order. Thus, for a set Sf of any consecutive D frames (called the "frame set"), the set of nodes Sn (called the "node set") from which these
frames are retrieved contains each node only once. Such a node set that maximizes parallelism is called a balanced node set. A balanced node set indicates that the load on each node, measured in the number of frames, is uniform. However, when the document is accessed in an interactive mode, such as ff or rw, the load-balance condition may be violated. We define the fast forward (rewind) distance df (dr) as the number of frames skipped in a fast forward (rewind) frame sequence. Consider a connection in a system with D = 6 storage nodes, a DCL_1 layout, and a fast forward implementation that skips alternate frames. The frame sequence for normal playout is {0, 1, 2, 3, 4, 5, ...}, whereas for the fast forward the same sequence is altered to {0, 2, 4, 6, 8, 10, ...}. This implies that in this example, the odd-numbered nodes are never visited for frame retrieval during ff. Thus, when a connection is being serviced in ff mode, the load measured in terms of the number of frames retrieved doubles for the even numbered nodes and reduces to zero for the odd numbered nodes. In other words, the parallelism Pf is reduced from D to D/2 during ff of a connection and the concurrency must be proportionately reduced. Clearly, in the presence of a large number of connections independently exhibiting interactivity, this can lead to occasional severe load imbalance in the system and can make it difficult to satisfy the QOS contract agreed upon with each client at the time of connection setup. Thus, if we can ensure that Pf, and consequently Cf, are unaffected during ff or rw, we can guarantee load-balanced operation. One way to do this is to use only those frame skipping distances that do not affect Pf and Cf. We call such skipping distances "safe skipping distances" (SSD). Thus, given a distributed data layout, we want to know in advance all the SSDs it can support.

Table 8.1: Road-map for various analytical results

  Section 8.2 -- Basic equations for the GSDCL_ks(k) layouts
  Section 8.3 -- SSDs for GSDCL_0 layouts
  Section 8.4 -- SSDs for GSDCL_1 layouts
  Section 8.5 -- SSDs for GSDCL_ks layouts

In the following sections we characterize the safe skipping distances for various data layouts in the GSDCL_ks(k) family of layouts. Specifically, we provide analytical results that clearly define the SSDs for GSDCL_ks(k) layouts for arbitrary values of the stagger factor ks. Table 8.1 provides a road-map for our analytical results. In the following discussion, we assume a chunk size k = 1 for our layouts. Later, we will discuss the implications of using a non-unit chunk size k.
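The load imbalance caused by naive frame skipping can be checked numerically. The following Python sketch (the helper `node_load` and the frame counts are our own illustration, not part of the server) assumes a chunk size k = 1 and no stagger, so that frame f resides on node f mod D, and reproduces the D = 6 example: skipping alternate frames concentrates all retrievals on the even numbered nodes and idles the odd ones.

```python
# Per-node retrieval load for a simple distributed cyclic layout with
# chunk size 1 and no stagger: frame f lives on node f mod D.
from collections import Counter

def node_load(D, df, num_frames=60):
    """Count frames retrieved per node when every df-th frame is fetched."""
    frames = range(0, num_frames, df)
    return Counter(f % D for f in frames)

D = 6
print(node_load(D, df=1))  # normal playout: every node carries 10 frames
print(node_load(D, df=2))  # ff by alternate frames: only nodes 0, 2, 4 work
```

Running this shows that during ff the active node set shrinks from six nodes to three, which is exactly the halving of the parallelism Pf described above.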
8.2 Basic Equations

In this section, we will first derive some basic equations which we later use to derive the SSDs for various layouts.

[Figure 8.2: A general distribution cycle with anchor node p. The anchor frame f is placed on node p; frames f+1, ..., f+D-1-p follow on nodes p+1, ..., D-1, and frames f+D-p, ..., f+D-1 wrap around to nodes 0, ..., p-1.]

We will assume that there are in all D storage nodes and that the stagger distance is ks. Recall that a GSDCL layout consists of successive distribution cycles, each with D chunks. The first chunk in a distribution cycle is called an anchor chunk, and the node to which it is assigned is called the anchor node for that distribution cycle. The repeating pattern of D distribution cycles is called a stagger cycle. Consider any arbitrary distribution cycle in a GSDCL layout, with the anchor frame f assigned to node p (0 <= p <= D - 1). Figure 8.2 illustrates such a cycle. The j-th frame in this distribution cycle is then defined as in Equation 8.1.
f_j = f + D + j - p,   0 <= j <= p - 1
f_j = f + j - p,       p <= j <= D - 1          (8.1)
The above equation can be further simplified to Equation 8.2:

f_j = f + (j - p) mod D          (8.2)
The anchor frame in the i-th distribution cycle is iD and it is assigned to node (i*ks) mod D. Substituting this in Equation 8.1, the j-th frame in the i-th distribution cycle of the 0th stagger cycle is given by Equation 8.3:

f_ij = iD + D + j - (i*ks mod D),   0 <= j <= (i*ks mod D) - 1
f_ij = iD + j - (i*ks mod D),       (i*ks mod D) <= j <= D - 1          (8.3)

which simplifies to

f_ij = iD + (j - i*ks) mod D          (8.4)
The l-th frame at node n: Note that by adding kD^2 to the frames in the 0th stagger cycle, the frames in any k-th stagger cycle can be computed. Also, the l-th frame at a node belongs to the k_cyc-th stagger cycle and the k_loc-th distribution cycle, where k_cyc = l div D and k_loc = l mod D. Hence, using Equation 8.3, the ID of the l-th frame at node n is given as

f^l = k_cyc*D^2 + k_loc*D + D + (n - (k_loc*ks mod D)),   0 <= n <= (k_loc*ks mod D) - 1
f^l = k_cyc*D^2 + k_loc*D + (n - (k_loc*ks mod D)),       (k_loc*ks mod D) <= n <= D - 1          (8.5)

However, given lD = ((l div D)*D + (l mod D))*D, the term k_cyc*D^2 + k_loc*D is the same as lD. Also, l*ks mod D = ks*k_loc mod D. So Equation 8.5 can be rewritten as follows:

f^l = lD + D + (n - (l*ks mod D)),   0 <= n <= (l*ks mod D) - 1
f^l = lD + (n - (l*ks mod D)),       (l*ks mod D) <= n <= D - 1          (8.6)

Clearly, Equation 8.6 can be further simplified to

f^l = lD + (n - l*ks) mod D          (8.7)
Frame f to node n mapping: The layout pattern repeats itself after every D cycles. This repeating stagger cycle pattern contains D^2 frames. Hence a frame with ID f belongs to the stagger cycle with ID

C_stg = f div D^2

The ID of this frame within this stagger cycle is given as:

f' = f mod D^2

Each stagger cycle contains D distribution cycles. Therefore, the ID of the distribution cycle to which the frame belongs is i = f' div D. Since f' is less than D^2, 0 <= i <= D - 1. The stagger distance for this distribution cycle is (i*ks) mod D. The ID of the frame within the i-th distribution cycle is given as f' mod D. Hence, the ID of the node n to which the frame f in the stream is assigned is computed as follows:

f' = f mod D^2,   i = f' div D

f |-> { n = [(i*ks mod D) + (f' mod D)] mod D }
  |-> { n = [(i*ks mod D) + ((f mod D^2) mod D)] mod D }
  |-> { n = [(i*ks mod D) + (f mod D)] mod D }          (8.8)

However, note that i*ks mod D can be further simplified as follows:

i = [f - (f div D^2)*D^2] div D
i*ks = (f div D)*ks - (f div D^2)*ks*D
i*ks mod D = [(f div D)*ks] mod D          (8.9)

Substituting Equation 8.9 in Equation 8.8, we get the following result:

f |-> { n = [((f div D)*ks) mod D + f mod D] mod D }
  |-> { n = [f + (f div D)*ks] mod D }          (8.10)

Thus, for any GSDCL layout, Equation 8.7 allows us to compute all the frames at a given node n, whereas Equation 8.10 allows us to find the node to which a frame f in a striped document belongs.
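The two mappings can be cross-checked mechanically. The Python sketch below (our own illustrative helpers `frame_at` and `node_of`, directly transcribing Equations 8.7 and 8.10 with chunk size k = 1) verifies that the node-to-frame and frame-to-node formulas are mutually consistent, and that every distribution cycle of D frames touches each node exactly once.

```python
# Eq. 8.7: the l-th frame stored at node n of a GSDCL layout.
# Eq. 8.10: the node holding frame f.  D nodes, stagger factor ks,
# chunk size k = 1 assumed (a sketch of the chapter's formulas).

def frame_at(n, l, D, ks):
    """ID of the l-th frame assigned to node n (Equation 8.7)."""
    return l * D + (n - l * ks) % D

def node_of(f, D, ks):
    """Node to which frame f is assigned (Equation 8.10)."""
    return (f + (f // D) * ks) % D

D, ks = 8, 3
# The two equations invert each other ...
for n in range(D):
    for l in range(2 * D):
        assert node_of(frame_at(n, l, D, ks), D, ks) == n
# ... and every distribution cycle of D frames hits each node once.
for i in range(D):
    cycle_nodes = {node_of(f, D, ks) for f in range(i * D, (i + 1) * D)}
    assert cycle_nodes == set(range(D))
print("Equations 8.7 and 8.10 agree for D = 8, ks = 3")
```

The same check passes for any D and ks, since within the i-th distribution cycle the node of frame f is simply (f + i*ks) mod D.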
8.3 Safe Skipping Distances for GSDCL with ks = 0

We will now use the basic equations derived earlier to characterize the SSDs for the simplest GSDCL layout, with stagger distance ks = 0. Let us take a closer look at the example GSDCL_0 shown in Figure 8.3. Assume that the fast forward starts from frame 0 with a fast forward distance df of 2. The first D = 5 frames in the fast forward sequence are {0, 2, 4, 6, 8}, which are retrieved from the balanced node set {0, 2, 4, 1, 3}.

[Figure 8.3: Simple GSDCL_0 layout with five nodes. Frames f0-f14 are distributed cyclically over five APIC-attached storage nodes; the anchor frames of every distribution cycle fall on node 0, and the nodes connect to a link interface.]

If the fast forward distance is 3, the node set is altered to the ordered set {0, 3, 1, 4, 2}, which is still balanced. It can easily be verified that the node set is balanced when df = 4, but is unbalanced when df = D = 5 or when df is an integral multiple of D. The following theorem relates D and df explicitly for such GSDCL layouts.

Theorem 1 Given a GSDCL layout over D storage nodes with ks = 0, the following holds true:1

If the fast forward (rewind) distance df (dr) is relatively prime to D, then

1. The set of nodes Sn, from which consecutive D frames in the fast forward (rewind) frame set Sf (Sr) are retrieved, is load-balanced.
2. The fast forward (rewind) can start from any arbitrary frame (or node) number.
Proof: We give a proof by contradiction. Let f be the number of the arbitrary frame from which the fast forward is started. The D frames in the fast forward frame set are then given as:

{f, f + df, f + 2df, f + 3df, ..., f + i*df, ..., f + j*df, ..., f + (D - 1)*df}

Without any loss of generality, assume that two frames, f + i*df and f + j*df, are mapped to the same node np.
1 The result was first pointed out in a different form by Dr. Arif Merchant of the NEC Research Labs, Princeton, New Jersey, during the first author’s summer research internship.
Substituting ks = 0 in Equation 8.7, we can see that the l-th frame at any node n in this layout is given as f^l = n + lD. Since any two frames mapped to the same node differ by an integral multiple of D, we have

(j - i) = kD / df          (8.11)
Two cases that arise are as follows:
Case 1: k is not a multiple of df: If D and df are relatively prime, then kD/df cannot be an integer. However, (j - i) is an integer. Thus, Equation 8.11 cannot be true, which is a contradiction.

Case 2: k is a multiple of df: If this condition is true, then (j - i) = k1*D, where k1 = k/df. However, this contradicts our assumption that the two selected frames are in a set which has only D frames and hence can differ by at most D - 1 in their ordinality.
Since the frame f from which the fast-forward begins is selected arbitrarily, claim 2 in the theorem statement is also justified. The proof for the rewind operation is similar and is not presented here. It is interesting to note that the above theorem is in fact a special case of a basic theorem in abstract algebra which states: "If a is a generator of a finite cyclic group G of order n, then the other generators of G are the elements of the form a^r where gcd(r, n) = 1" [59]. In the scenario described by Theorem 1 above, n = D and the generator a is 1. Under the operation of addition, a^r is df. Thus, as per this basic theorem, 1 (normal playout) and df (fast-forward/rewind) each generate a group (a set) of D nodes, such that all the nodes are covered once. As per this theorem, if D = 6, skipping by all distances that are odd numbers (1, 5, 7, 11, ...) and relatively prime to 6 will result in a balanced node set. We can see that if D is a prime number, then all distances df that are not multiples of D produce a balanced node set. Also, given a value D, there are always some distances df, such as when df is a multiple of D or has a common factor with D, that cannot be safely supported. This suggests that if we want to support a richer choice of interactive fast-forward/rewind speeds, we need to investigate the properties of other GSDCL_ks(k) layouts to find layouts that can support SSDs that are not supported by GSDCL_0. We therefore now look at the load balance properties of GSDCL_1 layouts.
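Theorem 1 is easy to confirm by exhaustive search. The Python sketch below (our own illustrative helpers `node_of`, transcribing Equation 8.10, and `is_safe`) enumerates the safe skipping distances for a GSDCL_0 layout with D = 6 nodes and checks them against the gcd condition of the theorem.

```python
# Brute-force check of Theorem 1 for a GSDCL_0 layout with D = 6 nodes:
# a distance df is safe exactly when gcd(df, D) == 1.
from math import gcd

def node_of(f, D, ks):
    """Node holding frame f (Equation 8.10)."""
    return (f + (f // D) * ks) % D

def is_safe(D, ks, df, starts):
    """df is safe if every run of D consecutive ff frames, from each
    tested starting frame, hits each of the D nodes exactly once."""
    for f0 in starts:
        nodes = {node_of(f0 + i * df, D, ks) for i in range(D)}
        if len(nodes) != D:
            return False
    return True

D = 6  # as in the example of Section 8.1
safe = [df for df in range(1, 3 * D) if is_safe(D, 0, df, range(D))]
print(safe)  # [1, 5, 7, 11, 13, 17]
assert safe == [df for df in range(1, 3 * D) if gcd(df, D) == 1]
```

As predicted, the safe distances are exactly those relatively prime to D, independent of the starting frame.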
8.4 Safe Skipping Distances for GSDCL with ks = 1

[Figure 8.4: Staggered Distributed Cyclic Layout (SDCL_1) with ks = 1. Frames f0-f71 are distributed over eight APIC-attached storage nodes (Node 0-Node 7); each successive distribution cycle is staggered by one node, and eight distribution cycles form a stagger cycle.]
We will illustrate some of the special properties of this layout with an example. Consider the example in Figure 8.4 and a ff implementation that skips alternate frames (that is, df = 2) starting from frame 0. The original frame sequence {0, 1, 2, 3, 4, 5, 6, 7} is then altered to {0, 2, 4, 6, 8, 10, 12, 14}. The node set for this new sequence is altered from the balanced set {0, 1, 2, 3, 4, 5, 6, 7} to {0, 2, 4, 6, 1, 3, 5, 7}. Clearly, this new node set is re-ordered but still balanced. However, if df = 3, the corresponding node set is {0, 3, 6, 2, 5, 0, 4, 7}, which contains 0 twice and hence is unbalanced. It can be verified that the cases df = 4, df = 8 and df = mD, where m is relatively prime to D, produce balanced node sets as well. Note here that 2 and 4 are factors of D = 8, but 3 is not. We formalize
these observations in the following theorem, which defines the SSDs for fast-forward/rewind on GSDCL_1 layouts.

Theorem 2 Given a GSDCL layout with ks = 1 over D storage nodes, and numbers d1, d2, d3, ..., dp that are factors of D, the following holds true:

Load balance condition for fast forward: If the fast forward starts from any frame f with a fast forward distance df, then the node set Sn is load-balanced, provided:

1. df = di (where 1 <= i <= p) and the fast-forward starts from a frame f0 = aD + b, where 0 <= b < df, or
2. df = mD, where m and D are relatively prime, or
3. df = x + kD^2 (k > 0), for every x that produces a balanced node set.

Load balance condition for rewind: The same result holds true for rewind with a skipping distance dr = df.
Proof:

Case I: df is a factor of D.

If the fast-forward/rewind starts at frame f0, then the first D frames in the fast-forward frame set are given as

Sf = {f0 + i*df : 0 <= i < D}          (8.12)

The set Sn of nodes to which the frames in Sf map can be found using Equation 8.10 as

Sn = {((f0 + i*df) + (f0 + i*df) div D) mod D : 0 <= i < D}          (8.13)

If f0 = aD + b, where 0 <= b < df, then

(f0 + i*df) div D = f0 div D + (i*df) div D          (8.14)

Therefore, Equation 8.13 can be rewritten as

Sn = {((f0 + i*df) + f0 div D + (i*df) div D) mod D : 0 <= i < D}
   = {((f0 + f0 div D) + (i*df + (i*df) div D)) mod D : 0 <= i < D}
   = {((f0 + f0 div D) mod D + (i*df + (i*df) div D) mod D) mod D : 0 <= i < D}          (8.15)
Note that in Equation 8.15, the term f0 + f0 div D is a constant and the term (i*df + (i*df) div D) mod D defines a perfect shuffle [62]. A perfect shuffle sigma_{a,b} is a function from {0, ..., ab - 1} to the same set that permutes its order. It is defined as sigma_{a,b}(i) = (bi + p) mod ab, where p = i div a. Clearly, the node sequence {(i*df + (i*df) div D) mod D : 0 <= i < D}, where df divides D, is the perfect shuffle sigma_{D/df, df}. To see this, substitute a = D/df and b = df in sigma_{a,b}(i), and note that p = i div (D/df) = (i*df) div D. A set subjected to a perfect shuffle results in a new set with the same members but in a different order. Also, note that adding any constant inside the mod D expression (i*df + (i*df) div D) mod D "rotates" the shuffle, but it still remains a perfect shuffle. This implies that any node number n, 0 <= n <= D - 1, appears only once in the node set Sn and thus, Sn is a balanced node set. Also, note that if the starting frame is not of the form f0 = aD + b, where 0 <= b < df, such a frame will be reached within t < D frames after starting the fast-forward with skipping distance df. From thereon, every group of D frames in the fast-forward sequence will be from a balanced node set Sn. The same property holds for rewind with the same skip distance (that is, dr = df) with f0 = aD - b, where 0 <= b < df.

Case II: df = mD, where m is non-zero and relatively prime with D (that is, gcd(m, D) = 1).

Again, the fast-forward frame set Sf and the corresponding node set Sn are defined by Equations 8.12 and 8.13. In this case, the term (f0 + i*df) div D can be rewritten as follows:

(f0 + i*df) div D = (f0 + imD) div D = f0 div D + im          (8.16)

Thus, Equation 8.15 can be rewritten as follows:

Sn = {((f0 + imD) + f0 div D + (imD) div D) mod D : 0 <= i < D}
   = {((f0 + f0 div D) + (imD + (imD) div D)) mod D : 0 <= i < D}
   = {((f0 + f0 div D) mod D + (imD + im) mod D) mod D : 0 <= i < D}          (8.17)
   = {((f0 + f0 div D) mod D + (im) mod D) mod D : 0 <= i < D}
   = {(a + im) mod D : 0 <= i < D}, where a = (f0 + f0 div D) mod D          (8.18)
However, from a well known result in number theory, if gcd(m, D) = 1, then {(im) mod D : 0 <= i < D} = {i : 0 <= i < D} [59]. Also, the addition of a constant a to (im) mod D merely permutes this sequence. Hence, Sn is a balanced node set.

Case III: df = x + kD^2 (k > 0), where x leads to a balanced node set.

We substitute df = x + kD^2 in Equation 8.13 to get Equation 8.19:

Sn = {((f0 + ix + ikD^2) + (f0 + ix + ikD^2) div D) mod D : 0 <= i < D}
   = {((f0 + ix) + (f0 + ix) div D) mod D : 0 <= i < D}          (8.19)

Comparing Equation 8.19 with Equation 8.13, we can see that the two are the same when x = df. This implies that df and df + kD^2 generate the same node set Sn. Hence, even in this case the node set Sn will be balanced. The same argument also holds for rewind operations. Thus, from cases I, II, and III, we can see that using the fast-forward (df) and rewind (dr) distances described in this theorem always results in a load balanced operation of the storage cluster. Hence, the proof.

Let us consider an example layout with D = 16 nodes and see which distances df <= D are safe. If we use the GSDCL_0 layout, all distances that are relatively prime to 16 are safe distances. Therefore, df = 3, 5, 7, 9, 11, 13, 15 are safe distances. On the other hand, if we use the GSDCL_1 layout, df = 2, 4, 8, 16 are safe skipping distances. Clearly, distances which are safe for GSDCL_1 are unsafe for GSDCL_0 layouts. This suggests that if we have extra storage at our disposal, we can use both layouts to maximize the choice of fast-forward/rewind distances offered to the MOD clients.
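The complementarity of the two layouts for D = 16 can be verified exhaustively. The Python sketch below (our own illustrative helpers; `node_of` transcribes Equation 8.10, and the start frames for GSDCL_1 are restricted to the form f0 = aD + b with 0 <= b < df, per condition 1 of Theorem 2) computes the safe distances up to D for both layouts.

```python
# Safe skipping distances up to D for GSDCL_0 vs GSDCL_1, D = 16.
from math import gcd

def node_of(f, D, ks):
    """Node holding frame f (Equation 8.10)."""
    return (f + (f // D) * ks) % D

def is_safe(D, ks, df, starts):
    """True if D consecutive ff frames from every tested start frame
    cover all D nodes exactly once."""
    for f0 in starts:
        nodes = {node_of(f0 + i * df, D, ks) for i in range(D)}
        if len(nodes) != D:
            return False
    return True

D = 16
# GSDCL_0 (Theorem 1): any start frame; safe iff gcd(df, D) == 1.
safe0 = [df for df in range(2, D + 1) if is_safe(D, 0, df, range(D))]
# GSDCL_1 (Theorem 2, condition 1): df a factor of D, start frames
# restricted to f0 = a*D + b with 0 <= b < df.
factors = [df for df in range(2, D + 1) if D % df == 0]
safe1 = [df for df in factors
         if is_safe(D, 1, df,
                    [a * D + b for a in range(D) for b in range(df)])]
print(safe0)  # [3, 5, 7, 9, 11, 13, 15]
print(safe1)  # [2, 4, 8, 16]
assert not set(safe0) & set(safe1)  # the two layouts complement each other
```

The two lists are disjoint, confirming that storing a document under both layouts widens the menu of safe fast-forward/rewind speeds.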
8.5 Safe Skipping Distances for GSDCL with arbitrary ks

To study the load balance properties of the generalized GSDCL_ks layouts, consider a GSDCL layout with eight nodes (D = 8) and a stagger distance of 3 (ks = 3), illustrated in Figure 8.5. Let the fast forward distance df be four frames (df = 4), which is a factor of D = 8.

[Figure 8.5: Generalized Staggered Distributed Layout with ks = 3. Frames f0-f71 are distributed over eight APIC-attached storage nodes (Node 0-Node 7); each successive distribution cycle is staggered by three nodes, and eight distribution cycles form a stagger cycle.]
Assume that the fast forward starts from frame f = 8. Then the fast forward frame set is Sf = {8, 12, 16, 20, 24, 28, 32, 36}. The set of nodes from which these frames are retrieved is Sn = {3, 7, 6, 2, 1, 5, 4, 0}, which is a balanced node set. On the contrary, a fast forward starting at the same frame with a distance df = 3 produces the node set Sn = {3, 6, 1, 7, 2, 5, 3, 6}, which is unbalanced. Similarly, it can be verified that df = 2, 8 produce balanced node sets, but df = 5, 6, 7 do not. Now let us consider a GSDCL with ks = 2, shown in Figure 8.6. If the fast forward begins at frame f = 16 with a distance df = 2, the corresponding node set {4, 6, 0, 2, 6, 0, 2, 4} is unbalanced. Similarly, distances df = 3, 4, 5, 6, 7, 8 produce unbalanced node sets. From these examples we conjecture that if the stagger distance ks is relatively prime
[Figure 8.6: Generalized Staggered Distributed Layout with ks = 2. Frames f0-f71 are distributed over eight APIC-attached storage nodes (Node 0-Node 7); each successive distribution cycle is staggered by two nodes, and eight distribution cycles form a stagger cycle.]
with the number of nodes, then a result similar to the one stated in Theorem 2 is possible. Specifically, we conjecture that the following result can be proved.

Conjecture 1 Given a GSDCL layout with a stagger distance ks over D storage nodes, and numbers d1, d2, d3, ..., dp that are factors of D, the following holds true:

1. Load balance condition for fast-forward: If the fast forward always starts from an anchor frame, with a fast forward distance df, the node mapping set Sn is load-balanced, provided:
   (a) ks and D are relatively prime, and
   (b) df = di (1 <= i <= p), or
   (c) df = mD, where m and D are relatively prime, or
   (d) df = di + kD^2 (k > 0).

2. Load balance condition for rewind: A similar constraint, defined in terms of the anchor frame value, ks and D, holds for the rewind operation.
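While we do not prove the conjecture here, it can be checked empirically for a given instance. The Python sketch below (our own illustrative helpers; `node_of` transcribes Equation 8.10, and `safe_from_anchors` starts the fast forward only from anchor frames f = iD, as the conjecture requires) tests the D = 8, ks = 3 example against the distances predicted by conditions (b) and (c); condition (d) involves distances beyond the tested range.

```python
# Empirical check of Conjecture 1 for D = 8, ks = 3 (gcd(ks, D) == 1).
from math import gcd

def node_of(f, D, ks):
    """Node holding frame f (Equation 8.10)."""
    return (f + (f // D) * ks) % D

def safe_from_anchors(D, ks, df):
    """Balanced node set for D ff frames starting from every anchor
    frame f = i*D of one stagger cycle."""
    for i in range(D):
        nodes = {node_of(i * D + j * df, D, ks) for j in range(D)}
        if len(nodes) != D:
            return False
    return True

D, ks = 8, 3
observed = {df for df in range(1, 2 * D + 1) if safe_from_anchors(D, ks, df)}
# Distances predicted safe by conditions (b) and (c); condition (d),
# df = di + k*D^2, lies beyond the tested range.
predicted = {df for df in range(1, 2 * D + 1)
             if D % df == 0                                   # (b)
             or (df % D == 0 and gcd(df // D, D) == 1)}       # (c)
print(sorted(observed))  # [1, 2, 4, 8]
assert observed == predicted
```

For this instance the observed and predicted sets coincide exactly, which is consistent with, though of course not a proof of, the conjecture.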
8.6 Implications of MPEG

The common media streams used in future multimedia applications will most likely be compressed using a compression scheme such as MPEG. A typical MPEG compressed video stream, illustrated in Figure 8.7, consists of a succession of Groups-Of-Pictures (GOPs). The size of the GOP in terms of the number of frames may vary over the duration of the document. Each GOP in itself consists of I, P and B frames and has, for example, a structure such as [IBBPBBPBB]. The two parameters N and M shown in Figure 8.7 define the structure of the GOP.

[Figure 8.7: Structure of an MPEG stream. The sequence is a succession of GOPs; each GOP here has the nine-frame pattern I B B P B B P B B, with M = 3 and N = 9.]

The I frames are coded using a Discrete Cosine Transform (DCT) without reference to any other pictures and hence can be decoded independently. Thus, they can be treated as anchor frames in the stream from which decoding can begin. However, intra-coding achieves only moderate compression and hence, I frames are very content intensive. The P frames are encoded using motion compensated prediction based on past I or P frames. They are normally used for further prediction. The B frames achieve the highest compression and hence, have the smallest content. They are encoded using non-causal prediction from the
past and the future reference frames (i.e., I and/or P frames). Clearly, P and B frames cannot be transmitted independently and cannot be used as reference frames. Empirical evidence shows that in a typical MPEG stream, depending upon the scene content, the I to P frame size variability is about 3:1, whereas the P to B frame variability is about 2:1. Thus, the MPEG stream is inherently a variable bit rate (VBR) stream. When retrieving such a VBR stream, the load on a storage node varies depending upon the granularity of retrieval. If a node is performing frame-by-frame retrieval, the load on a node retrieving an I frame is 6-8 times that on a node retrieving a B frame. Hence, it is necessary to ensure that certain nodes do not always fetch I frames while others fetch only B frames. The variability of load at the GOP level may be much less than at the frame level, and hence selecting appropriate data layout and retrieval units is crucial. In the presence of concurrent clients, it is likely that each storage node can occasionally suffer overload, forcing prefetch deadlines to be missed and thus requiring explicit communication between the storage nodes to notify such events. Another problem posed by MPEG compression is that it introduces inter-frame dependencies and thus does not allow frame skipping at an arbitrary rate. This in effect means that fast-forward/rewind by frame skipping can be realized only for a few rates. For example, the only valid fast forward frame sequences are [IPP IPP ...] or [IIII ...]. There are two problems with sending only I frames on fast-forward/rewind. First, it increases the network and storage bandwidth requirements at least three to four times. For example, an MPEG compressed NTSC video connection that requires 5 Mbps average playback bandwidth would require approximately 15 Mbps for the entire duration of the fast-forward/rewind operation if only I frames are transmitted.
In the presence of many such connections, the network will not be able to easily meet such dramatic increases in bandwidth demand. Another problem is that if the standard display rate is maintained, skipping all frames between consecutive I frames may make the effective fast forward rate unreasonably high. For example, if the I-to-I separation is 9 frames, the perceived fast forward rate will be 9 times the normal playout rate. The two ways to rectify this problem are as follows:
Store an intra-coded version of the movie along with the inter-frame coded version: This option offers unlimited fast forward/rewind speeds; however, it increases the storage and throughput requirements. Three optimizations can alleviate this problem to some extent: 1) Reduce the quantization factor for the intra-coded version, though this may lead to loss of detail. 2) Reduce the display rate ("temporal down-sampling"); however, this may cause jerkiness. 3) Store spatially down-sampled versions of the frames, that is, reduce the resolution of the frames. This requires the frames to be up-sampled in real-time at the client using up-sampling hardware. We believe that all three optimizations will be necessary, especially at high fast-forward/rewind rates, to keep the network and storage throughput requirements unaltered.
Use the inter-frame coded version, but instead of skipping frames skip chunks: In this option, the skipping granularity is increased to chunks instead of frames. For example, a chunk can be a GOP. This option has the advantage of keeping the average bandwidth requirements nearly unchanged, and is therefore quite attractive. Also, all the results mentioned in Sections 8.3, 8.4 and 8.5 for frame skipping on GSDCL_ks(1) layouts with arbitrary stagger distance apply to chunk skipping over GSDCL_ks(k) layouts. However, the visual quality of such chunk skipping is likely to be unacceptable at large chunk sizes.
8.7 Related Work

This section briefly presents some of the related work. Keeton et al. discuss schemes for the placement of sub-band encoded video data, in units of constant playout length, on a two dimensional disk array [68]. They report simulation results which conclude that the storage of multi-resolution video permits service to more concurrent clients than the storage of single resolution video. Similarly, Zakhor et al. report the design of schemes for placing scalable sub-band encoded video data on a disk array. They focus only on the path from the disk devices to the memory and evaluate, using simulation, layouts that use constant data or time length units. However, this work does not address issues in the implementation of interactive operations. Chen et al. report data placement and retrieval schemes for an efficient implementation of ff and rw operations in a disk array based video server [33]. Our work is completely independent of and concurrent with this work, and has both similarities and differences [29, 21]. Chen et al.'s paper assumes a small scale disk array based server, whereas our work assumes a large scale server with multiple storage nodes, each with a disk array. They define an MPEG specific data layout unit called a segment, which is a collection of frames between two consecutive I frames. Our definition of a chunk is a data unit which requires a constant playout time at a given frames/sec rate. So a segment in Chen et al.'s scheme is a special case of our chunk. They discuss two schemes for segment placement: in the first scheme, the segments are distributed on the disks in a round robin fashion, in much the same way as
our DCL_k layout over multiple storage nodes. For the ff/rw operation, they employ a segment selection method which ensures that over a set of retrieval cycles, each disk is visited only once. Thus, here the load balance is achieved over multiple retrieval cycles. In the second segment placement scheme, the segments are placed on the disk array in such a way that for certain fast forward rates, the retrieval pattern for each round contains each disk only once. Our GSDCL_{ks=1}(k) layout over storage nodes with a stagger distance of one is similar to this second segment placement scheme. However, our result is more general, as it characterizes many more safe skipping rates for fast-forward and gives a condition for a safe implementation of the rewind operation.
8.8 Summary

In this chapter, we investigated the load-balance properties of distributed data layouts for MARS like clustered multimedia storage servers constructed out of distributed storage nodes interconnected by a high speed network. We illustrated a family of hierarchical, distributed layouts called Generalized Staggered Distributed Cyclic Layouts (GSDCL_ks) that use constant time length logical units called chunks. We defined and proved a load-balance property that is required for an efficient implementation of playout control operations such as fast-forward and rewind.
Chapter 9

Conclusions and Future Work

In this dissertation we addressed the challenging problem of designing web based scalable MOD services and servers. Specifically, we proposed that using (1) emerging broadband networking technologies such as ATM, (2) commodity components such as PCs, disk arrays, and hardware multimedia devices, and (3) existing software systems such as the BSD UNIX operating system and web servers, enhanced suitably to handle multimedia data, we can build scalable servers and services. We proved our thesis with extensive design and prototyping of scalable MOD services and servers running on a 4.4 BSD UNIX server OS enhanced to support QOS guarantees and high performance. In this chapter, we summarize the contributions of our research and outline several directions for future work.
9.1 Contributions

The main contributions of our research are: (1) a simple extended web framework that separates the data and control paths to enable the design of high bandwidth web based MOD services with a simple, flexible and easy-to-use interface; (2) a high performance zero-copy buffering system for disk-to-network data transfers and a storage system with QOS guarantees in the server OS; and (3) a scalable storage server architecture and associated distributed data layout and scheduling schemes that support high concurrency, parallelism and scalable data access. Useful outcomes of our research are a working prototype of a high performance MOD server and services and an enhanced 4.4 BSD UNIX OS kernel. Our experimental servers and services have been deployed for routine use in our ATM testbed and serve as a platform for other projects on advanced MOD services [87, 114]. In the following, we describe our contributions along with specific results.
9.1.1 Web based MOD services

We designed and prototyped two basic interactive MOD services, namely a recording service for content creation and a fully interactive playback service for content access. We separated the high bandwidth data path that needs QOS guarantees from the low bandwidth control path, and used the web HTTP protocol only for control path operations. Our approach has already been validated by commercial low bandwidth MOD playback applications such as RealVideo, NetShow and VxTreme. We also demonstrated that the MOD playback functionality can be either integrated with traditional web servers or implemented as a stand-alone server.
9.1.2 Enhancements to 4.4 BSD UNIX Server OS

We proved that QOS guarantees for multimedia data do not necessarily require the use of a real-time OS, and that a general purpose OS such as 4.4 BSD UNIX can be enhanced to provide soft-real-time guarantees and high performance. We achieved this with the design, implementation and performance evaluation of the following OS enhancements to a public domain NetBSD UNIX OS:
- Adapting a novel real-time concurrency mechanism called real-time upcalls (RTU) to provide guaranteed CPU access to MOD servers and services.

- A new buffering system that provides a zero copy data path from disk to network devices and yields significant (40%) throughput improvements with fast storage devices. To the best of our knowledge, our mmbuf system is the first such zero copy data path implementation for a 4.4 BSD class of OS.

- An enhanced storage system with support for Deficit Round Robin (DRR) fair queuing over multiple priority queues that provides effective bandwidth sharing between real-time and non-real-time streams. We demonstrated that dynamic allocation of storage resources can be realized using simple ioctl() calls.

- A novel system call API that allows aggregation of I/O requests and minimizes system call overheads. We demonstrated excellent improvement in CPU availability with this new API.
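The DRR discipline at the heart of the storage enhancement is compact enough to sketch. The model below runs at user level purely for illustration (the real implementation lives in the SCSI driver); the queue contents and quanta are invented numbers, and each request is represented only by its size:

```python
from collections import deque

def drr_schedule(queues, quanta, rounds):
    """Deficit Round Robin over per-class request queues.
    queues: class name -> deque of request sizes (e.g. KB per disk request)
    quanta: class name -> quantum credited to the class each round
    Returns the dispatch order as (class, size) pairs."""
    deficit = {cls: 0 for cls in queues}
    dispatched = []
    for _ in range(rounds):
        for cls, q in queues.items():
            if not q:
                deficit[cls] = 0          # an idle class accrues no credit
                continue
            deficit[cls] += quanta[cls]
            # Serve requests while the accumulated credit covers them.
            while q and q[0] <= deficit[cls]:
                size = q.popleft()
                deficit[cls] -= size
                dispatched.append((cls, size))
    return dispatched

# Real-time gets twice the quantum of best-effort, hence ~2/3 of the bandwidth.
queues = {"rt": deque([64] * 8), "be": deque([64] * 8)}
order = drr_schedule(queues, {"rt": 128, "be": 64}, rounds=4)
assert [cls for cls, _ in order[:3]] == ["rt", "rt", "be"]
```

Because the bytes dispatched per class track its quantum regardless of individual request sizes, real-time streams keep their bandwidth share even when best-effort requests are large, which is exactly the isolation property the enhanced driver relies on.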
Our prototype single node MOD playback server that uses these enhancements supports in excess of 70 Mbps with 5 disks and can easily saturate a 155 Mbps OC-3 ATM link with a faster storage system and a better network interface. It provides excellent QOS guarantees with minimal CPU usage. We believe that with faster 400 MHz PCs, faster disk arrays, and better ATM interfaces such as the APIC [46], we can easily support a 200 Mbps MOD server without any modifications to our software systems.
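The benefit of the stream API's request aggregation can be seen with a back-of-the-envelope cost model: each system call pays a fixed user-kernel crossing cost that aggregation amortizes over a batch of read-send requests. All costs below are illustrative placeholders, not measurements from our prototype:

```python
import math

def total_overhead_us(num_requests, syscall_cost_us, per_request_cost_us,
                      batch_size):
    """Rough cost of issuing num_requests read-send requests when each
    system call carries up to batch_size of them: the fixed crossing cost
    is paid once per call, the per-request work is paid regardless."""
    calls = math.ceil(num_requests / batch_size)
    return calls * syscall_cost_us + num_requests * per_request_cost_us

# One request per call pays 64 crossings; eight per call pays only 8.
unbatched = total_overhead_us(64, 50, 10, batch_size=1)  # 64*50 + 64*10
batched = total_overhead_us(64, 50, 10, batch_size=8)    #  8*50 + 64*10
assert unbatched == 3840 and batched == 1040
```

The per-request work is unchanged; only the fixed crossing cost shrinks, which is why the gain from aggregation shows up mainly as improved CPU availability.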
9.1.3 Scalable Storage Server and Services Architecture

We proposed and prototyped a novel distributed storage architecture to address the problem of scalability of storage servers and services. Our architecture consists of four main components: a high speed server interconnect, a set of storage nodes, a central manager and a distributed control protocol. Our high performance storage node is constructed using a PC that runs the enhanced UNIX OS and uses off-the-shelf network and storage sub-systems. A high speed ATM based interconnect transparently interfaces several such storage nodes to an external ATM network. Our use of a scalable ATM interconnect and the on-going improvements in CPU speed, memory, I/O interconnect, and storage bus bandwidth guarantee a scalable architecture. The central manager in our architecture controls the high bandwidth data path and may provide added functions such as admission/access control, billing, accounting, and database services. It maintains a master-slave relationship with the storage nodes and implements a control protocol to co-ordinate high bandwidth data transfers. We designed distributed data layouts called Generalized Staggered Distributed Cyclic Layouts (GSDCL) that break high bandwidth multimedia data into constant-time-length chunks and distribute them over multiple nodes to achieve high parallelism and concurrency in data accesses. We also developed a distributed control protocol called Beat Directed Scheduling (BEADS) that uses periodic timing information from a master to co-ordinate data flow from the storage nodes. We have prototyped the distributed striping and playback services and associated client applications. We demonstrated that such distributed data layout and scheduling benefit from the presence of a reasonably sized buffer (1 MB) at the client end.
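One simplified reading of the staggering idea (the GSDCL family is more general than this sketch, and the placement rule below is an illustrative stand-in, not the dissertation's definition) is: within each cycle of N chunks, consecutive chunks go to consecutive nodes, and the starting node of each new cycle is shifted by the stagger distance:

```python
def staggered_cyclic_layout(num_chunks, num_nodes, stagger):
    """Illustrative staggered distributed cyclic layout: chunk i goes to
    node (i + stagger * (i // num_nodes)) % num_nodes, i.e. consecutive
    chunks visit consecutive nodes and each cycle's starting node is
    shifted by `stagger`.  (A simplified stand-in for GSDCL.)"""
    return [(i + stagger * (i // num_nodes)) % num_nodes
            for i in range(num_chunks)]

layout = staggered_cyclic_layout(12, 4, stagger=1)
# Cycle starts rotate: 0,1,2,3 / 1,2,3,0 / 2,3,0,1 -- every cycle still
# touches each node exactly once, so normal playout stays balanced.
assert layout == [0, 1, 2, 3, 1, 2, 3, 0, 2, 3, 0, 1]
```

Intuitively, the rotation of cycle starts is what admits additional safe fast-forward strides beyond those of the unstaggered cyclic layout.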
9.1.4 Load Balance Properties of Distributed Layouts

We defined the concept of Safe Skipping Distances for the implementation of interactive operations, which is crucial to the load balanced operation of distributed storage cluster based MOD servers. We analyzed the load balance properties of a family of GSDCL data layouts and provided concrete safe skipping distances for them. This analysis enables a MOD server designer to balance storage space requirements, richness of interactive control speeds, and simplicity of distributed scheduling.
9.2 Future Directions

The MOD services, the server OS and the server architecture described in the previous sections can be extended in several ways. In the following we discuss these extensions.
9.2.1 MOD services

The MOD services described in this dissertation can be enhanced in the following ways:

Advanced MOD services: We described two example MOD services, namely a recording service for content creation and an interactive playback service for content access. New content creation and access services can be built by modifying the control path of these basic services. For example, the basic recording service can be enhanced to allow users to compose complex multimedia documents from simple documents recorded using the recording service. Similarly, simple playback services such as periodic broadcast and pay-per-view, and advanced playback services such as interactive movies, orchestrated presentations, and personalized agent assisted news programs, can be built by enhancing the control path of the basic playback service.

Enhancing the data path: The MOD services we prototyped support only MJPEG video and ATM based data paths. They need to be extended to handle new data types such as MPEG, RealVideo, and NV video, and to support IP based data paths that employ transport protocols such as RTP. Also, the current service implementations need to be integrated with ATM network signaling to allow dynamically switched VCs.

Client devices: Our MOD services make use of a special multimedia device called MMX that can be directly connected to an ATM network and controlled from a host workstation or a PC. Newer hardware multimedia devices are designed as standard I/O devices that use a standard network interface on the host for networked multimedia data. Also, with faster CPUs on host machines, software client devices deliver good performance and are an increasingly inexpensive and attractive option. Such client devices need to be integrated into our service prototype.
Client applications: Our client applications that access the existing MOD services run as web-browser helper applications, and once they are activated the web browser has no control over them. It is desirable that these applications run in the graphical context of the web browser and be seamlessly integrated to provide a uniform web experience. This can be achieved by implementing the client applications using JavaScript or ActiveX plug-in frameworks.
9.2.2 Server OS enhancements

The extensions to 4.4 BSD OS we proposed and implemented can be enhanced in the following ways:
Extensions to the MMBUF system: Our existing mmbuf system implementation supports only a disk-to-network read data path. The reverse write data path, from the network interface to the disk, is equally crucial for content creation services. We believe that the modifications to the existing data path to realize this new functionality are straightforward. Specifically, when data is received over a network connection, it is copied into an mbuf chain which contains physically and virtually non-contiguous pages. To implement the mmbuf based write data path, virtual memory functions need to be used to remap these pages to make them virtually contiguous so that they can be passed as a single chunk to the disk driver. The current implementation of the MMBUF system uses a fixed 16 KB external cluster buffer for each mmbuf. With small data reads, this fixed size can lead to fragmentation and underutilization of memory resources. Newer virtual memory implementations such as UVM [39] that allow chunks of an arbitrary number of pages can alleviate this problem. The current mmbuf system and the stream API to access it have been successfully integrated with the NATM protocol. They need to be integrated with the TCP and UDP protocols so that they can be used by general applications such as standard network file servers (NFS) and other applications that use the disk-to-network data path.

QoS in the storage system: Our existing implementation of QOS guarantees in the SCSI storage system supports only two priority classes and has two limitations. First, the DRR fair queuing algorithm, though simple to implement and able to provide good bandwidth guarantees, gives suboptimal delay guarantees even with a constant bandwidth server. We believe that newer fair queuing algorithms such as WF2Q [17], which provide better delay bounds, need to be evaluated for fair queuing with disk devices, commonly modelled as non-linear and variable bandwidth servers. Second, in our current prototype, the requests in the real-time queue are sorted using an ordinary SCAN algorithm which is oblivious to request deadlines and can lead to priority inversion. Newer disk scheduling algorithms [96, 102] rectify this limitation. Using a fair queuing algorithm together with one of these new disk scheduling algorithms will enable the SCSI system to provide better delay guarantees and in turn reduce the worst case pre-fetching buffer requirements for applications such as MOD servers.

File system: Our current prototype employs the FFS file system for storing data and meta-data information. The FFS disk block allocation and data placement policies are suboptimal for storing large multimedia files. New file stores such as CMFS and SYMPHONY [102] rectify these limitations. Also, separating meta-data and data files onto different file stores can minimize the interaction of low-bandwidth yet time-critical meta-data accesses with high bandwidth data accesses and enhance the performance of our servers.

Distributed services and servers: Our current distributed server prototype provides suboptimal performance due to a lack of buffers at the client end. New buffered clients need to be integrated into our prototype to demonstrate the extensive scalability that can be achieved with the distributed architecture. Also, our prototype needs to be enhanced to exploit ATM switch support for many-to-one multicast [57, 109] and to support an IP based data path.
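As a concrete example of a deadline-aware alternative to plain SCAN, the SCAN-EDF discipline of Reddy and Wyllie [96] orders requests primarily by deadline and resorts to seek order only among requests that share a deadline. A simplified single-sweep sketch, with invented (deadline, track) pairs:

```python
def scan_edf_order(requests):
    """SCAN-EDF ordering: earliest deadline first; among requests with the
    same deadline, serve in track order so the head sweeps rather than
    thrashing.  requests: list of (deadline, track) pairs.  (Simplified:
    a full implementation also tracks the head's sweep direction.)"""
    return sorted(requests, key=lambda r: (r[0], r[1]))

requests = [(20, 900), (10, 500), (10, 100), (20, 300), (10, 700)]
# All deadline-10 requests are swept in track order before any deadline-20
# work, so an urgent request can no longer be starved by a long SCAN pass.
assert scan_edf_order(requests) == [
    (10, 100), (10, 500), (10, 700), (20, 300), (20, 900)]
```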
9.3 Final Remarks

Using extensive design and prototyping of real systems, this dissertation demonstrated that scalable, cost-effective MOD servers and services can be built using off-the-shelf hardware and software components. Our software systems, such as the client applications, the MOD servers and the enhanced 4.4 BSD UNIX, are technology independent and will thus benefit from continuing improvements in CPU, memory and storage speeds. Also, our scalable server architecture and the associated distributed layout and control protocol, though described in the context of an end-to-end ATM environment, are equally valid for other environments. In an ATM environment, improvements in many-to-one multicast, and end-system OS support for network signaling and NATM transport protocols, will enhance the performance of our recording and playback servers. Our experiences with 4.4 BSD UNIX demonstrate that a general purpose OS can indeed be used to provide QOS guarantees to multimedia applications and services. We also learned that hardware client devices provide high quality multimedia but need a playout buffer of reasonable size to reduce server complexity and simplify audio-video synchronization. We believe that rapid advances in routing, switching and access-from-home technologies, coupled with the ever increasing demand for higher bandwidth and better performance, will lead to faster internet backbone and access networks. This will enable newer high bandwidth multimedia-on-demand services, the majority of which will involve stored media. High performance, scalable MOD servers that provide such services will therefore be critical components of the future internet. The prototype high performance, scalable MOD server and services that we demonstrated in this dissertation should be regarded as proof-of-concept systems built in the context of ATM networks. Several problems such as QOS renegotiation, resource usage monitoring, pricing, billing, access control, database services, security and privacy need to be carefully investigated, and solutions from these areas need to be integrated with our system into a coherent whole. We believe that our experimental system will therefore serve as an excellent high performance platform to conduct exciting research in these fields.
References

[1] Intel 440BX AGPset, http://developer.intel.com/design/agpsets/440bx/index.htm.
[2] Intel Embedded Processor Module, http://developer.intel.com/design/intarch/TOP800.htm.
[3] NCSA HTTPd web page, http://hoohoo.ncsa.uiuc.edu/.
[4] Network Wizards Internet Domain Survey, http://www.nw.com/zone/WWW/, April 1997.
[5] VxTreme encoder and player, http://www.microsoft.com/netshow/vxtreme.
[6] "Request for Information (RFI) on Digital Media Servers," Cable Labs Inc.
[7] Statistical Abstract of the United States, U.S. Bureau of Census, 113th ed., 1993.
[8] Real Player, web site of Progressive Networks, http://www.real.com/.
[9] Real Player, web site of Progressive Networks, http://www.real.com/.
[10] MAGIC Media Server: A Scalable and Cost Effective Video Server, Sarnoff Real Time Corporation, Princeton, NJ 08543.
[11] Whittaker Media Server, Whittaker Communication Corporation, Oregon.
[12] Oracle/NCube media server, Oracle Corporation, CA.
[13] Anderson, T. E., et al., "Scheduler Activations: Effective Kernel Support for User Level Management of Parallelism," ACM Transactions on Computer Systems, 1992, pp. 53-79.
[14] Berners-Lee, T., Cailliau, R., Groff, J. F., and Pollermann, B., "World-Wide Web: The Information Universe," Electronic Networking: Research, Applications, and Policy, No. 1, Meckler Publishing, Westport, CT, Spring 1992, pp. 52-58.
[15] Bernhardt, C., and Biersack, E., "A Scalable Video Server: Architecture, Design and Implementation," Proceedings of the Real-time Systems Conference, Paris, France, pp. 63-72, January 1995.
[16] Bernhardt, C., and Biersack, E., "The Server Array: A Scalable Video Server Architecture," to appear in High-Speed Networks for Multimedia Applications, A. Danthine, D. Ferrari, O. Spaniol, and W. Effelsberg, eds., Kluwer Academic Press, 1996.
[17] Bennett, J. C. R., and Zhang, H., "WF2Q: Worst-case Fair Weighted Fair Queueing," Proceedings of INFOCOM'96, March 1996.
[18] Bobrow, D. G., "Tenex, a Paged Time Sharing System for the PDP-10," Communications of the ACM, Vol. 15, No. 3, pp. 135-143, March 1972.
[19] Brustoloni, J. C., and Steenkiste, P., "Evaluation of Data Passing and Scheduling Avoidance," Proceedings of NOSSDAV97, St. Louis, MO, pp. 101-111, May 19-21, 1997.
[20] Bolosky, W., et al., "The Tiger Video File-server," Proceedings of NOSSDAV96, Zushi, Japan, pp. 97-104, April 23-26, 1996.
[21] Buddhikot, M., and Parulkar, G. M., "Distributed Scheduling, Data Layout and Playout Control in a Large Scale Multimedia Storage Server," Technical Report WUCS-94-33, Department of Computer Science, Washington University in St. Louis, September 1994.
[22] Buddhikot, M., Parulkar, G., and Cox, J. R., Jr., "Design of a Large Scale Multimedia Storage Server," Journal of Computer Networks and ISDN Systems, pp. 504-517, December 1994.
[23] Buddhikot, M., Parulkar, G., and Cox, J. R., Jr., "Design of a Large Scale Multimedia Storage Server," Journal of Computer Networks and ISDN Systems, pp. 504-517, December 1994.
[24] Buddhikot, M., and Parulkar, G. M., "Efficient Data Layout, Scheduling and Playout Control in MARS," Proceedings of NOSSDAV95, Durham, New Hampshire, April 1995.
[25] Buddhikot, M., and Parulkar, G. M., "Efficient Data Layout, Scheduling and Playout Control in MARS," ACM/Springer Multimedia Systems Journal, Vol. 5, No. 3, pp. 199-211, 1997.
[26] Buddhikot, M., Parulkar, G., and Gopalakrishnan, R., "Scalable Multimedia-On-Demand via World-Wide-Web (WWW) with QOS Guarantees," Proceedings of the Sixth International Workshop on Network and Operating System Support for Digital Audio and Video, NOSSDAV96, Zushi, Japan, April 23-26, 1996.
[27] Buddhikot, M., Chen, J., Wu, D., and Parulkar, G., "Extensions to 4.4 BSD UNIX for Networked Multimedia in Project MARS," Proceedings of the IEEE Conference on Multimedia Computer Systems, Austin, Texas, June 27-31, 1998.
[28] Buddhikot, M., Wu, D., Jane, X., and Parulkar, G., "Project MARS: Experimental Scalable and High Performance Multimedia-On-Demand Services and Servers," Technical Report (in preparation), Department of Computer Science, Washington University in St. Louis.
[29] Buddhikot, M., Parulkar, G. M., and Cox, J. R., Jr., "Distributed Layout, Scheduling, and Playout Control in a Multimedia Storage Server," Proceedings of the Sixth International Workshop on Packet Video, Portland, Oregon, pp. C1.1-C1.4, September 26-27, 1994.
[30] Buddhikot, M., Parulkar, G., Rangan, V., and Sampatkumar, S., "Design of Storage Servers and Storage Hierarchies," Handbook of Multimedia Systems, Chapter 10, pp. 279-333, Prentice Hall International, Inc.
[31] Chang, Ed, and Zakhor, A., "Scalable Video Placement on Parallel Disk Arrays," Image and Video Databases II, IS&T/SPIE International Symposium on Electronic Imaging: Science and Technology, San Jose, February 1994.
[32] Chang, Ed, and Zakhor, A., “Variable Bit Rate MPEG Video Storage on Parallel Disk Arrays,” First International Workshop on Community Networking, San Francisco, July 1994.
[33] Chen, M., Kandlur, D., and Yu, P. S., "Support for Fully Interactive Playout in a Disk-Array-Based Video Server," Proceedings of the Second International Conference on Multimedia, ACM Multimedia'94, 1994.
[34] Chen, P., et al., "RAID: High-performance, Reliable Secondary Storage," ACM Computing Surveys, June 1994.
[35] Chervenak, A., "Performance Measurements of the First RAID Prototype," Technical Report, Department of Computer Science, University of California, Berkeley, 1990.
[36] Chervenak, A., "Tertiary Storage: An Evaluation of New Applications," PhD Dissertation, Department of Computer Science, University of California, Berkeley, 1995.
[37] Cranor, C., "BSD ATM," Release Notes, Washington University in St. Louis, July 3, 1996.
[38] Cranor, C., and Parulkar, G., "Design of Universal Continuous Media I/O," Proceedings of NOSSDAV95, pp. 83-86, April 1995.
[39] Cranor, C., "Design and Implementation of the UVM Virtual Memory System," Doctoral Dissertation, Washington University in St. Louis, MO, August 1998.
[40] Daigle, S., "Disk Scheduling for Continuous Media Data Streams," Master's Thesis, Carnegie Mellon University, December 1992.
[41] Dan, A., Sitaram, D., and Shahbuddin, P., "Scheduling Policies for a Video-On-Demand Server," Proceedings of ACM Multimedia, pp. 15-23, October 1994.
[42] Dan, A., Sitaram, D., and Shahbuddin, P., "Dynamic Batching Policies for an On-demand Video Server," Multimedia Systems, 4(3): 112-121, June 1996.
[43] Demers, A., Keshav, S., and Shenker, S., "Analysis and Simulation of a Fair Queueing Algorithm," ACM SIGCOMM, pp. 3-12, August 1989.
[44] Dey, J., et al., "Providing VCR Capabilities in Large-Scale Video Servers," Proceedings of ACM Multimedia'94, San Francisco, September 1994.
[45] Dittia, Z., Cox, J., and Parulkar, G., "Catching up with the Networks: Host I/O at Gigabit Rates," Technical Report WUCS-94-11, Department of Computer Science, Washington University in St. Louis, April 1994.
[46] Dittia, Z., Cox, J. R., and Parulkar, G., "Design of the APIC: A High Performance ATM Host Network Interface Chip," Proceedings of IEEE INFOCOM'95, Boston, pp. 179-187, 1995.
[47] Druschel, P., and Peterson, L., "Fbufs: A High-Bandwidth Cross Domain Transfer Facility," Proceedings of the 14th SOSP, pp. 189-202, December 1993.
[48] Fall, K., and Pasquale, J., "Exploiting In-Kernel Data Paths to Improve I/O Throughput and CPU Availability," Proceedings of the USENIX Winter Technical Conference, San Diego, California, pp. 327-333, January 1993.
[49] Fall, K., and Pasquale, J., "Improving Continuous-Media Playback Performance With In-Kernel Data Paths," Proceedings of the IEEE International Conference on Multimedia Computing and Systems (ICMCS), Boston, MA, pp. 100-109, June 1994.
[50] Golestani, S. J., "Congestion-Free Communication in High-Speed Packet Networks," IEEE Transactions on Communications, Vol. 39, No. 12, pp. 1802-1812, December 1991.
[51] Golestani, S. J., "A Self-Clocked Fair Queueing Scheme for High Speed Applications," IEEE Journal on Selected Areas in Communications, pp. 1064-1077, September 1994.
[52] Gopal, R., "Efficient Quality of Service Support in Computer Operating Systems for High Speed Networking and Multimedia," Doctoral Dissertation, Washington University in St. Louis, December 1996.
[53] Gopalakrishnan, R., and Parulkar, G. M., "A Framework for QoS Guarantees for Multimedia Applications within an End-system," Swiss German Computer Science Society Conference, 1995.
[54] Gopalakrishnan, R., and Parulkar, G. M., "A Real-time Upcall Facility for Protocol Processing with QOS Guarantees," (poster) ACM Symposium on Operating Systems Principles (SOSP), Copper Mountain, Colorado, December 1995.
[55] Goyal, P., Guo, X., and Vin, H. M., "A Hierarchical CPU Scheduler for Multimedia Operating Systems," 2nd Symposium on Operating Systems Design and Implementation (OSDI'96), pp. 107-121, October 1996.
[56] Goyal, P., and Vin, H. M., "Start-time Fair Queueing: A Scheduling Algorithm for Integrated Services Packet Switching Networks," Proceedings of ACM SIGCOMM, Stanford, August 1996.
[57] Grossglauser, M., and Ramakrishnan, K. K., "SEAM: Scalable, Efficient ATM Multicast," Proceedings of NOSSDAV96, Zushi, Japan, April 1996.
[58] Guttag, K., Gove, R., and Van Aken, J., "A Single-Chip Multiprocessor for Multimedia: The MVP," IEEE Computer Graphics and Applications, pp. 53-64, November 1992.
[59] Fraleigh, J. B., "A First Course in Abstract Algebra," Addison-Wesley Publishing Company, pp. 52-53, 1967.
[60] Eriksson, H., "MBONE: The Multicast Backbone," Communications of the ACM, Vol. 37, pp. 54-60, August 1994.
[61] Haskin, R., and Schmuck, F., "The Tiger Shark File System," Proceedings of COMPCON, Spring 1996.
[62] Hwang, K., and Briggs, F., "Computer Architecture and Parallel Processing," Prentice Hall, Inc., Chapter 7, pp. 497-498, 1984.
[63] Holton, M., and Das, R., "XFS: A Next Generation Journalled 64-bit File System with Guaranteed Rate I/O," Technical Report, Silicon Graphics (online: http://www.sgi.com/Technology/xfs-whitepaper.com).
[64] Hsieh, J., et al., "Performance of a Mass Storage System for Video-On-Demand," Proceedings of IEEE INFOCOM'95, pp. 771-778, April 1995.
[65] Hua, K. A., and Sheu, S., "A New Broadcasting Scheme for Metropolitan Video-On-Demand Systems," Proceedings of ACM SIGCOMM97, Cannes, France, September 14-18, 1997.
[66] Hylton, T., Coffey, K., Parker, A., and Kent, H., "AdStaR Scientists Detect Giant Magnetoresistance in Small Magnetic Fields, Using Easy to Make Sensor," SCIENCE, August 1993.
[67] Jessel, A. H., "Cable Ready: The High Appeal for Interactive Services," Broadcasting & Cable, May 23, 1994.
[68] Keeton, K., and Katz, R., "The Evaluation of Video Layout Strategies on a High Bandwidth File Server," Proceedings of the International Workshop on Network and Operating System Support for Digital Audio and Video (NOSSDAV'93), Lancaster, U.K., November 1993.
[69] Kawachiya, K., and Tokuda, H., "Q-Thread: A New Execution Model for Dynamic QOS Control of Continuous-Media Processing," NOSSDAV96, Japan, April 1996.
[70] Kleiman, S., "Design of vnode interface," Proceedings of the USENIX Symposium, 1986.
[71] Khanna, S., et al., "Real-time Scheduling in SunOS 5.0," USENIX, Winter 1992, pp. 375-390.
[72] Krishan, K., and Yavatkar, R., "The AQUA Bul.....", Doctoral Dissertation, University of Kentucky, Lexington, KY, 1997.
[73] Kumar, V., "MBONE: Interactive Multimedia on the Internet," New Riders Publishing, Indianapolis, Indiana, 1996.
[74] Lam, S., Chow, S., and Yau, D. K. Y., "An Algorithm for Lossless Smoothing of MPEG Video," Proceedings of ACM SIGCOMM'94, London, August 1994.
[75] Lee, E., et al., "RAID-II: A Scalable Storage Architecture for High-Bandwidth Network File Service," Technical Report UCB//CSD-92-672, Department of Computer Science, University of California at Berkeley, October 1992.
[76] Little, T. D., et al., "A Digital On-demand Video Service Supporting Content-based Queries," Proceedings of ACM Multimedia'93, Anaheim, CA, pp. 427-436, August 1993.
[77] Lougher, P., and Shepherd, D., "The Design of a Storage Server for Continuous Media," The Computer Journal, Vol. 36, No. 1, pp. 32-42, 1993.
[78] LoVerso, S., Isman, M., et al., "sfs: A Parallel File System for the CM-5," Proceedings of the USENIX Summer Conference, June 1993.
[79] Miller, G., Baber, G., and Gilliland, M., "News On-Demand for Multimedia Networks," Proceedings of ACM Multimedia'93, Anaheim, CA, pp. 383-392, August 1993.
[80] Martin, C., Narayan, P. S., Ozden, B., Rastogi, R., and Silberschatz, A., "The Fellini Multimedia Storage Server," Multimedia Information Storage and Management, S. M. Chung, ed., Kluwer Academic Publishers, 1996.
[81] McKusick, M., et al., "The Design and Implementation of the 4.4 BSD Operating System," Addison Wesley, 1996.
[82] McKenney, P. E., "Stochastic Fair Queueing," Proceedings of INFOCOM'90, August 1990.
[83] Mercer, C. W., Savage, S., and Tokuda, H., "Processor Capacity Reserves: Operating System Support for Multimedia Applications," IEEE International Conference on Multimedia Computing and Systems, May 1994.
[84] Minzer, S. E., "Broadband ISDN and Asynchronous Transfer Mode (ATM)," IEEE Communications Magazine, pp. 17-24, September 1989.
[85] Nussbaumer, J., Patel, B., Schaffa, F., and Sterbenz, J. P. G., "Networking Requirements for Interactive Video on Demand," IEEE Journal on Selected Areas in Communications, January 1995.
[86] Oikawa, S., and Tokuda, H., "Guaranteeing the Execution of User Level Real-time Threads," NOSSDAV96, Zushi, Japan, April 1996.
[87] Parulkar, G. M., and Wu, D., "MARS II: An Integrated Server Farm," invited presentation at ETRI, Korea, March 1998.
[88] Papadimitriou, C. H., Ramanathan, S., and Rangan, P. V., "Information Caching for Delivery of Personalized Video Programs on Home Entertainment Channels," Proceedings of the IEEE International Conference on Multimedia Computing and Systems, Boston, May 1994.
[89] Papadopoulos, C., and Parulkar, G. M., "Retransmission-based Error Control for Continuous Media Applications," Proceedings of NOSSDAV96, Japan, April 1996.
[90] Pasquale, J., Anderson, E., and Muller, P. K., "Container Shipping: Operating System Support for I/O-Intensive Applications," IEEE Computer Magazine, 27(3): 84-93, March 1994.
[91] Patterson, D., et al., "A Case for Redundant Arrays of Inexpensive Disks (RAID)," Proceedings of the 1988 ACM Conference on Management of Data (SIGMOD), Chicago, IL, pp. 109-116, June 1988.
[92] Pugh, W., and Boyer, G., "Broadband Access: Comparing Alternatives," IEEE Communications Magazine, pp. 34-46, August 1995.
[93] Richard, W. D., Costa, P., and Sato, K., "The Washington University Broadband Terminal," Special Issue on High-Speed Host/Network Interface, IEEE Journal on Selected Areas in Communications, Vol. 11, No. 2, pp. 276-282, February 1993.
[94] Richard, W. D., Cox, J. R., Jr., Gottlieb, B. L., and Krieger, K., "The Washington University Multimedia System," ACM Multimedia Systems, Springer-Verlag, 1:120-131, 1993.
[95] Richard, W. D., Cox, J. R., Jr., Engebretson, A. M., Fritts, J., Gottlieb, B. L., and Horn, C., "Production-Quality Video Over Broadband Networks: A System Description and Two Interactive Applications," IEEE Journal on Selected Areas in Communications, Vol. 13, No. 5, pp. 806-815.
[96] Reddy, A. L. N., and Wyllie, J., "Disk Scheduling Algorithms for Multimedia Operating Systems," Proceedings of ACM Multimedia'93, Anaheim, CA, pp. 225-234, August 1993.
[97] Rashid, R., and Robertson, G., "Accent: A Communication-Oriented Network Operating System Kernel," Proceedings of the 8th Symposium on Operating System Principles, ACM Press, New York, pp. 64-85, 1981.
[98] Rowe, L., Boreczky, J., and Eads, C., "Indexes for User Access to Large Video Databases," IS&T/SPIE Symposium on Electronic Imaging Science and Technology, Conference 2185, 1994.
[99] Salehi, J., Zhang, Z., Kurose, J., and Towsley, D., "Supporting Stored Video: Reducing Rate Variability and End-to-End Resource Requirements through Optimal Smoothing," ACM SIGMETRICS Conference, Philadelphia, October 1996.
[100] Salem, K., and Garcia-Molina, H., "Disk Striping," IEEE International Conference on Data Engineering, 1986.
[101] Schulzrinne, H., "A Comprehensive Multimedia Control Architecture for the Internet," Proceedings of NOSSDAV97, St. Louis, MO, USA, pp. 69-80, May 18-21, 1997.
[102] Shenoy, P., Goyal, P., Rao, S. S., and Vin, H., "Symphony: An Integrated Multimedia File System," Technical Report TR-97-09, Department of Computer Sciences, University of Texas at Austin, March 1997.
[103] Sincoskie, W., "System Architecture for a Large Scale Video on Demand Service," Computer Networks and ISDN Systems, North Holland, Vol. 22, pp. 155-162, 1991.
[104] Shreedhar, M., and Varghese, G., "Efficient Fair Queueing using Deficit Round Robin," ACM/IEEE Transactions on Networking, 1995.
[105] Tewari, R., Mukherjee, R., Dias, D., and Vin, H. M., "Real-time Issues for Clustered Multimedia Servers," IBM Research Report RC 20020 (88561).
[106] Tokuda, H., Nakajima, T., and Rao, P., "Real-Time Mach: Towards a Predictable Real-time System," USENIX Mach Workshop, October 1990.
[107] Thadani, M., and Khalidi, Y., "An Efficient Zero-copy I/O Framework for UNIX," Technical Report SMLI TR-95-39, Sun Microsystems Laboratories, Inc., May 1995.
[108] Thorpe, J., Personal Communication, March 1996.
[109] Turner, J., "A Dynamic Sub-channel Mapping Scheme for Many-to-one ATM Multicast," Proceedings of the Workshop on IP-ATM Integration, June 1996.
[110] Turner, J., "Bandwidth Management in ATM Networks Using Fast Buffer Reservation," IEEE Networks Magazine, Vol. 6, No. 5, pp. 50-58, September 1992.
[111] Turner, J., "An Optimal Nonblocking Multicast Virtual Circuit Switch," Proceedings of IEEE INFOCOM'94, Vol. 1, pp. 298-305, June 1994.
[112] Turner, J., "A Gigabit Multicast Switch: System Architecture Document," Applied Research Laboratory, Washington University in St. Louis, February 1994.
[113] Tobagi, F., Pang, J., Baird, R., and Gang, M., "Streaming RAID - A Disk Array Management System for Video Files," Proceedings of ACM Multimedia'93, Anaheim, CA, pp. 393-400, August 1993.
[114] Tong, X., Parulkar, G., and Buddhikot, M., “An Interactive Multimedia Document Composition and Playback Service,” Washington University in St. Louis, Technical Report (in preparation).
[115] Tzou, S.-Y., and Anderson, D., “The performance of message-passing using restricted virtual memory re-mapping,” Software - Practice and Experience, 21(3):251-267, March 1991.
[116] Venkatramani, C., and Chiueh, T., “Survey of Near-Line Storage Technologies: Devices and Systems,” Technical Report #2, Experimental Computer Systems Laboratory, Department of Computer Science, SUNY Stony Brook, NY, October 1993.
[117] Rangan, V., and Vin, H., “Designing File Systems for Digital Video and Audio,” Proceedings of the 13th Symposium on Operating System Principles, Operating Systems Review, pp. 81-94, October 1991.
[118] Vin, H., and Rangan, V., “Design of a Multi-user HDTV Storage Server,” IEEE Journal on Selected Areas in Communications, Special Issue on High Definition Television and Digital Video Communication, Vol. 11, No. 1, January 1993.
[119] Vin, H., et al., “A Statistical Admission Control Algorithm for Multimedia Servers,” Proceedings of ACM Multimedia ’94, San Francisco, October 1994.
[120] Vin, H., et al., “An Observation-Based Admission Control Algorithm for Multimedia Servers,” Proceedings of the IEEE International Conference on Multimedia Computing and Systems (ICMCS ’94), Boston, pp. 234-243, May 1994.
[121] Viswanathan, S., and Imielinski, T., “Metropolitan Area Video-On-Demand Service Using Pyramid Broadcasting,” Multimedia Systems, 4(4):197-208, August 1996.
[122] Welch, B., “The File System Belongs in the Kernel,” Proceedings of the 2nd USENIX Mach Symposium, pp. 233-250, November 20-22, 1991.
[123] Yau, D., and Lam, S., “Operating System Techniques for Distributed Multimedia,” to appear in International Journal of Intelligent Systems, Special Issue on Multimedia Computing Systems.
[124] Yau, D., and Lam, S., “Adaptive Rate-Controlled Scheduling for Multimedia Applications,” IEEE/ACM Transactions on Networking, Vol. 5, No. 4, August 1997.
[125] Yu, P., Chen, M., and Kandlur, D., “Grouped Sweeping Scheduling for DASD-based Storage Management,” Multimedia Systems, Springer-Verlag, pp. 99-109, December 1993.
[126] Zou, W. Y., “Digital HDTV Compression Techniques for Terrestrial Broadcasting,” High Definition (HD) World Review, Vol. 3, No. 3, pp. 4-10, 1992.
Vita

Milind M. Buddhikot
[email protected]

Date of Birth
January 7, 1966
Place of Birth
Thane, India
Degrees
D.Sc. Computer Science, Washington University, St. Louis, Missouri, August 1998
M.Tech. Communication Engineering, Indian Institute of Technology, Bombay, December 1989
B.E. Electrical Engineering, University of Bombay, May 1987
Professional Experience
Research Assistant, Sept. ’90 - July ’98, Washington University in St. Louis
Summer Research Intern, June ’94 - August ’94, NEC USA, Inc., Princeton, New Jersey
Summer Research Intern, June ’93 - August ’93, NEC USA, Inc., Princeton, New Jersey
Research Assistant, Sept. ’89 - August ’90, Simon Fraser University
Peripheral Software Engineer, Feb. ’89 - Aug. ’89, Godrej and Boyce Mfg. Co., Bombay, India
Honors and Awards
ACM Computer Science Conference Student Poster First Place Award, 1997
AGES Research Fair Second Place Award, 1997
Network Design Challenge competition, sponsored by Andersen Consulting Co., March 30, 1996
Lang Wong Memorial Award, Simon Fraser University, Spring 1990
National Merit Scholarship, Govt. of India, 1981-87
Professional Societies

Association for Computing Machinery

Publications
JOURNAL PAPERS
Buddhikot, M., and Parulkar, G., “Efficient Data Layout, Scheduling and Playout Control in MARS,” (invited paper) ACM/Springer Multimedia Systems Journal, Vol. III, May 1997.
Buddhikot, M., Parulkar, G., and Cox, J. R. Jr., “Design of a Large Scale Multimedia Server,” (invited publication) Computer Networks and ISDN Systems Journal, Elsevier North-Holland Publishers.
Kapoor, S., Buddhikot, M., and Parulkar, G., “Design of an ATM-FDDI Gateway,” Journal of Internetworking: Research and Experience, Vol. 4, No. 1, pp. 21-45, March 1993.

BOOK CHAPTERS
Buddhikot, M., Parulkar, G., Rangan, V., and Sampatkumar, S., “Design of Storage Servers and Storage Hierarchies,” Handbook of Multimedia Systems, Prentice Hall International, Inc.
Buddhikot, M., and Parulkar, G., “Efficient Data Layout, Scheduling and Playout Control in MARS,” Lecture Notes in Computer Science, Vol. 1018, Springer-Verlag, ISBN 3-540-60647-5.

CONFERENCE/WORKSHOP PAPERS
Buddhikot, M. M., Chen, X. J., Wu, D., and Parulkar, G., “Enhancements to 4.4 BSD UNIX for Networked Multimedia in Project MARS,” Proceedings of IEEE Multimedia Systems ’98, Austin, Texas, USA, June 28-July 1, 1998.
Buddhikot, M., Parulkar, G., and Gopalakrishnan, R., “Scalable Multimedia-On-Demand via World-Wide-Web (WWW) with QOS Guarantees,” Proceedings of the Sixth International Workshop on Network and Operating System Support for Digital Audio and Video (NOSSDAV ’96), Zushi, Japan, April 23-26, 1996.
Buddhikot, M., and Parulkar, G., “Efficient Data Layout, Scheduling and Playout Control in MARS,” Proceedings of the 5th International Workshop on Network and Operating System Support for Digital Audio and Video (NOSSDAV), April 1995.
Buddhikot, M., Parulkar, G., and Cox, J. R. Jr., “Distributed Layout, Scheduling and Playout Control in a Multimedia Storage Server,” Proceedings of the Sixth International Packet Video Workshop, Portland, Oregon, USA, September 26-27, 1994.
Buddhikot, M., Parulkar, G., and Cox, J. R. Jr., “Design of a Large Scale Multimedia Server,” Proceedings of INET ’94/JENC5, Conference of the Internet Society and Joint European Networking Conference, Prague, June 1994.
Parulkar, G., Buddhikot, M., Cranor, C., Dittia, Z., and Papadopoulos, C., “The 3M Project: Multi-point Multimedia Applications on Multiprocessor Workstations and Servers,” Proceedings of the IEEE Workshop on High Performance Communication Systems, September 1993.
Buddhikot, M., Kapoor, S., and Parulkar, G., “Simulation of an ATM-FDDI Gateway,” Proceedings of the 18th IEEE Conference on Local Computer Networks, Minneapolis, Minnesota, pp. 403-412, September 1993. Also Technical Report WUCS-92-36, Department of Computer Science, Washington University in St. Louis.
Sterbenz, J., Kantawala, A., Buddhikot, M., and Parulkar, G., “Hardware Based Error and Rate Control in the AXON Gigabit Host-Network Interface,” Proceedings of IEEE INFOCOM ’92, Conference on Computer Communications, 1992.

PROFESSIONAL ACTIVITIES
Tutorial Chair, IEEE LCN ’95 and LCN ’96.
Tutorial Co-chair, IEEE LCN ’94.
Publicity Chair, Seventh International Workshop on Network and Operating System Support for Digital Audio and Video (NOSSDAV ’97).
Program Committee member, IEEE LCN ’94, LCN ’95, and LCN ’96.
Program Committee member, Demonstrations Program of ACM Multimedia ’95.
Referee: Journal of High Speed Networks, ACM SIGCOMM, IEEE INFOCOM, IEEE GLOBECOM, IEEE LCN, IEEE Multimedia Magazine, IEEE/ACM Transactions on Networking, NOSSDAV ’95, NOSSDAV ’96, NOSSDAV ’97, IEEE RTAS ’97.
Founding member of the Association of Graduate Engineering Students (AGES) at the School of Engineering, Washington University in St. Louis; served as Treasurer from January 1995 to September 1996.

August 1998