Performance of A Storage System for Supporting Different Video Types and Qualities

Jonathan C.L. Liu, Jenwei Hsieh and David H.C. Du
Distributed Multimedia Center1 & Department of Computer Science
University of Minnesota

Mengjou Lin2

Advanced Technology Group Apple Computer, Inc.

Abstract

Future Video-On-Demand (VOD) servers will need to support many existing and emerging video data types. These data types include 15-fps (frames per second) animation, 30-fps NTSC (National Television Systems Committee) television-quality video and 60-fps HDTV (High Definition Television) video. The different display speeds and frame sizes of the various video types impose a major constraint on the design of VOD storage systems. This paper presents the results of an experimental study, conducted on a Silicon Graphics Inc. Onyx computer system, that investigated the impact of these video types on the design of a VOD storage system. The key issues involved in supporting these different video types in a VOD environment are: (1) the video allocation method, and (2) the proper block size (a block is a basic unit of several contiguous video frames that is accessed from several disks each time a request is made) to use for data striping and retrieval. Two allocation schemes, logical volume striping and application level striping, along with varying frame and block sizes for each of the three different video data types, are examined in this paper. The focus of our study was to determine the maximum number of concurrent accesses that can be supported with a guaranteed quality of service. The degree of scalability (i.e., striping data over more disk arrays) of the experimental VOD system was also studied.

Keywords: Multimedia, Video-on-Demand, Mass Storage System, Disk Array, Data Striping

1 Distributed Multimedia Center (DMC) is sponsored by US WEST, Honeywell, IVI Publishing, Computing Devices International and Network Systems Corporation.
2 This work was performed while Dr. Mengjou Lin was a PhD candidate at the University of Minnesota.

1 Introduction

Multimedia, and in particular video media, has developed into an exciting computer capability that has the potential to transform communication in many fields, including medicine, education and business. The design of a Video-On-Demand (VOD) multimedia server capable of providing concurrent streams of video to multiple users is a critical component of the new multimedia computing era that we are entering. As a result of the rapid advances in processing hardware, storage devices, and communication networks, VOD service will soon become a reality. VOD service offers a paradigm shift from the traditional passive multimedia environment to a more active, user-controlled environment that provides video information anywhere, at any time, with guaranteed quality. The difficulty in providing VOD with a guaranteed quality of service is due primarily to the variation in access delays over the delivery path. First, the data has to be retrieved from a multimedia server; then it is transmitted over a communication network to the end-user system; and finally it is displayed on the end-user system at a constant data rate. In this paper, we concentrate on the performance of the storage system component. To simplify the network requirements, we assume that video frames are sent over the network at the same rate as they are displayed on the end-user system. Queuing delays may be introduced by contention on the storage devices due to multiple concurrent accesses and the variable amount of time required to read a data block from disk. Although video files can be stored contiguously so that seek and latency times are reduced, servicing many concurrent requests will still impose a large variation in seek and latency times for video accesses. Therefore, the objective is to design a storage system that supports a maximum number of concurrent video accesses with minimal delay jitter.
A common technique used to support VOD service with guaranteed video quality is to enlarge the block size (i.e., buffer size) for each concurrent access (i.e., stream) to accommodate the variation in disk access time. The dilemma of enlarging the buffer size is the following: if the allocated buffer size is not sufficient, then the quality of the delivered video stream cannot be guaranteed (i.e., jitter may occur). Allocating a larger buffer than necessary can easily maintain the quality of the video streams; however, fewer concurrent video streams can be supported because the total memory space in the VOD server is limited. Enlarging the memory space increases the cost of video delivery. Therefore, the determination and allocation of properly sized memory buffers is essential in the design and implementation of a large-scale VOD server. The problem becomes much more complicated when one considers that the block sizes chosen for different video types play an important role in the performance of a large-scale VOD server. Many video types are already widely used or currently undergoing standardization. Fifteen frames per second (15-fps) animation video is widely accepted for CD-ROM titles (although many animation videos can be specified up to 30-fps, a 15-fps display speed is usually achieved in personal computer environments). Broadcast-quality digital NTSC video, on the other hand, always demands 30 fps for its display requirement. We believe that digital NTSC video will become more popular when the standardization, quality, and price of compression satisfy consumers. In the long term, high definition television (HDTV) video is expected to become the

major commercially-delivered video type. HDTV video is often characterized by a 60 fps display requirement. Table 1 lists the typical memory requirements for these video types. Advances in compression techniques make it possible to have VLSI chips perform decompression on the fly when displaying video. Different resolutions result in different compressed frame sizes. To perform systematic experiments, the size of a compressed video frame is assumed to be fixed. Current video compression schemes usually adopt run-length encoding, thus producing variable frame sizes. However, it is our experience that JPEG video with few scene changes produces similar frame sizes [1]. Similarly, a video stream without severe scene changes can also be encoded as an MPEG stream with only I frames, again resulting in similar frame sizes. Since a video block consists of several video frames, the performance will be close to what we obtained. The different frame sizes, mainly inherited from the window size and resolution of each video type, impose different buffering requirements on storage and retrieval. Animation video usually imposes small (e.g., 16 KBytes) to medium (e.g., 32 KBytes) frame sizes, whereas a medium frame size is usually required for NTSC video. Specifications of HDTV video usually impose large or very large frame sizes (e.g., 64 to 128 KBytes).

Table 1: Memory requirements for animation, NTSC and HDTV video types

Video Type   Display Speed   Resolution                      Uncompressed   Compressed   Compression
             (frames/sec)    (Width x Height x bits/pixel)   Frame Size     Frame Size   Ratio
Animation    15              512 x 480 x 8                   240 KBytes     16 KBytes    15
NTSC         30              512 x 480 x 24                  720 KBytes     32 KBytes    22.5
HDTV         60              1248 x 960 x 24                 3510 KBytes    64 KBytes    54.8
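The figures in Table 1 follow directly from the listed resolutions and compressed frame sizes. As a sketch (the resolutions and compressed sizes are taken from the table; the helper name is ours):

```python
# Reproduce the uncompressed frame sizes and compression ratios of Table 1.
KB = 1024

def uncompressed_kbytes(width, height, bits_per_pixel):
    """Raw frame size in KBytes for a given resolution and pixel depth."""
    return width * height * bits_per_pixel / 8 / KB

video_types = {
    # name: (width, height, bits/pixel, compressed frame size in KBytes)
    "Animation": (512, 480, 8, 16),
    "NTSC":      (512, 480, 24, 32),
    "HDTV":      (1248, 960, 24, 64),
}

for name, (w, h, bpp, comp_kb) in video_types.items():
    raw_kb = uncompressed_kbytes(w, h, bpp)
    print(f"{name}: {raw_kb:.0f} KBytes uncompressed, "
          f"compression ratio {raw_kb / comp_kb:.1f}")
```

For instance, an HDTV frame at 1248 x 960 x 24 bits works out to 3510 KBytes uncompressed, a 54.8x ratio against the 64 KByte compressed frame.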

We believe it is important that an effective VOD server be able to support each of these three video types. The varying frame sizes and different video types require different buffer sizes, and each causes different performance to be achieved by the VOD system. Therefore, it is important to investigate the impact of the different video types and determine appropriate buffer management strategies. We address these essential performance characterizations in this paper. We not only propose how these effects interact, but also verify this through experiments with an actual system (an SGI Onyx) which could be used as a large-scale VOD server. Single disk storage systems were usually assumed in the earlier studies of VOD servers [2, 3, 4, 5, 6, 7, 8]. However, in order to provide a large number of concurrent accesses, the VOD server should be equipped with multiple disk storage sub-systems. Storage systems of multiple independent disks were investigated in [9, 10, 11, 12, 13, 14]. Storage systems of RAID (Redundant Arrays of Inexpensive Disks) architectures [15] were reported in [16, 17, 18]. However, only one video type (e.g., usually MPEG or AVI video) was assumed in these studies. We have already reported the experimental performance of a storage system based on an array of RAID 3 disks in [19], using a Silicon Graphics (SGI) Onyx symmetric multiprocessor computer. A RAID 3 disk array uses bit or byte striping across 8 synchronized disks. We used 30-fps

compressed NTSC video (e.g., MPEG-2) for that study. This paper further explores these issues by addressing the impact of three video types and different frame sizes. In this paper, we use the terms RAID 3 and disk array interchangeably. We classify animation, NTSC and HDTV video based on three parameters: video frame size (KBytes per frame), display speed (frames per second), and requested block size (frames per block), as defined earlier. The performance of a single disk array is first measured. The main objective is to investigate how many concurrent video streams of the different video types can be supported with acceptable quality (in terms of small delay or jitter). Using the performance of one RAID 3 as the basis of comparison, we extended our experiments to support the three different video types with multiple RAID 3s. One unique performance metric of multiple disk arrays is scalability (defined in Section 4) when more RAID 3s are added to serve the increasing demands of concurrent accesses. Server scalability is an important issue that needs to be measured and analyzed. Two allocation methods (described in Section 2) for multiple RAID 3s are considered: logical volume striping and application level striping. A logical volume is a system-supported mechanism on SGI's Onyx which can map certain portions of the storage space of multiple RAIDs into a single linear space. The data stored on a logical volume can be striped across these RAIDs. Therefore, logical volume striping translates a disk request into sub-requests to multiple RAIDs. The sub-requests can be executed by each RAID in parallel. This scheme can potentially improve the access time of a single disk access because of parallel processing. However, it is not clear how logical volume striping performs for multiple concurrent accesses. We have proposed another allocation scheme called application level striping in [19].
In application level striping, the data can be striped over multiple logical volumes or disks. However, each logical volume is intentionally kept small (i.e., only one-way or two-way logical volume striping is used). Two-way logical volume striping is striping over two RAIDs. Based on our experimental results, application level striping demands smaller block sizes for all three video types, and more concurrent accesses can be distributed over the storage devices. The experimental results demonstrate that application level striping has excellent scalability for the animation and NTSC video. For 60-fps HDTV video, application level striping achieves a great improvement for a frame size of 64 KBytes. HDTV video with very large frame sizes prohibits further improvement due to the start-up contention among the concurrent accesses. Replication of video files or the adoption of start-up scheduling can reduce contention and further improve the performance. The remainder of this paper is organized as follows. We describe our experimental environment in Section 2. The experimental results on supporting concurrent accesses for different video types on a single RAID 3 are introduced in Section 3. Section 4 discusses the performance results for logical volume striping. The performance of application level striping is illustrated and compared to logical volume striping in Section 5. We conclude the paper in Section 6.


2 An Experimental Large-Scale VOD Server

The experiments presented in this paper were conducted on a Silicon Graphics (SGI) Onyx symmetric multiprocessor computer, as shown in Figure 1. The Onyx is equipped with 20 MIPS R4400 processors, 512 MBytes of memory, 1.2 Gigabytes/sec system bandwidth and three powerful I/O subsystems, each of which can achieve up to a 320 MBytes/sec transfer rate. The Onyx computer was chosen because of its processing power, memory bandwidth, I/O transfer rate, and extensibility. The Onyx computer is connected to a storage system consisting of an array of RAID 3s. Up to 24 RAIDs were used in our experiments. Each RAID 3 is connected to the Onyx by a fast and wide SCSI-II channel (20 MBytes/sec) and is controlled by a Ciprico 6710 controller with 8 data + 1 parity Seagate ST12400N 2.1 Gigabyte (formatted) drives.

Figure 1: The Onyx computer hardware configuration.

The Onyx computer runs the IRIX 5.2 operating system, a fully symmetric multiprocessing Unix System V derivative. Figure 2 shows the software hierarchy of the storage subsystem. The Logical Volume (lv) device driver implements a basic striping algorithm to allow parallel accesses to multiple disk devices. The dksc driver is the generic SCSI disk driver. Applications can access storage devices via either lv or dksc. When accessing the lv or dksc drivers, applications view the disk device as a large continuous file.

2.1 Data Allocation Schemes



Figure 2: The Onyx software hierarchy of the storage subsystem.

2.1.1 Logical Volume Striping

A logical volume is a storage entity which behaves like a traditional disk partition, but its storage may span several physical devices. In our case, the physical devices are RAID 3s. Two parameters are used to define the configuration of a logical volume:

- Striping width: the value of the striping width specifies the number of physical storage devices used in a logical volume.
- Striping granularity (step size): the striping granularity (or step size) specifies the maximum amount of data that is transferred to or from one RAID 3 before switching to the next RAID 3, for a contiguous stream of data that spans multiple disk arrays within a logical volume.

The logical volume striping scheme may not employ all the participating disk arrays for every disk operation. The number of disk arrays utilized in one disk operation depends on the request size and the step size. If the request size is smaller than (striping width - 1) x step size, some of the storage devices will remain idle during that operation. In order to maximize the parallelism of the logical volume, we set the step sizes such that all the disk arrays within the logical volume are always accessed in parallel.
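The utilization rule above can be sketched as follows (a simplified model assuming requests aligned to a stripe boundary; the function names are ours):

```python
import math

# With logical volume striping, a request is spread in step-size chunks
# round-robin across the arrays, so the number of arrays touched by one
# operation is bounded by the request size and the step size.

def arrays_used(request_size, step_size, striping_width):
    """Number of RAID 3s touched by a single aligned request."""
    return min(striping_width, math.ceil(request_size / step_size))

def max_parallel_step(request_size, striping_width):
    """Largest step size that still spreads the request over all arrays."""
    return math.ceil(request_size / striping_width)
```

For example, a 1024 KByte request over an 8-wide logical volume with a 256 KByte step touches only 4 arrays; choosing a step size of 128 KBytes makes all 8 arrays participate.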

2.1.2 Application Level Striping

Logical volume striping provides a service abstraction for users and handles the storing and retrieval of data. In contrast, application level striping requires applications to handle data storing and retrieval by themselves. This means that applications must know exactly where to retrieve and store video data. In application level striping, the video file is divided into blocks and stored on the constituent storage devices in a round-robin fashion. To retrieve a video file, each user accesses one block of the video file from one of the storage devices at a time. This access method allows other users to access different blocks of the same video file, which may be stored on other devices, with the


same application level striping scheme. Application level striping can be implemented on disk arrays or logical volumes. It may employ multiple logical volumes with different striping widths and step sizes.
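The round-robin placement can be sketched as a simple index mapping (illustrative only; the device count and function name are ours):

```python
# Round-robin block placement in application level striping: block i of a
# video file lives on device (i mod D), at the (i div D)-th block slot on
# that device.

def block_location(block_index, num_devices):
    """Map a video block index to (device, block slot on that device)."""
    return block_index % num_devices, block_index // num_devices
```

With two devices, blocks 0, 1, 2, 3 land on devices 0, 1, 0, 1, so users reading consecutive blocks of the same file tend to hit different devices.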

2.2 Video Retrieval Processes

For each end-user, the VOD server assigns one retrieval process to retrieve the video file. The retrieval process provides a video retrieval service to the end-user with a sustained bandwidth. The retrieval processes periodically send read requests to the storage system (directly to disk arrays or through logical volumes) and wait for completion of the read requests. Each read request sent to the storage system asks for a block of video (normally multiple frames). The read requests should be fulfilled by the storage system within a specified time period. The specific time by which the read request must be completed is called the deadline. After the video frames are retrieved, the retrieval process sleeps until the next time interval. While the retrieval process is idle, the transferred video block should be delivered to the network through the network interface. With this retrieve-and-idle paradigm, a retrieval process provides a video stream with a sustained transfer rate to its user. However, retrieval processes only retrieve video frames from the storage system into the memory subsystem of the VOD server; the VOD server does not deliver the video frames through the network interface to the remote user at this time. Therefore, we made the assumption that the block of video frames is always successfully delivered in time by the network interface to the user after retrieval from the storage system. In our experiments, we did not apply any scheduling mechanism to the operating system to control the retrieval processes. Since no disk scheduling policies are implemented in the RAID 3 [20], the access requests are served by the physical storage devices on a FCFS (First Come, First Served) basis.3 All processes start at around the same time.
To minimize the impact of virtual memory management and process scheduling by the operating system, retrieval processes and buffers are page-locked in memory to avoid any page fault or program swapping during the experiments. The retrieval process is also set to a non-degrading, high-priority mode to reduce side effects from the operating system's process scheduling. Ideally, a VOD server should be able to service requests at any time to any position of a video file from the video library collection. To emulate this behavior, we adopted a rule that requires all concurrent accesses to start at evenly separated offsets of the storage device at the beginning of a retrieval session. Since logical volume striping provides a linear space across the disk arrays, different offsets can be determined at the beginning of video retrieval. Figure 3 depicts the offset selections to support 12 concurrent accesses of video retrieval for a RAID 3 and a logical volume constructed with 2 RAID 3s. The placement of video files always starts from the first disk array in the logical volume. Unlike logical volume striping, a retrieval process using application level striping needs to

3 This means the requests issued from the device driver to the physical storage device are served using a FCFS ordering. However, as we will illustrate later, a device driver such as the logical volume does not serve the requests from the user processes in a consistent FCFS ordering. Readers should not confuse the non-FCFS behavior at the device driver level with the physical storage device's FCFS policy.



Figure 3: Supporting concurrent accesses starting at different offsets using logical volume striping.

access an individual RAID 3 directly. In order to provide a fair comparison, we arrange the video files to start from the first disk array. All concurrent accesses are required to start from the first disk array because of this arrangement. Of course, this arrangement may cause contention among the concurrent streams for the first few accesses on the disk arrays. However, application level striping automatically smoothes the concurrent accesses such that load balancing can still be achieved. This arrangement may also occur in reality. Figure 4 illustrates a scenario supporting 12 concurrent accesses using 2-wide application level striping. Part (b) demonstrates that concurrent accesses are started from the first disk array, and load balancing is achieved after smoothing in Part (c).


Figure 4: Supporting concurrent accesses starting at different offsets using application level striping.
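The evenly separated starting offsets used for logical volume striping (Figure 3) can be sketched as follows (sizes are illustrative; the helper name is ours):

```python
# Evenly spaced starting offsets for N concurrent accesses over a storage
# space of a given size, emulating requests that begin at arbitrary
# positions of a video file.

def start_offsets(storage_size, num_streams):
    """One starting offset per concurrent stream, evenly separated."""
    stride = storage_size // num_streams
    return [i * stride for i in range(num_streams)]
```

For 12 concurrent accesses over a 24 GByte logical volume, for instance, the streams would start 2 GBytes apart.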

2.3 Buffer Arrangement

In all experiments, we allocated two buffers for each retrieval process (we called this buffering method the two-buffer scheme in [21]). The two buffers are used in a pipelined style: one for retrieval from the storage system, and the other for delivering data to the network interface. The roles of the two buffers are then exchanged in the next time interval. During video retrieval, the retrieval process periodically sends a read request to the storage system. If the video frames were retrieved before the deadline (which depends on the number of video frames of each retrieval), the process puts itself in a sleep state to simulate the delivery and display operation until the next time interval. The deadline is determined by the number of video frames retrieved

during each request. For example, if video frames are displayed every 33.33 milliseconds (30-fps NTSC video), a retrieval of 16 video frames should be completed and delivered within 533.28 milliseconds (16 x 33.33 = 533.28). The retrieval process repeats the same operation when it wakes up at the next time interval.
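The deadline is simply the playback duration of one block, i.e. frames per block times the frame period. As a sketch (the function name is ours):

```python
# Per-request deadline: the time budget to retrieve one block equals the
# playback duration of that block at the target display speed.

def deadline_ms(frames_per_block, display_fps):
    """Deadline in milliseconds for retrieving one block of video."""
    frame_period_ms = 1000.0 / display_fps
    return frames_per_block * frame_period_ms
```

For the NTSC example, deadline_ms(16, 30) gives roughly 533 ms; the paper's 533.28 figure comes from rounding the frame period to 33.33 ms before multiplying.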


Figure 5: Retrieval and delivery with two buffers.

In our experiments, we do not drop any video frames when a deadline is missed. If the video frames are not ready (in part or in full) to be delivered before the deadline (this situation is called a missed-deadline delay), the retrieval process waits until the video frames are ready, then resumes the same read operation again. This means the time line must be adjusted. As shown in Figure 5, the retrieval of block i+1 does not complete before the deadline. The process waits until it completes the I/O request, adjusts the time line, and then issues the read request for block i+2 immediately. Therefore, the more missed-deadline delays that occur during the retrieval session, the longer the time required to play back the video file.
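The retrieve-and-idle loop with the two-buffer scheme and the missed-deadline adjustment can be sketched as follows; read_block is a stand-in for the blocking storage read, not the paper's actual code:

```python
import time

def retrieval_loop(read_block, num_blocks, deadline_s):
    """Retrieve num_blocks blocks, one per deadline_s interval, counting
    missed deadlines and slipping the time line on each miss."""
    buffers = [bytearray(), bytearray()]   # two buffers used in a pipeline
    missed = 0
    next_tick = time.monotonic()
    for i in range(num_blocks):
        buf = buffers[i % 2]               # retrieval/delivery roles alternate
        read_block(i, buf)                 # blocks until the I/O completes
        next_tick += deadline_s
        now = time.monotonic()
        if now > next_tick:
            missed += 1
            next_tick = now                # adjust the time line and go on
        else:
            time.sleep(next_tick - now)    # emulate delivery and display
    return missed
```

Note that the time line slips by exactly the overrun on each miss, which is why total playback time grows with the number of missed-deadline delays.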

2.4 Performance Measurement

The number of retrieval processes which can be supported by the storage system is determined as follows. For each retrieval process, we calculate the percentage of missed-deadline retrievals, which is the ratio of the number of missed-deadline requests to the total number of read requests. The average missed-deadline percentage is calculated over all the retrieval processes after each iteration of an experiment. We repeat the experiment, incrementing the number of retrieval processes, until the average missed-deadline percentage exceeds 1%. This particular cut-off point was chosen based on our experimental experience. Ideally, we would like to have 0% jitter quality, but the synchronization overhead between data disks and the operation of the parity disk within the RAID 3 make 0% jitter nearly impossible when supporting concurrent accesses. In our experience, once jitter exceeds 1%, all video streams quickly degrade to very high jitter ratios (e.g., 30% to 40%) as more streams are added. Thus, we obtain the maximum number of retrieval processes which can be supported by the storage system without any scheduling and control mechanisms.
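The measurement procedure can be sketched as a simple search; run_experiment is a placeholder for one experimental iteration, returning per-process (missed, total) request counts, and the 1% threshold is the paper's cut-off:

```python
# Increase the number of retrieval processes until the average
# missed-deadline percentage exceeds the threshold; report the last
# count that stayed within it.

def max_supported_streams(run_experiment, limit=100, threshold=0.01):
    supported = 0
    for n in range(1, limit + 1):
        stats = run_experiment(n)          # one (missed, total) per process
        avg = sum(m / t for m, t in stats) / n
        if avg > threshold:
            break
        supported = n
    return supported
```

The maximum numbers of concurrent accesses reported in Table 2 were obtained by exactly this kind of repeated, incremented experiment.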

3 Performance of A Single Disk Array

We start the experiments with a single disk array to determine a system benchmark for supporting the different video types. In later sections, we scale the experiments to multiple disk arrays, whose performance scalability will be compared to the benchmark in this section.

3.1 Experiment Design and Results

In order to characterize the variations of the different video types, the following parameters are identified in the experiments of this study.

- Frame size is the typical video frame size of a particular video type. The actual value depends on the compression gain and the resolution of the video window. To provide a systematic performance measurement, four video frame sizes (16 KBytes, 32 KBytes, 64 KBytes and 128 KBytes) are defined to represent small, medium, large and very large compressed frame sizes. The expected effect of increasing frame size is a reduction in the number of supported concurrent accesses, because of the increased data transfer time.
- Display speed is the continuous display speed that the storage retrieval intends to achieve. The expected impact of the different video types is the degree of contention among CPUs and (mainly) storage devices.
- Requested block size is the number of video frames in a block. A larger block introduces a longer deadline and less frequent request submissions. On the other hand, a larger block also implies larger buffer memory and a longer disk transfer time for each video retrieval process. Since only one request can be served at a time, the increased transfer time causes a higher probability of storage device contention and thus results in longer queueing delays for the other concurrent accesses.

Table 2 lists the results of the interaction of these three parameters. To reduce the length of this paper, we only list the maximum number of achievable concurrent accesses for each combination. Note that for SGI's Onyx machines, 4 MBytes is the maximum size of a disk request. Therefore, a retrieval request greater than 4 MBytes, where the request size is the product of the frame size and the requested block size (i.e., number of frames), cannot be accomplished in the existing environment. We indicate NA in the experimental results throughout this paper for these cases.
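The NA entries can be reproduced from the 4 MByte limit (the helper name is ours):

```python
# A (frame size, block size) combination is NA when the resulting request
# exceeds the maximum disk request the Onyx accepts.

MAX_REQUEST_KB = 4 * 1024   # 4 MBytes, per the Onyx limit cited above

def request_feasible(frame_size_kb, frames_per_block):
    """True if one block's retrieval fits in a single disk request."""
    return frame_size_kb * frames_per_block <= MAX_REQUEST_KB
```

For example, 128 KByte frames with a 32-frame block exactly fill the 4 MByte limit, while a 64-frame block (8 MBytes) is infeasible, matching the NA entries in Table 2.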

3.1.1 Supporting 15-fps Animation Video

Cases 1 to 4 in Table 2 show the performance of a single disk array for 15-fps animation video. Case 1 shows that a significant and steady increase in the number of concurrent accesses can be achieved by adopting larger block sizes. From a block size of 4 frames to 8 frames, the number of concurrent accesses increases from 9 to 18 (i.e., 100%); from block sizes of 8 frames

Table 2: Number of supported concurrent accesses for different video types using a single RAID 3.

                                                 Requested Block Size (# frames)
Case #   Display Speed      Frame Size      4     8     16    32    64    128
1        15 (animation)     16 KBytes       9     18    32    44    58    67
2        15 (animation)     32 KBytes       9     16    22    28    33    35
3        15 (animation)     64 KBytes       7     11    14    16    17    NA
4        15 (animation)     128 KBytes      5     6     7     8     NA    NA
5        30 (NTSC)          16 KBytes       4     8     16    23    29    33
6        30 (NTSC)          32 KBytes       4     7     11    14    15    17
7        30 (NTSC)          64 KBytes       3     5     6     8     8     NA
8        30 (NTSC)          128 KBytes      2     3     3     4     NA    NA
9        60 (HDTV)          16 KBytes       2     4     7     11    14    16
10       60 (HDTV)          32 KBytes       1     3     5     6     8     8
11       60 (HDTV)          64 KBytes       1     2     3     3     4     NA
12       60 (HDTV)          128 KBytes      1     1     1     2     NA    NA
to 16 frames, the increase is from 18 to 32 (i.e., 77.7%); and from block sizes of 16 frames to 32 frames, 37.5% is obtained. However, from block sizes of 64 frames to 128 frames, only 15.5% is achieved (58 to 67 concurrent accesses). Case 2 also demonstrates this behavior, to a smaller degree (i.e., from 77.7% and 37.5% down to 27.2% and 6.0%). Therefore, Cases 1 and 2 show that the number of concurrent accesses can be increased steadily by adopting larger block sizes when the frame size is small. However, the increase is less significant when the block size is larger than 32 frames or the frame size is larger than 32 KBytes. This is because the size of the requested block greatly affects the achievable data transfer rate of a RAID [22]. The data transfer rate increases accordingly as we enlarge the requested block size, until the size reaches a certain threshold. When the requested block size is larger than the threshold, a smaller increase in data transfer rate is observed. Cases 3 and 4 represent scenarios where larger block sizes quickly reach the maximum sustained transfer rate, and thus show less significant improvement. Case 3 has increase rates of 57.1%, 27.2%, 14.2% and 6.25%. Case 4 has even less significant improvement, with 20.0%, 16.6% and 14.2%. For the frame size of 128 KBytes, increasing the block size does not result in the expected improvement, even though one animation frame provides 66.66 msec of display time. This suggests that although animation video has longer deadlines and less frequent requests, very large frame sizes still impose a strong impact.
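The diminishing returns cited above can be recomputed from the Case 1 row of Table 2 (the helper name is ours; rounding differs slightly from the text):

```python
# Marginal gain from each doubling of the block size, using the Case 1
# measurements (animation video, 16 KByte frames) from Table 2.

def increase_rates(accesses):
    """Percent increase between consecutive measurements."""
    return [100.0 * (b - a) / a for a, b in zip(accesses, accesses[1:])]

case1 = [9, 18, 32, 44, 58, 67]     # block sizes 4, 8, 16, 32, 64, 128
print([f"{r:.1f}%" for r in increase_rates(case1)])
```

The computed sequence starts at 100% and falls to roughly 15.5% for the last doubling, consistent with the trend described in the text.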

3.1.2 Supporting 30-fps NTSC Video

Figure 6 depicts the effect of enlarging block sizes to support NTSC video. Similar to the performance of supporting animation video, the performance curve in Figure 6 shows that the number of concurrent accesses increases significantly when block sizes are small, but the increases are less significant once the block size is large enough. For example, only a few additional concurrent accesses are gained between block sizes of 64 and 128 frames for the frame size of 16 KBytes, and between block sizes of 32 and 64 frames for the frame size of 32 KBytes. This is caused by the accessing property of disk arrays and is expected. For example,

for large and very large frame sizes such as 64 and 128 KBytes, increasing the block size has a much less significant effect.


Figure 6: The effect of enlarging the video block to support 30-fps NTSC video using a single RAID 3.

Figure 7 depicts that when the requested block is small (e.g., 4 frames or 8 frames per block), increasing the frame size has little impact, since the number of supported concurrent accesses is already small before the frame size is doubled. However, as we can see in Figure 7, with block sizes larger than 8 frames, the increased data transfer time for large frame sizes becomes a major factor: the number of concurrent accesses is reduced significantly when the frame size is doubled.


Figure 7: The effect of enlarging the video frame size to support 30-fps NTSC video using a single RAID 3.

3.1.3 Supporting 60-fps HDTV Video

The high display rate imposed by HDTV video pushes the performance of a single RAID 3 to the limit. When a block size of 4 frames is used, a RAID 3 can only support 2 concurrent accesses (for 16 KByte frames), and only a single access for larger frame sizes. HDTV video with a frame size of 128 KBytes limits a RAID 3 to a single access for block sizes of 4, 8, and 16 frames. Two concurrent video streams can be supported for HDTV video with a frame size of 128 KBytes and a block size of 32 frames. The maximum number of HDTV video streams that a RAID 3 can support at a frame size of 128 KBytes is therefore 2. The reason is simply the 20 MBytes/sec limitation of the fast and wide SCSI-II channel: three HDTV streams, each with a frame size of 128 KBytes, require 3 x 128 KBytes x 60 = 22.5 MBytes/sec, which exceeds the maximum bandwidth of the fast and wide SCSI-II channel.
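The SCSI-II bandwidth argument can be checked directly (the helper name is ours):

```python
# Required sustained transfer rate for a set of identical streams, compared
# against the 20 MBytes/sec fast and wide SCSI-II channel.

SCSI_MB_PER_SEC = 20.0

def required_mb_per_sec(num_streams, frame_kb, fps):
    """Aggregate transfer rate in MBytes/sec for the given streams."""
    return num_streams * frame_kb * fps / 1024.0
```

Three 128 KByte, 60-fps HDTV streams need 22.5 MBytes/sec, exceeding the channel limit, while two such streams (15 MBytes/sec) fit within it.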

3.2 Performance Trends

Some performance trends can be observed from the experimental results for the different video types on a single RAID 3:

- For all three display speeds, when the frame size is very large (e.g., 128 KBytes) the effect of block size becomes less significant. However, the results show that more concurrent accesses can be supported with larger block sizes for small and medium video frames (e.g., 16 and 32 KBytes).

- In several cases the same data transfer rate is required but with different access frequencies. For example, the experiments with the same block size in Cases 3, 6 and 9 require the same data transfer rate, although they impose different access frequencies. It is not surprising that the one that makes more requests (e.g., Case 9, HDTV) supports fewer concurrent accesses: streams that access the disks more frequently issue smaller requests, and therefore incur proportionally more seek and rotational latency overhead while achieving the same transfer rate through a higher frequency of accesses.

4 Performance of Logical Volume Striping

Logical volume striping provides a system-supported mechanism to decompose a disk request into several sub-requests, one for each disk array. All sub-requests can be processed in parallel, which can potentially reduce the required access time. However, one requirement for achieving this parallelism is that a buffer large enough to receive the transferred data must be allocated. The size of a sub-request is given by Equation (1):

    S_subreq = (B x S_vf) / l_v,    (1)

where S_subreq is the block size of a sub-request for each RAID 3; B is the requested block size in frames; S_vf is the size of a video frame in KBytes; and l_v is the number of RAID 3s that constitute the logical volume. To achieve good performance, the sub-requests should be large enough; thus the value of B should depend on the value of l_v. The experiments in this section therefore have two control parameters, B and l_v. The experiments enlarge the block size B from 16, 32 and 64 to 128 frames for each l_v, and measure the maximal achievable number of concurrent accesses and the associated timing performance. The value of l_v ranges from 2 to 16 disk arrays. In addition to the total number of supported accesses, we also list the scalability gain (in percentage) and the average number of concurrent accesses supported per disk array. We define scalability as the performance improvement (in terms of the supported number of concurrent accesses) when more disk arrays are configured in the logical volume.
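Equation (1) can be sketched directly (an illustrative helper of ours, not part of the measured system):

```python
def sub_request_kb(block_frames, frame_kb, lv_width):
    """Equation (1): S_subreq = (B * S_vf) / l_v, in KBytes per RAID 3."""
    return block_frames * frame_kb / lv_width

# A 64-frame block of 32-KByte frames striped over a 4-wide logical volume:
print(sub_request_kb(64, 32, 4))  # 512.0 KBytes per array
```

The sketch makes the trade-off explicit: doubling the width halves the sub-request each array sees, which is why B must grow with l_v.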

4.1 Supporting 15-fps Animation Video

Table 3 lists the experimental results for supporting animation video. Since most animation videos impose only small and medium frame sizes, we performed the experiments with frame sizes of 16 KBytes and 32 KBytes; the performance for larger frame sizes should be similar to what we presented in Section 3. We also include the performance results from Section 3.1 (i.e., width = 1) for comparison. The table is arranged to reflect the scalability of the striping method. Ideal scalability means that the number of concurrent accesses increases proportionally to the number of disk arrays (i.e., the width). This ideal is unattainable, however, because of the increased contention for resources such as CPUs, the logical volume device driver, the physical device drivers and the storage devices. We list the scalability gain (gain), the increase in supported accesses over the previous row with the same block size, to show the actual measured scalability.

Table 3: Number of supported concurrent accesses for 15-fps animation video with logical volume striping. Entries marked * are memory-limited confirmation tests.

                          Requested Block Size (# frames)
                 16 frames       32 frames       64 frames       128 frames
Width  Frame   total gain  ave  total gain  ave  total gain  ave  total gain ave
  1    16 KB     32    -    32    44    -    44    58    -    58    67    -   67
  2    16 KB     40   25%   20    56   27%   28    92   59%   46   100*   -    -
  4    16 KB     48   20%   12    84   50%   21   136   48%   34   100*   -    -
  8    16 KB     52    8%  6.5    96   14%   12   178   31%  22.3  100*   -    -
  1    32 KB     22    -    22    28    -    28    33    -    33    35    -   35
  2    32 KB     22    0%   11    40   43%   20    58   76%   27    50*   -    -
  4    32 KB     40   82%   10    66   65%  16.5   90*   -     -    50*   -    -
  8    32 KB     48   20%    6    82   24%  10.3   90*   -     -    50*   -    -

For frame sizes of 16 KBytes and 32 KBytes, larger block sizes have a significant impact for animation video. Increasing the block size extends the display duration of each block and also increases the size of the sub-requests distributed among the disk arrays. Together these effects significantly increase the number of supported concurrent accesses. However, as the block size grows, the VOD system eventually consumes 80-85% of the physical memory space; this occurs, for example, with a block size of 64 frames, a frame size of 32 KBytes and 4-wide striping. The IRIX operating system in the Onyx machine shares the same physical main memory for system administration, requiring 10-15% of the memory space (512 MBytes in our case) for its administration tasks. The physical memory limitation prevented us from performing further experiments for the block size of 128 frames with the frame size of 16 KBytes, and for the block sizes of 64 and 128 frames with the frame size of 32 KBytes. Therefore, instead of searching for the maximal number of concurrent accesses with less than 1% average jitter ratio, we performed experiments that consume up to 80% of the physical memory and validated that all concurrent accesses still experience guaranteed display quality. The results of these confirmation tests are marked in the table.

Regarding scalability, the percentage gain is expected to shrink for each block size as the width is doubled, because the sub-requests become smaller as the width increases, making the performance gain less significant. The results generally follow this expectation. It is also expected that the percentage scalability gain for medium frame sizes should be better than for small frame sizes, since animation video with medium frames benefits more from additional disk arrays. This can be observed in the column for the block size of 32 frames with the frame size of 32 KBytes.
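A rough estimate of the buffer demand behind this memory ceiling can be sketched as follows. The paper does not state the exact buffering scheme, so the assumption of double buffering (one block being fetched while the previous block is displayed) is ours:

```python
MEM_MB = 512  # physical memory of the Onyx used in the experiments

def buffer_demand_mb(streams, block_frames, frame_kb, buffers_per_stream=2):
    """Rough per-stream buffer demand in MBytes (assumes double buffering)."""
    return streams * buffers_per_stream * block_frames * frame_kb / 1024.0

# 90 streams, 64-frame blocks, 32-KByte frames:
demand = buffer_demand_mb(90, 64, 32)
print(demand, demand / MEM_MB)  # 360.0 MBytes, about 70% of physical memory
```

Under this assumption, large-block configurations approach the machine's 512-MByte limit well before the storage system saturates, consistent with the memory-bound behavior reported above.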

4.2 Supporting 30-fps NTSC Video

Table 4 lists the experimental results for supporting NTSC video using logical volume striping.

Table 4: Number of supported concurrent accesses for 30-fps NTSC video with logical volume striping. Entries marked * are memory-limited confirmation tests.

                          Requested Block Size (# frames)
                 16 frames       32 frames       64 frames       128 frames
Width  Frame   total gain  ave  total gain  ave  total gain  ave  total gain ave
  1    32 KB     11    -    11    14    -    14    15    -    15    17    -   17
  2    32 KB     12    9%    6    22   57%   11    28   87%   14    32   88%  16
  4    32 KB     20   66%    5    30   36%  7.5    40   43%   10    50*   -    -
  8    32 KB     20    0%  2.5    38   27%  4.8    66   65%  8.3    50*   -    -
 16    32 KB     20    0%  1.3    44   16%  2.8    86   30%  5.4    50*   -    -

Figure 8 depicts the effect of different block sizes on supporting NTSC video. It shows that increasing the video block size improves the number of supported concurrent accesses. It is worth noting that increasing the block size without properly increasing the width yields only a minor increase in the number of supported concurrent accesses. This can be observed for the single RAID 3 (i.e., 1-wide) and the 2-wide logical volume striping in Figure 8: with 2-wide striping, the number of supported concurrent accesses levels off once the block size exceeds 64 frames. On the other hand, with the 4-wide, 8-wide and 16-wide logical volume stripings, the number of supported

concurrent accesses increases significantly at the block size of 64 frames. The physical memory limitation again prevented further experiments with the block size of 128 frames.


Figure 8: The effect of enlarging block sizes to support 30-fps NTSC video using logical volume striping.

Figure 9 depicts the scalability performance for supporting NTSC video. The scalability with a block size of 16 frames is poor: the number of supported concurrent accesses stays flat beyond 4 RAID 3s. Scalability improves when the block size is increased to 32 frames; however, when the width grows from 8 to 16 RAID 3s, the slope of the curve flattens, and the number of concurrent streams can be expected to remain the same beyond 16 RAID 3s. In general, block sizes of 32 frames or less generate a small sub-request for each RAID 3 and thus yield poor scalability. The block size of 64 frames shows the best (i.e., sub-linear) scalability, because a requested block of 64 frames generates sub-requests of 8 frames for 8-wide striping, 16 frames for 4-wide striping, and 32 frames for 2-wide striping. These sub-requests are large enough to provide stable scalability.

Figure 9: The scalability performance of logical volume striping to support the 30-fps NTSC video type.

Table 5 lists the timing performance with block sizes of 16 and 32 frames. The experimental results validate our expectation that larger block sizes in logical volume striping significantly reduce the average access time per video frame. As presented in Table 4, the block size of 16 frames with 2-wide logical volume striping supports only up to 12 concurrent accesses. Supporting 20 concurrent accesses therefore causes every stream to experience virtually 100% jitter, as observed in the Miss Loops % column of Table 5. The Missed Latency, 714.15 msec, represents the average access time for a block of 16 frames; the average access time per video frame is thus 714.15/16 = 44.63 msec.

When the block size is increased to 32 frames, every video stream has jitter-free quality with 20 concurrent accesses. The In-time Latency, 341.44 msec, now represents the average access time for a block of 32 frames, so the average access time per video frame is 341.44/32 = 10.67 msec. This four-fold improvement, from 44.63 msec to 10.67 msec, is mainly due to the increased data transfer rate on each RAID 3.

Table 5: Timing performance of 2-wide logical volume striping using block sizes of 16 and 32 frames to support NTSC video.

Block size  Streams  Loops  Missed-deadline    Miss Loops  Latency (ms)
(frames)                                       (%)         In-time  Missed
   16          12     500    0.50 (0-1)          0.10%      166.17   615.27
   16          20     500    497.25 (497-498)   99.45%      280.33   714.15
   32          20     250    0.00 (0-0)          0.00%      341.44     0.00
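The per-frame access times quoted above follow from a one-line computation over the Table 5 latencies (a trivial sketch; the function name is ours):

```python
def per_frame_access_ms(block_latency_ms, block_frames):
    """Average access time per video frame, given the latency of one block."""
    return block_latency_ms / block_frames

print(round(per_frame_access_ms(714.15, 16), 2))  # 44.63 msec (16-frame blocks)
print(round(per_frame_access_ms(341.44, 32), 2))  # 10.67 msec (32-frame blocks)
```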

4.3 Supporting 60-fps HDTV Video

Table 6 lists the experimental results for supporting HDTV video using logical volume striping. HDTV video usually imposes large and very large frame sizes; we therefore chose 64 KBytes and 128 KBytes as typical sizes of compressed HDTV video frames. The experimental results show only a minor performance improvement from enlarging the block size, mainly because of the already-large frame sizes and the short playback duration. The results also suggest that to support HDTV video, the 4-MByte limit on the maximal read block size in the current IRIX implementation should be relaxed. Relaxing this limit might allow further improvement with a block size of 128 frames at a frame size of 64 KBytes, and with block sizes of 64 frames or more at a frame size of 128 KBytes. It is interesting to observe that the frame size of HDTV video has a major impact on the

Table 6: Number of supported concurrent accesses for 60-fps HDTV video with logical volume striping. NA: the request exceeds the 4-MByte maximal read size of IRIX.

                          Requested Block Size (# frames)
                 16 frames       32 frames       64 frames      128 frames
Width  Frame   total gain  ave  total gain  ave  total gain  ave  total
  1    64 KB      3    -     3     3    -     3     4    -     4    NA
  2    64 KB      5   67%  2.5     5   67%  2.5     6   50%    3    NA
  4    64 KB      6   20%  1.5     8   60%    2    10   67%  2.5    NA
  8    64 KB      6    0%  0.8     8    0%    1    10    0%  1.3    NA
  1   128 KB      1    -     1     2    -     2    NA               NA
  2   128 KB      1    0%  0.5     2    0%    1    NA               NA
  4   128 KB      4  300%    1     6  200%  1.5    NA               NA
  8   128 KB      4    0%  0.5     6    0%  0.8    NA               NA

scalability performance. For the frame size of 64 KBytes, the number of supported concurrent accesses increases significantly (50-67%) when the width grows from 1 to 2 RAID 3s. For the frame size of 128 KBytes, however, a comparable gain is not achieved until the width grows from 2 to 4 RAID 3s. The reason is mainly the contention caused by the high access frequency of HDTV video. Notice that all the disk arrays within the logical volume are accessed exclusively by one stream at a time, and the access frequency of HDTV video does not change no matter how many disk arrays are in the logical volume. Figure 10 depicts the contention on the logical volume device driver among 6 HDTV streams with a frame size of 128 KBytes and a block size of 32 frames using 8-wide logical volume striping. The timing information presented in this figure was obtained from our experiments.


Figure 10: Contention of the first access on the logical volume using 8-wide logical volume striping to support the 60-fps HDTV video type.

The figure demonstrates that highly parallel processing can be achieved and that each stream's request can be served very quickly. Although these 6 HDTV streams are issued in order, from stream #0 to #5, the actual execution order differs. In a device driver such as the logical volume, the order in which I/O requests are issued may not match the order in which the driver receives the I/O-completion interrupts. For example, stream #5 started executing its I/O command later than stream #0 but finished earlier. Since


Figure 11: Timing performance for the first 8 seconds supporting 6 HDTV video streams using 8-wide logical volume striping with a block size of 32 frames and a frame size of 128 KBytes.


Figure 12: Timing performance for the first 8 seconds supporting 8 HDTV video streams using 8-wide logical volume striping with a block size of 32 frames and a frame size of 128 KBytes.

the logical volume uses all the disk arrays in parallel, the logical volume can serve one active request and must queue the others. Figure 10 actually represents the best-case scenario, since all concurrent accesses seize the logical volume with minimal queueing delays. It should be pointed out, however, that as video display proceeds, queueing delays accumulate until they become significant enough to cause jitters. To demonstrate the effect of the accumulated queueing delays, we analyzed the interaction of 6 and 8 HDTV streams using 8-wide striping with a block size of 32 frames and a frame size of 128 KBytes. We adopt a digital waveform representation in which 'high' (i.e., 1) denotes that the stream is sending a read request and 'low' (i.e., 0) denotes that the stream is idle. To make the figures readable, we use different heights for different streams. Figure 11 illustrates the timing of 6 HDTV video streams: although the concurrent accesses overlap to some degree, the logical volume still serves them stably with acceptable queueing delay, and all 6 streams experienced no jitters. When 8 HDTV video streams are supported, however, Figure 12 illustrates that the accumulated queueing delays took effect very quickly within the first 8 seconds, eventually causing 11-13 jitters in every stream. These results suggest that to support HDTV video with large frame sizes using logical volume striping, the width should be kept small (e.g., 4).


5 Performance of Application Level Striping

Logical volume striping suffers from contention on the logical volume device driver. It also requires a large buffer memory to achieve linear scalability and fast disk accesses, at the potential cost of the maximal number of concurrent accesses the VOD system can support. A more effective allocation scheme is needed to increase the number of supported concurrent accesses with a smaller buffer size. We proposed such a scheme, called application level striping, in [19]. In application level striping, the video files are divided into blocks and stored on the constituent storage devices (e.g., RAID 3s) in a round-robin fashion. After de-clustering a video file across multiple RAID 3s, each video retrieval process (i.e., application) retrieves the video blocks in a pipelined manner, accessing only one RAID 3 at a time. The load balancing that results from distributing the concurrent accesses in this way improves performance. Our previous study [19] used 30-fps NTSC video with a fixed frame size of 32 KBytes and a block size of 16 frames; how application level striping supports the other video types with varying frame and block sizes was not considered there. In this section we use six striping widths (1, 2, 4, 8, 16 and 24) and block sizes from 16, 32 and 64 to 128 frames. Since we already demonstrated in [19] that application level striping over physical RAID 3s (i.e., 1-wide) yields the best performance, no logical volume is used in this section.
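The placement and retrieval pattern described above can be sketched as follows (an illustrative model of ours, not the experimental code; streams are assumed to start at the first array, as in the experiments):

```python
# Application level striping: blocks are placed round-robin across RAID 3s,
# and each retrieval process reads exactly one array per block, in a pipeline.

def place_blocks(num_blocks, width, start_array=0):
    """Round-robin placement: block i lives on array (start + i) mod width."""
    return [(i, (start_array + i) % width) for i in range(num_blocks)]

def retrieval_order(num_blocks, width, start_array=0):
    """Sequence of arrays one stream touches, one array per request."""
    return [array for _, array in place_blocks(num_blocks, width, start_array)]

print(retrieval_order(6, 4))  # [0, 1, 2, 3, 0, 1]
```

Because each request touches a single array, concurrent streams naturally spread across the arrays after their first few requests, which is the load-balancing effect exploited here.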

5.1 Supporting 15-fps Animation Video

Table 7 lists the experimental results for supporting animation video using application level striping.

Table 7: Number of supported concurrent accesses for 15-fps animation video with application level striping. Entries marked * are memory-limited confirmation tests.

                          Requested Block Size (# frames)
                 16 frames        32 frames        64 frames       128 frames
Width  Frame   total gain  ave   total gain  ave   total gain ave   total gain ave
  1    16 KB     32    -    32     44    -    44     58    -   58     67    -   67
  2    16 KB     64  100%   32     92  109%   46    114   97%  57    104   55%  52
  4    16 KB    116   81%   29    176   91%   44    180*   -    -    100*   -    -
  8    16 KB    224   93%   28    324   84%  40.5   180*   -    -    100*   -    -
 16    16 KB    420   87%  26.3   420   30%  26.3   180*   -    -    100*   -    -
 24    16 KB    640*   -    -     360*   -    -     180*   -    -    100*   -    -
  1    32 KB     22    -    22     28    -    28     33    -   33     35    -   35
  2    32 KB     44  100%   22     58  107%   29     64   94%  32     50*   -    -
  4    32 KB     86   95%  21.5   106   83%  26.5    90*   -    -     50*   -    -
  8    32 KB    166   93%  20.8   202   91%  25.3    90*   -    -     50*   -    -
 16    32 KB    360*   -    -     180*   -    -      90*   -    -     50*   -    -
 24    32 KB    360*   -    -     180*   -    -      90*   -    -     50*   -    -

Enlarging the block size has the expected effect of steadily and significantly increasing the number of supported concurrent accesses.

The scalability of application level striping for animation video is excellent compared to logical volume striping. Although the percentage scalability gain still shows a decreasing trend as the width increases, it decreases far less than with logical volume striping. The reason is that there is no need to divide a request into sub-requests, which avoids the low data transfer rates caused by the reduced sub-request sizes observed with logical volume striping. The average number of concurrent accesses per disk array is also more stable than with logical volume striping. For example, application level striping supports at least 26.3 concurrent accesses on average per disk array for a block size of 16 frames with a frame size of 16 KBytes, and at least 20.8 per disk array with a frame size of 32 KBytes. By comparison with Table 3, logical volume striping supports only 6.5 concurrent accesses per disk array using 8-wide striping for a block size of 16 frames with a frame size of 16 KBytes. These observations indicate that the load balancing achieved by application level striping greatly benefits animation video. However, performance is still bounded by the physical memory space, and because of the excellent scalability, this bottleneck occurs earlier than with logical volume striping.

5.2 Supporting 30-fps NTSC Video

Table 8 lists the experimental results for supporting NTSC video using application level striping.

Table 8: Number of supported concurrent accesses for 30-fps NTSC video with application level striping. Entries marked * are memory-limited confirmation tests.

                          Requested Block Size (# frames)
                 16 frames        32 frames        64 frames        128 frames
Width  Frame   total gain  ave   total gain  ave   total gain  ave   total gain ave
  1    32 KB     11    -    11     14    -    14     15    -    15     17    -   17
  2    32 KB     20   82%   10     28  100%   14     32  113%   16     34  100%  17
  4    32 KB     36   80%    9     54   93%  13.5    52   63%   13     52   53%  13
  8    32 KB     58   61%   7.3    96   78%   12     96   85%   12     60*   -    -
 16    32 KB    120  106%   7.5   136   42%   8.5   120*   -     -     60*   -    -
 24    32 KB    220   83%   9.2   250*   -     -    120*   -     -     60*   -    -

The effect of enlarging block sizes is depicted in Figure 13. Extending the block size from 16 to 32 frames significantly increases the number of supported concurrent accesses. For block sizes of 64 and 128 frames, however, the increase is very minor: the disk request has grown so large that the data transfer rate no longer increases significantly, so a long data transfer time is incurred for large blocks. This in turn introduces a long queueing delay, since a RAID 3 can serve only one stream at a time. Consider the following analytical constraint on the time components experienced by each access request:

    T_os + T_cmd + T_seek + T_rot + T_xfer <= B x T_display,    (2)


Figure 13: The effect of enlarging block size to support 30-fps NTSC video using application-level striping.

and

    T_xfer = (B x S_vf) / R_dt,    (3)

where B is the requested block size; T_display is the display duration of each video frame; T_os is the operating system overhead; T_cmd is the time required by the storage system to process the request; T_seek is the time it takes the RAID to seek from one end to the other; T_rot is the time to complete one rotation of the RAID; and T_xfer is the time it takes to transfer the requested video block across the SCSI-II I/O channel. The data transfer rate R_dt itself depends on the request size for RAID 3, so the block size affects R_dt and thus the T_xfer component. The randomness introduced by the different offsets of concurrent accesses mainly influences the T_seek and T_rot components. Since all requests are issued independently, T_cmd includes a dynamic queueing effect on the waiting time. When the requested block size B is increased, T_cmd and T_xfer increase accordingly. To illustrate the combined effects of the T_cmd and T_xfer components, we performed another experiment with a single access and with ten concurrent accesses on 8-wide application level striping. Table 9 lists the timing results. With a single access, the In-time Latency is close to T_xfer plus the minimal (i.e., unavoidable) overhead of the other components, and it demonstrates an almost linear relationship with the block size. We can therefore compare the ten-access cases against the single-access timings. For example, supporting 10 concurrent accesses with a block size of 16 frames requires 103.41/33.24 = 3.1 times the average access time of a single access; for block sizes of 32, 64 and 128 frames, the ratios are 240.48/67.32 = 3.57, 526.06/124.54 = 4.22 and 1109.25/240.62 = 4.6. This increasing ratio demonstrates that as the block size becomes large, the negative effect of the long data transfer time and the

Table 9: Timing performance for application level striping with 1 and 10 concurrent accesses using 8-wide RAID 3.

Width  Block Size  Streams  Loops  Missed-   Miss Loops  Latency (ms)
       (frames)                    deadline  (%)         In-time   Missed
  8       16          1      500     0.00      0.00%       33.24     0.00
  8       32          1      250     0.00      0.00%       67.32     0.00
  8       64          1      125     0.00      0.00%      124.54     0.00
  8      128          1       63     0.00      0.00%      240.62     0.00
  8       16         10      500     0.00      0.00%      103.41     0.00
  8       32         10      250     0.00      0.00%      240.48     0.00
  8       64         10      125     0.00      0.00%      526.06     0.00
  8      128         10       63     0.00      0.00%     1109.25     0.00

associated queueing delay dominates the positive effect of the prolonged deadline. Figure 14 depicts the scalability performance of application level striping. With block sizes of 16 and 32 frames, close-to-linear scalability is achieved; notice that with logical volume striping, such scalability was not achieved until the block size reached 64 frames.
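The admission constraint expressed by Equations (2) and (3) can be sketched as follows. This is a rough model only; the combined overhead term and the transfer rate are hypothetical inputs, not values taken from the experiments:

```python
def transfer_ms(block_frames, frame_kb, rate_mb_per_s):
    """Equation (3): T_xfer = B * S_vf / R_dt, converted to milliseconds."""
    return (block_frames * frame_kb / 1024.0) / rate_mb_per_s * 1000.0

def deadline_met(block_frames, frame_kb, rate_mb_per_s,
                 overhead_ms, display_ms_per_frame):
    """Equation (2): total service time must not exceed B * T_display."""
    total = overhead_ms + transfer_ms(block_frames, frame_kb, rate_mb_per_s)
    return total <= block_frames * display_ms_per_frame

# A 32-frame block of 32-KByte NTSC frames (33.3 ms/frame) at an assumed
# 10 MBytes/sec with 50 ms of combined overhead:
print(deadline_met(32, 32, 10.0, 50.0, 1000.0 / 30))  # True
```

The sketch shows why larger blocks extend the deadline (the right-hand side grows with B) while simultaneously inflating T_xfer, the trade-off discussed above.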


Figure 14: The scalability performance of application-level striping to support the 30-fps NTSC video type.

In earlier sections, we predicted that concurrent accesses using application level striping would be distributed among the disk arrays. The average number of concurrent accesses per disk array shows that the achieved load balancing is very close to optimal: with a block size of 128 frames, it approaches the 17 concurrent accesses achieved on a single disk array. Table 10 illustrates the timing performance of application level striping for 30-fps NTSC video. To see how adopting multiple disk arrays improves disk access time, compare the three experimental results with a block size of 128 frames for 8-, 16- and 24-wide striping. Since 60 concurrent accesses consumed 80-85% of the physical memory space, the In-time Latency reflects the effect of adopting more disk arrays in application level striping.

With 16-wide striping, the average access time is reduced by 1280.70 - 878.43 = 402.27 msec, a latency reduction of about 30%. Similarly, going from 16- to 24-wide striping reduces it by a further 878.43 - 500.37 = 378.06 msec, a reduction of about 43%.

Table 10: Timing performance for application level striping.

Width  Block Size  Streams  Loops  Missed-   Miss Loops  Latency (ms)
       (frames)                    deadline  (%)         In-time    Missed
  2       16          20      500    0.15      0.03%      216.76     603.46
  2       32          28      250    0.46      0.19%      592.50    1186.00
  2       64          32      125    0.91      0.73%      910.47    2459.31
  2      128          34       63    0.32      0.51%     1867.93    5917.12
  4       16          36      500    0.94      0.19%      208.21     574.69
  4       32          54      250    0.00      0.00%      354.69       0.00
  4       64          52      125    0.06      0.05%      639.42    2288.76
  4      128          52       63    0.37      0.58%     1842.01    4730.51
  8       16          58      500    1.95      0.39%      151.87     643.45
  8       32          96      250    1.55      0.62%      442.93    1294.89
  8       64          96      125    0.52      0.42%      644.52    2587.18
  8      128          60       63    0.18      0.29%     1280.70    4778.42
 16       16         120      500    2.10      0.42%      134.87     714.13
 16       32         136      250    2.07      0.83%      220.82    1614.49
 16       64         120      125    1.55      1.24%      565.14    2315.99
 16      128          60       63    0.15      0.24%      878.43    4356.55
 24       16         220      500    2.78      0.56%       93.26     918.94
 24       32         250      250    1.69      0.68%      206.29    1960.52
 24       64         120      125    3.34      2.67%      764.90    2357.74
 24      128          60       63    0.00      0.00%      500.37       0.00

5.3 Supporting 60-fps HDTV Video

Table 11 lists the experimental results for supporting HDTV video with application level striping.

Table 11: Number of supported concurrent accesses for 60-fps HDTV video with application level striping. NA: the request exceeds the 4-MByte maximal read size of IRIX.

                          Requested Block Size (# frames)
                 16 frames        32 frames        64 frames       128 frames
Width  Frame   total gain  ave   total gain  ave   total gain ave   total
  1    64 KB      3    -     3      3    -     3      4    -    4     NA
  2    64 KB      6  100%    3      6  100%    3      8  100%   4     NA
  4    64 KB     12  100%    3     12  100%    3     14   75%  3.5    NA
  8    64 KB     20   67%  2.5     24  100%    3     24   71%   3     NA
 16    64 KB     36   80%  2.3     46   92%  2.9     58  142%  3.6    NA
 24    64 KB     40   11%  1.7     46    0%  1.9     58    0%  2.4    NA
  1   128 KB      1    -     1      2    -     2     NA               NA
  2   128 KB      2  100%    1      3   50%  1.5     NA               NA
  4   128 KB      6  200%  1.5      6  100%  1.5     NA               NA
  8   128 KB      7   17%  0.9      8   33%    1     NA               NA
 16   128 KB      7    0%  0.4      8    0%  0.5     NA               NA
 24   128 KB      8   14%  0.3      8    0%  0.3     NA               NA

For the frame size of 64 KBytes, application level striping demonstrates a very significant improvement for HDTV video compared to logical volume striping; the scalability gain is between 67% and 142%. However, the gain becomes minor when the width grows from 16 to 24 RAID 3s for block sizes of 32 and 64 frames. The reason is contention on the first few disk-array accesses: since we placed every video file starting on the first disk array, all concurrent accesses begin by requesting their first video block from that array. The contention resulting from the high access frequency of HDTV video therefore prevents any further increase in the number of supported concurrent accesses. We explain this behavior later using the frame size of 128 KBytes and the block size of 32 frames. For the frame size of 128 KBytes, application level striping demonstrates only a minor improvement over logical volume striping. This limited improvement stems from the very large disk requests, which cause very long data transfer times, and from the short display duration imposed by the HDTV video type. Because the data transfer time is long and the interval until the next request is very short, all concurrent accesses compete for the storage devices at a high frequency, especially for the first few requests. Consequently, the average disk access delay of each concurrent access consists mainly of the queueing delay incurred while competing for the RAID 3s, plus the large transfer time once a storage device is seized. Since no scheduling policy is adopted, all concurrent accesses start their requests from the first disk array. The load balancing feature of application level striping performs well for animation and NTSC video.
It also performs well for HDTV video with a frame size of 64 KBytes up to 16 disk arrays, and with a frame size of 128 KBytes up to 4 disk arrays. Consider the case of 4-wide application level striping: 4 fast-and-wide SCSI-II channels offer an aggregate bandwidth of only 4 x 20 MB/sec = 80 MB/sec. HDTV video with a frame size of 128 KBytes requires 7.5 MBytes/sec per stream, so the upper bound on the number of supported concurrent accesses in this case is floor(80/7.5) = 10. Application level striping sustained 60% of this upper bound (i.e., 6 HDTV streams).


Figure 15: Contention on the first disk array using 16-wide application level striping to support 60-fps HDTV video (with a block size of 32 frames and a frame size of 128 KBytes).

Figure 15 depicts the contention on the first disk array using 16-wide application level striping

to support 60-fps HDTV video with a block size of 32 frames. It shows that the first three concurrent accesses obtain their first blocks from the first disk array in a pipelined manner, while the remaining concurrent accesses experience queueing delays. Notice that the competition among concurrent streams for the first few disk accesses is usually the worst in application level striping. As display continues, the load balancing feature of application level striping eventually smooths all the concurrent accesses across the multiple RAID 3s. For animation and NTSC video, this smoothing happens very quickly. However, the rigid high display rate of HDTV video with a frame size of 128 KBytes makes the smoothing effect less significant than for NTSC and animation video. We observed that the jitters experienced by these concurrent HDTV streams were always caused mainly by contention for the first few disk accesses. Such jitters could be reduced by two methods. The first is to replicate the video files and place the starting blocks on different disk arrays, so that contention for the same video file is distributed across arrays. This method sacrifices disk-space efficiency, but the duplication allows more concurrent accesses to be supported. The second is to adopt a start-up schedule that explicitly postpones the start of some concurrent accesses to reduce contention. This method requires no replication and should therefore achieve better disk utilization, but some streams suffer a long start-up delay. These techniques and their tradeoffs are currently under investigation; our early results demonstrate that with them, 14 instead of 8 HDTV streams can be supported with 8-wide striping for a block size of 32 frames and a frame size of 128 KBytes.
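The start-up scheduling idea can be illustrated with a minimal sketch. This is a hypothetical policy of ours (a fixed stagger between stream starts); the actual scheduler under investigation is not specified in this paper:

```python
# Staggered start-up: postpone each new stream so that first-block requests
# do not all hit the first disk array at the same instant.

def staggered_starts(num_streams, stagger_ms):
    """Start-time offset for each stream; stream i waits i * stagger_ms."""
    return [i * stagger_ms for i in range(num_streams)]

print(staggered_starts(4, 250))  # [0, 250, 500, 750]
```

A stagger on the order of one block's service time would let each stream drain off the first array before the next stream arrives, at the cost of a longer start-up delay for later streams.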

6 Concluding Remarks

We are among the first few groups to adopt an experimental study to understand the design and performance issues of a large-scale VOD server. Our previous work and this study verify the potential of using a general computing machine as a VOD server. Such a VOD server needs to serve three different video types: 15-fps animation, 30-fps NTSC and 60-fps HDTV videos. The impacts of these different video types on the mass storage system are reported in this paper.

First, the 15-fps animation video type is well supported by both allocation schemes; however, scalability is significantly better with application level striping than with logical volume striping. Second, using logical volume striping to support 30-fps NTSC video requires careful consideration of the step sizes. Larger block sizes (e.g., 64 frames) generate sufficiently large sub-requests that sub-linear scalability can be achieved. Application level striping again improves on logical volume striping when supporting NTSC video. The load balancing achieved on the mass storage system by application level striping is excellent, and the performance results suggest that smaller block sizes (e.g., 16 and 32 frames) should be adopted for application level striping. Third, logical volume striping does not scale well for 60-fps HDTV video. Major improvements for HDTV video with a frame size of 64 KBytes can be obtained by adopting application level striping with up to 16 RAID 3s.

This experimental study also suggests relaxing the maximum request size limit imposed by the IRIX 5.2 operating system. Our experimental results further demonstrate that physical memory space prohibits additional improvement even though the storage system is capable of supporting more streams. Physical memory buffer space therefore becomes an important resource that needs to be allocated and managed more efficiently.
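To see why buffer memory becomes the binding resource, consider a back-of-the-envelope calculation (our illustration: the double-buffering count and the 256 MB pool size are assumptions, not figures from the experiments):

```python
# Rough buffer-memory budget per HDTV stream (illustrative assumptions).
FRAME_KB = 128          # HDTV frame size used in the experiments
BLOCK_FRAMES = 32       # block size of 32 frames
BUFFERS_PER_STREAM = 2  # assumed: one block displaying, one prefetching

block_mb = FRAME_KB * BLOCK_FRAMES / 1024       # 4 MB per block
per_stream_mb = block_mb * BUFFERS_PER_STREAM   # 8 MB per stream

def max_streams(memory_mb):
    """Streams supportable by buffer memory alone (an illustration)."""
    return int(memory_mb // per_stream_mb)

print(per_stream_mb)     # 8.0 MB of buffer space per HDTV stream
print(max_streams(256))  # e.g. 32 streams fit in a 256 MB buffer pool
```

Even when the disk arrays can deliver more streams, a fixed buffer pool of this kind caps the number of concurrent HDTV accesses, which is why buffer allocation and management matter.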

Acknowledgment

The authors wish to express their sincere gratitude to Tom Ruwart, Jon Buerge, Russel Cattelan and Jeff Stromberg at the Army High-Performance Computing and Research Center, Dr. Ronald Vetter at North Dakota State University, and James Schnepf, Yen-Jen Lee, Horng-Juing Lee and Harish Vedavya at the University of Minnesota for their valuable comments and support.
