Pipelined Disk Arrays for Digital Movie Retrieval
Ariel Cohen, Walter A. Burkhard, P. Venkat Rangan
Gemini Storage Systems Laboratory & Multimedia Laboratory
Department of Computer Science & Engineering
University of California, San Diego, La Jolla, CA 92093-0114
Abstract
We develop a reliable disk array based storage architecture for digital video retrieval. Our goals are twofold: maximizing the number of concurrent real-time sessions while minimizing the buffering requirements, and ensuring a high degree of reliability. The first goal is achieved by adopting a pipelined approach and by reducing latencies through specialized disk caching and constrained data placement schemes. The second goal is achieved by dividing the disks into RAID 3 reliability groups which serve as pipeline stages. We note that the buffering requirement decreases as the number of groups increases. To improve the performance further, we introduce two techniques for more efficient movie retrieval: on arrival caching, and interleaved annular layout. We present a case study of the performance of these techniques which shows a significant improvement when they are incorporated.
1 Introduction
Storage servers for video retrieval typically consist of an array of high-performance disks which is connected to display stations over a high-bandwidth network (see [3] for a study of a possible network architecture). The system uses memory either at the server or at the display stations (or both) for buffering. Naturally, the goal is to serve as many concurrent users as possible given the available resources. We will focus on increasing the number of concurrent streams supported by the array and decreasing the buffering requirements while maintaining a high degree of reliability. We consider a video retrieval workload; that is, the system operates in an essentially read-only environment in which the task of the storage server is to retrieve movies from the disk array for consumption by the display units. No editing is allowed. The disk array will be devoted to video data; other files (e.g., system files) will be stored in separate areas. The central motivation for our work is the observation that increasing the number of disks in a RAID 3 results in diminishing returns for our workload of small reads, since adding disks improves the transfer rate without reducing the other latencies, which will dominate. To address this issue, we use a disk caching scheme to reduce rotational latencies to a minimum, a constrained layout scheme to reduce seek times, and a pipelined approach which breaks up the RAID 3 into a number of smaller RAID 3's to improve concurrency.

Work supported in part by grants from NCR, Wichita, Kansas, SAIC, San Diego, California, and the University of California MICRO program.
2 Related work
The issue of reducing the buffering requirement and maximizing the number of concurrent sessions in a multimedia retrieval environment is addressed in [4] and [9]. In [4] the sorting-set algorithm is proposed, and in [9] the Grouped Sweeping Scheme (GSS) is introduced. These schemes are functionally equivalent techniques which are based on the idea of dividing the streams into sets which are serviced in a fixed order; the reads for the streams within each set are sorted so that they can be serviced in one sweep of the disk arm. The number of sets is a parameter which affects the buffering requirement. The data layout is unconstrained, no special caching is attempted, and no redundancy is introduced into the system for reliability purposes. The schemes which we propose in this paper can be used in conjunction with GSS (or the sorting-set algorithm) in a straightforward manner. An important aspect of disk array storage servers for video is load-balancing. The goal is to prevent situations in which certain disks become overloaded (because they contain the only copy of a popular movie, for example). There are three approaches to this problem. One approach is to replicate movies according to their popularities in an attempt to achieve load-balancing; such a scheme for a video-on-demand system is proposed in [5]. A second approach is to stripe
all the data across the entire array such that all disks participate in all reads; this approach is pursued in [7]. A third approach is to store parts of all movies on all disks; limited striping is used only if necessary to achieve the bandwidth required by the streams. Such an approach for uncompressed video is described in [1]. In this paper, we adopt a middle ground between the last two approaches.
3 The basic approach
This section describes the basic layout and reading policies which underlie the techniques introduced in later sections. Many of the details are omitted here due to a lack of space. See [2] for a full presentation and a detailed analysis.
3.1 The basic layout

The disk array contains D data disks which are divided into G equal size groups of D/G data disks and one parity disk; thus, the total number of disks in the array is D + G. Parity disks store the parity of the data stored in the data disks in their group. The disk array stores a number of movies. Each movie is divided into segments. We can divide the movies into segments of equal size or equal display time. These two policies are of course equivalent when fixed rate compression (or no compression) is used. However, in the case of variable bit rate compression the display time of a fixed amount of data can vary, so segments of equal display time will generally not be of the same size. Each segment is striped across an entire group. That is, equal size fragments of each segment appear on each data disk in the group at the same location, so that a segment can be read in parallel by the entire group with synchronized arms. The granularity of the striping unit does not matter for our purposes; a bit, byte, sector, or other striping unit can be chosen. The fragments of segments are stored contiguously on each disk; i.e., no seeks (other than track to track seeks) are required while reading a segment. Each group can be considered a virtual disk with D/G times the capacity and bandwidth of a single disk. Note that each group is essentially a RAID 3 disk array. Also note that when G = 1 the data is striped across the entire array, and when G = D no striping exists and the parity disks mirror the data disks. Segments are assigned to groups in a round-robin fashion: segment s of each movie is striped across group s mod G (the groups are numbered 0, ..., G-1). When a movie is retrieved, its segments are read in this round-robin fashion, so we can consider each group to be a service station in a pipeline. Note that all movies are evenly dispersed across all disks, so no hot spots can develop even when some movies are much more popular than others. Figure 1 illustrates the basic layout. In this example, D = 4 and G = 2. Dx.y.z denotes fragment z of segment y of movie x, and Px.y denotes the parity of the two fragments of segment y of movie x (i.e., Dx.y.1 XOR Dx.y.2). The data within rectangles is read in parallel by all disks in the group.
[Figure 1: Basic Layout — the fragments and parity of the first four segments of Movies 1 and 2 laid out across the two groups; e.g., Group 0 holds D1.1.1, D1.1.2, P1.1 (Movie 1, Segment 1) and D1.3.1, D1.3.2, P1.3 (Movie 1, Segment 3), while Group 1 holds D1.2.1, D1.2.2, P1.2 (Movie 1, Segment 2) and D1.4.1, D1.4.2, P1.4 (Movie 1, Segment 4), and likewise for Movie 2.]
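To make the layout concrete, the following Python sketch (ours, not from the paper; the names split_segment, group_of_segment, and reconstruct_fragment are illustrative) computes the round-robin group assignment and the per-disk fragments plus parity for one segment, and shows the single-disk reconstruction that the parity enables.

    # Sketch of the basic layout: segment s of every movie goes to group
    # s mod G; within a group the segment is split into D/G equal fragments,
    # and the parity fragment is their bytewise XOR.
    def split_segment(segment: bytes, disks_per_group: int):
        """Stripe one segment into equal fragments plus a parity fragment."""
        assert len(segment) % disks_per_group == 0
        size = len(segment) // disks_per_group
        fragments = [segment[i * size:(i + 1) * size] for i in range(disks_per_group)]
        parity = bytearray(size)
        for frag in fragments:
            for i, b in enumerate(frag):
                parity[i] ^= b
        return fragments, bytes(parity)

    def group_of_segment(s: int, num_groups: int) -> int:
        """Round-robin assignment: segment s is striped across group s mod G."""
        return s % num_groups

    def reconstruct_fragment(surviving_fragments, parity: bytes) -> bytes:
        """RAID 3 recovery: XOR the parity with the surviving data fragments."""
        missing = bytearray(parity)
        for frag in surviving_fragments:
            for i, b in enumerate(frag):
                missing[i] ^= b
        return bytes(missing)

    # Example with D = 4 data disks and G = 2 groups, as in Figure 1.
    D, G = 4, 2
    segment = bytes(range(64))                       # one segment of some movie
    fragments, parity = split_segment(segment, D // G)
    assert reconstruct_fragment(fragments[1:], parity) == fragments[0]
    print("segment 5 is striped across group", group_of_segment(5, G))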
Note that each group can sustain up to one disk failure, since the data which is missing due to the failed disk can be reconstructed by computing the parity of the data read from the functioning disks.¹ Moreover, no degradation of performance occurs if the parity can be computed fast enough. The preservation of performance under failure is a crucial property in our environment. Clearly, a failure in a read-only movie server is not likely to result in a loss of data since other copies of the movie exist elsewhere. However, a disk failure can render the entire server useless until the disk is replaced and reconstructed. With the RAID 3 scheme, the server can sustain up to one disk failure in each group without any loss of service. Any scheme with degraded performance under failure (e.g., RAID 5) would result in a need to drop at least some of the sessions, which we consider unacceptable. Based on the analysis in [2] it is easy to show that if we fix all parameters except the number of groups, then the buffering requirement decreases as the number of groups increases (see also Section 6). However, if we wish to include a parity disk in each group, then the number of required parity disks may be excessive if the number of groups is large. There are two possible approaches to locating the segments within each group: one is to disperse the segments in an arbitrary fashion, and the other is to constrain their location in a way that would reduce seek latencies. We will show that the regular nature of video accesses, coupled with some constraints on start-up times, can make constrained layout attractive. We now describe the buffering and reading policies. These differ somewhat depending on whether equal size or equal display time segments are used.

¹ If there is no failure, no parity computation is performed, and the data read from the parity disk is discarded.
3.2 Fixed display time segments
Movies are divided into equal display time segments; in other words, the segments contain the same number of frames (but segment sizes may vary due to VBR compression). The segments of each stream are read from the disk groups in a round-robin fashion. Time during playback is divided into reading cycles during which exactly one segment is read for each stream from some group. Each reading cycle concurrently serves G groups of streams which we call cohorts. Each stream is a member of exactly one cohort in each reading cycle. We can think of cohorts as tasks to be performed by a circular pipeline. During a reading cycle, each group serves one cohort, and then the cohorts move to the next group, where they will be served during the next reading cycle. The maximum number of streams in a cohort is fixed to permit a certain fixed maximum number of streams to be serviced. When the number of streams in a cohort is lower than this maximum number, we say that the cohort contains one or more free slots. When the display of a new stream needs to be initiated, the system waits until a cohort with a free slot is about to be served by the group where the first segment of the requested movie resides; the new stream is then incorporated into this cohort. When a stream ends, it is dropped from its cohort; this results in a free slot which can be used to initiate a new stream. Note that once a stream is assigned to a cohort, it remains a member of that cohort until its display is finished. Figure 2 shows examples of reading cycles along with their cohorts in a server with two groups. In this example, a cohort may contain up to 4 streams. Cohort 0 is served by group 0 during reading cycle t, group 1 during reading cycle t+1, and group 0 again during reading cycle t+2. Cohort 1 is served by group 1 during reading cycle t, group 0 during reading cycle t+1, and group 1 during reading cycle t+2. A new stream (S8) is incorporated into cohort 1 during reading cycle t+1. Note that the order of reads may vary from reading cycle to reading cycle; this flexibility enables us to use seek optimization algorithms.

[Figure 2: Reading Cycles — streams S1-S8 assigned to cohorts 0 and 1 as the cohorts are served by groups 0 and 1 over reading cycles t, t+1, and t+2.]

During each reading cycle, for each stream, the system reads the next segment from the group serving the stream's cohort into the buffer while consuming the segment which was read in the previous reading cycle. Let the segment display time be t (seconds); if the reading cycle takes less than t seconds, the system waits until t seconds are over before starting the next reading cycle. The issue of finding the minimum value for t that still ensures starvation-free operation is addressed in [2]. We know that when VBR compression is used the segment size can vary from segment to segment. This raises the question of whether our layout scheme will result in an even distribution of data across the groups. For example, a situation in which one group happens to be assigned more than its share of long segments could result in a disk capacity overflow at that group. The layout is such that all groups contain the same number of segments;² denote this number by s. We can view the amount of data assigned to a group as the sum of s independent bounded random variables with an identical probability distribution.³ It can be shown that for any realistic choice of parameters, the probability of a significant deviation from the average sum is negligible (see [2]).

² Strictly speaking, slight differences can occur if the length of some movies is not a multiple of G·t, but such differences would be insignificant.
³ These assumptions are justified since the segments in a group correspond to parts of the movie that are far enough apart (unless the number of groups is very small), or to parts of different movies altogether. See [6] for a study of autocorrelation in MPEG streams.
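The rotation of cohorts through the groups can be summarized with a small simulation. The sketch below is ours, under simplifying assumptions (stream names S1-S3, at most one admission per cycle, and an admission test at group 0, which holds every movie's first segment under the round-robin layout); it is not the paper's scheduler.

    G = 2                  # number of groups, hence number of cohorts
    MAX_PER_COHORT = 4     # fixed maximum number of streams per cohort
    cohorts = [[] for _ in range(G)]     # stream ids currently in each cohort
    pending = ["S1", "S2", "S3"]         # requests waiting for a free slot

    def group_serving(cohort_id: int, cycle: int) -> int:
        """Group that serves this cohort during the given reading cycle."""
        return (cohort_id + cycle) % G

    def run_cycle(cycle: int) -> None:
        for c in range(G):
            g = group_serving(c, cycle)
            # A waiting stream joins a cohort only when that cohort has a free
            # slot and is about to be served by the group holding the movie's
            # first segment (group 0 under the round-robin layout).
            if pending and g == 0 and len(cohorts[c]) < MAX_PER_COHORT:
                cohorts[c].append(pending.pop(0))
            # One segment is read from group g for every stream in cohort c.
            print(f"cycle {cycle}: group {g} serves cohort {c} -> {cohorts[c]}")

    for t in range(4):
        run_cycle(t)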
3.3 Fixed size segments
In Section 4 we describe a disk caching technique (OAC) which requires us to be able to fix the size of segments. When VBR compression is used, the fixed time approach described above does not permit us to do this. We now describe a fixed size approach which addresses this issue. Unless OAC is used in conjunction with VBR compression, however, the fixed time approach is the preferred approach since it requires less buffering (see below).

The fixed size approach is similar to the fixed time approach described above, with the following differences: movies are divided into segments of equal size rather than equal display time, and time during playback is divided into reading cycles during which at most one segment is read for each stream that is being displayed. Unlike the fixed time approach, the fixed size approach may require omitting the reads for some streams during some reading cycles in order to prevent buffer overflow. This introduces a complication for our pipelined scheme: we cannot just allocate segments to groups in a strict round-robin fashion as we did in the previous section. The reason is that if we follow this scheme then there is the possibility that, in order to prevent buffer overflow, a segment will not be requested when the group on which it resides serves the stream's cohort; the segment will be needed only later, when the cohort is served by another group. When the data is laid out on the groups, it is necessary to keep track of the expected buffering situation for all movies so that the layout process can avoid putting a segment in a group if the corresponding reading cycle will not request it. During layout, segments are put in the groups in an essentially round-robin fashion. However, some groups are occasionally skipped for some movies; these skips correspond to the skips that would occur during the corresponding reading cycles. Hence, by simulating the consumption process during the layout process we can ensure that the segments requested during any reading cycle will be in the proper groups; a sketch of such a layout pass appears below. The buffering requirement for the fixed size approach is almost 50% higher than that for the fixed display time approach (this is a result of the need to ensure starvation-free operation even when streams are omitted during some reading cycles; see [2]). This raises the question of whether the fixed size approach has any advantages. As we will see in Section 4, the fact that fixed size segments are read in each reading cycle enables us to use a highly beneficial disk caching scheme which cannot be used with the fixed display time approach. This caching technique can make the fixed size approach much more attractive than the fixed display time approach.
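As a rough illustration of this layout pass, the sketch below assigns fixed size segments to groups round-robin while simulating a deliberately simplistic per-stream buffer: whenever the next read would overflow the buffer, the corresponding group is skipped for that movie, mirroring the read that the matching reading cycle would omit. The buffer model and all parameter names are our own assumptions.

    def lay_out_movie(num_segments: int, segment_size: int,
                      buffer_capacity: int, consumed_per_cycle: int, G: int):
        """Return, for each segment index, the group it is placed on."""
        assert 0 < segment_size <= buffer_capacity and consumed_per_cycle > 0
        placement = []
        buffered = 0        # modeled amount of this stream's data in the buffer
        group = 0           # group corresponding to the current reading cycle
        seg = 0
        while seg < num_segments:
            if buffered + segment_size <= buffer_capacity:
                placement.append(group)      # this cycle reads segment `seg`
                buffered += segment_size
                seg += 1
            # else: skip this group for this movie; the cycle omits the read
            buffered = max(0, buffered - consumed_per_cycle)
            group = (group + 1) % G          # next reading cycle, next group
        return placement

    # Mostly round-robin placement, with an occasional skipped group.
    print(lay_out_movie(num_segments=8, segment_size=4,
                        buffer_capacity=10, consumed_per_cycle=3, G=3))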
4 On arrival caching (OAC)
Disks typically use a local buffer or cache to improve the performance of read operations [8]. The technique which is used for this purpose is called look-ahead read or readahead. The idea is to preload the buffer with the data that immediately follows the requested data on the disk. This is motivated by the fact that read requests tend to have a high degree of contiguity in typical environments, and thus it is likely that
read requests will be followed by additional requests to the immediately following data. When such requests arrive they may be satisfied by the buffer without incurring the delay of accessing the disk. Readahead caching is far from ideal for our environment. Our workload consists of small reads of one or two tracks per disk. Furthermore, with the fixed size segment approach (see Section 3.3) we know exactly how much contiguous data will be read from each disk (the size of a segment divided by the number of disks in a group); clearly there will usually be no advantage to caching data beyond that amount. Another pertinent fact is that a large portion of the latency in a reading cycle is due to rotational latency. If possible, we would like to use caching to eliminate that latency. Fortunately, we can achieve this goal with the fixed segment size approach by exploiting the fact that the size of the reads is fixed.
4.1 The basic technique
The preferred disk caching scheme for our environment is on arrival caching (OAC), also known as on arrival readahead or zero latency read. The idea is to eliminate the rotational latency by doing the following: after a read request is received by the disk drive, the drive seeks to the proper location and starts reading the data into the buffer immediately (i.e., at the next sector to pass under the read head). An entire track (or more) is read and cached in the buffer. Obviously, this scheme offers a substantial advantage if requests are a small whole number of tracks in size and the data can be laid out properly. As we will see, this is the case in our environment. OAC was used in some early disk drives when track capacities were lower and the size of typical accesses was closer to the size of a track [8]. Some current disks support OAC, but the use of this scheme is not widespread since it is not appropriate for most environments. We will show that OAC can offer a substantial benefit in a multi-session video environment.
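The effect of OAC can be modeled in a few lines. The sketch below is an idealized illustration (a sector-numbered list stands in for a track; there is no real drive interface here) of starting the transfer at whatever sector happens to be under the head, reading one full revolution, and letting the cache hand the sectors back in logical order.

    def rotational_wait(head_position: int, first_sector: int, sectors: int) -> int:
        """Sectors that pass under the head before a conventional read begins."""
        return (first_sector - head_position) % sectors

    def on_arrival_read(track: list, head_position: int) -> list:
        """Read the whole track starting wherever the head is, then reorder."""
        n = len(track)
        arrival = [track[(head_position + i) % n] for i in range(n)]  # one revolution
        return sorted(arrival)   # the cache returns the sectors in logical order

    track = list(range(16))      # one track = 16 sectors holding a whole fragment
    print("conventional read waits", rotational_wait(11, 0, len(track)), "sector times")
    print("OAC transfers immediately and returns", on_arrival_read(track, 11))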
4.2 Layout for OAC
The goal of a layout scheme for OAC is to achieve the elimination of the rotational latency without incurring the penalty of fragmentation. The disk cache should only contain data that is relevant (i.e., belongs to the segment that is being read), so no track should contain data belonging to more than one segment. Clearly, if the size of the fragment of a segment that is stored on each disk is not a whole number of tracks, then fragmentation occurs. The conclusion is that the fragment size should be a whole number of tracks, and data belonging to different fragments should be stored in different tracks.

Another relevant fact needs to be taken into consideration: tracks in most modern high capacity disks are grouped into 3-20 zones. Tracks in outer zones have higher capacities than those in inner zones. This technique is called zoned bit recording (ZBR) [8]. Segment sizes must be tailored to the particular zone in which they reside in order to satisfy the requirement that each fragment of a segment is a whole number of tracks in size. Segments in outer zones might be larger than segments in inner zones because of the need to fill a track to capacity. Note that this does not result in longer reading times for segments in outer zones: the number of tracks read there will be the same as or lower than the number of tracks read for segments in inner zones, and track reading times are the same regardless of the location of the track. See [2] for details about computing the segment sizes.
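A hypothetical sizing helper along these lines is sketched below; the per-zone track capacities are made-up numbers used only to show how segment sizes are tailored so that each per-disk fragment fills a whole number of tracks in its zone.

    def segment_size(track_capacity: int, tracks_per_fragment: int,
                     disks_per_group: int) -> int:
        """Segment size that keeps every per-disk fragment a whole number of tracks."""
        return tracks_per_fragment * track_capacity * disks_per_group

    zone_track_capacity = {0: 120_000, 1: 100_000, 2: 80_000}   # bytes per track (assumed)
    for zone, capacity in zone_track_capacity.items():
        size = segment_size(capacity, tracks_per_fragment=1, disks_per_group=4)
        print(f"zone {zone}: segments of {size:,} bytes (one track per disk, 4 data disks)")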
5 Interleaved annular layout (IAL)
Another way to reduce the length of reading cycles (and thus reduce the buffering requirement) is to lay out the segments in such a way that the distances among segments that participate in a reading cycle are short. This obviously requires constraining the times at which new streams can be introduced into reading cycles. If we allow new streams to be introduced any time there is a free slot, then cohorts may request any combination of segments, thus foiling any constrained layout scheme. We propose a layout scheme which we call interleaved annular layout (IAL). The cylinders of the disks in the disk array are divided into contiguous cylinder groups called rings. The scheme ensures that all segments read by any cohort in any reading cycle are in the same ring. Every other ring is accessed as the heads move from one edge of the disks to the other; the rings read when the heads move in one direction are the ones skipped when the heads move in the other direction. The purpose of this interleaving is to prevent the need for a long seek back to the first ring when the heads reach the last one. An example is shown in Figure 3. There are 5 rings, 2 movies and 10 segments per movie in this example, and only one group in the disk array. The figure shows how the fragments of the segments might be laid out; only one disk is shown (the layout on all disks in the group is identical). The notation Fx.y denotes a fragment of segment y of movie x. The arrows illustrate the direction in which the reading proceeds; dashed arcs denote skipping.

[Figure 3: IAL with 5 Rings — one disk of a single-group array holding two 10-segment movies; each ring holds the fragments Fx.y of the segments assigned to it (e.g., the outermost ring, Ring 0, holds segments 0 and 5 of both movies), and dashed arcs mark the rings skipped during the current sweep direction.]

Each cohort reads segments from a single ring of the group which is currently serving the cohort; in our example (where there is only one cohort), if the heads are currently in the outer ring (Ring 0) then only segments 0 and 5 of movies 1 and 2 can participate in the current reading cycle. This means that the server might not be able to start servicing a new request immediately, since it has to wait for the ring containing the first segment of the movie to become the next ring to be accessed. We use the term startup latency to refer to this latency. The reads within a ring are performed in the order in which the segments appear in the ring. IAL can be used with or without OAC. If we wish to use IAL without OAC, we can use the fixed time approach which is described in Section 3.2. Let R be the number of rings. Segment s of each movie is striped across ring⁴ s mod R of group s mod G. There is no constraint on the order of segments within a ring. R should be relatively prime to G in order to ensure an even distribution of segments among the rings. This will also ensure that if there is a free slot in some cohort then a new stream will not have to wait more than G·R reading cycles before it can be initiated. A new stream can be initiated when a cohort with a free slot is about to be served at the group and ring containing the first segment of the new stream. If we wish to combine OAC with IAL, we need to use the fixed segment size approach which was discussed in Section 3.3. Recall that the fixed size approach requires omitting the reads for some streams during some reading cycles in order to prevent buffer overflow. This introduces a layout complication which is dealt with by skipping groups during layout as described in Section 3.3.
⁴ Rings are numbered by the order in which they are read; see Figure 3.
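The bookkeeping behind IAL can be sketched as follows: ial_location applies the ring and group assignment given above, and sweep_order shows one possible interleaved ordering of the physical ring positions (the exact ordering used in Figure 3 is not reproduced here). Both helpers are illustrative.

    from math import gcd

    def ial_location(s: int, R: int, G: int):
        """(group, ring in reading order) holding segment s of a movie."""
        assert gcd(R, G) == 1, "R and G should be relatively prime"
        return s % G, s % R

    def sweep_order(R: int):
        """Read every other ring moving inward, then the skipped rings on the
        way back, so no long seek back to ring 0 is ever needed."""
        outward = list(range(0, R, 2))
        inward = [r for r in range(R - 1, 0, -1) if r % 2 == 1]
        return outward + inward

    R, G = 5, 1
    print([ial_location(s, R, G) for s in range(10)])   # segments 0 and 5 share ring 0
    print("reading order visits physical ring positions", sweep_order(R))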
6 Case study

Using the techniques described in [2], we studied the performance of a disk array with 24 data disks. The disk parameters were based on the performance characteristics of Seagate Elite 9 disks, which are high-capacity (9 GB) high-performance disk drives that utilize ZBR. The maximum consumption rate was 12 Mbits/s.⁵ Table 1 shows the case study results. The buffer size was constrained to be at most 8 MB per stream. The + and - signs that appear after OAC and IAL signify whether the scheme was used (+) or not (-). The results in the MIN columns were obtained by considering all seeks to be track to track seeks; hence, these results serve as a bound on what can be achieved by decreasing seek times. Each entry in the table shows the maximum number of concurrent streams and the required buffer size per stream (in MB).

     G   OAC-,IAL-   OAC-,IAL+   OAC-,MIN    OAC+,IAL-   OAC+,IAL+   OAC+,MIN
     1   55 (8)      58 (8)      58 (8)      68 (5)      79 (5)      79 (5)
     2   64 (8)      66 (8)      66 (7)      74 (7)      78 (3)      78 (3)
     3   66 (7)      69 (7)      69 (7)      75 (8)      78 (2)      78 (2)
     4   68 (6)      72 (7)      72 (7)      72 (5)      76 (2)      76 (2)
     6   72 (7)      72 (5)      72 (5)      72 (4)      78 (2)      78 (1)
     8   72 (5)      72 (4)      72 (4)      72 (3)      72 (1)      72 (1)
    12   72 (4)      72 (3)      72 (3)      72 (3)      72 (1)      72 (1)
    24   72 (3)      72 (2)      72 (2)      72 (2)      72 (1)      72 (1)

Table 1: Case Study Results

We see the substantial benefit obtained by increasing the number of groups G. If the maximum number of concurrent streams is u_max, then a reading cycle will contain at most u_max/G reads per cohort.⁶ Hence, as G increases, the maximum number of reads that need to be performed per cohort during a reading cycle decreases proportionally, but the bandwidth of a disk group also decreases proportionally to the increase in the number of groups (recall that each group contains D/G disks). The reason for the benefit obtained by increasing G is that the number of seeks and rotational latencies incurred per cohort during a reading cycle also decreases proportionally to the increase in G because of the reduction in the number of reads. Since the buffering requirement is proportional to the maximum length of a reading cycle, the shorter reading cycles obtained by increasing G result in a lower buffering requirement or a larger number of supportable streams for the available memory. The graph in Figure 4 focuses on the impact of OAC and IAL when G = 1. Note the jumps exhibited by the OAC curves. These jumps are a result of the requirement that the segment size be a multiple of the track size for OAC. We see the significant benefit of using OAC, and IAL with OAC. IAL alone provides a less significant benefit.

⁵ This is a reasonable rate for NTSC quality MPEG compressed movies (see [6]).
⁶ Note that u_max has to be a multiple of G, which accounts for the sporadic non-monotonicity in Table 1.
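A back-of-the-envelope model illustrates this argument; every parameter value below (segment size, per-disk transfer rate, per-read seek/rotation overhead) is an assumption chosen only for illustration and is not one of the measured figures behind Table 1.

    def cycle_time(G: int, u_max: int = 72, D: int = 24, segment_bytes: int = 1_500_000,
                   per_disk_rate: float = 5_000_000.0, overhead_s: float = 0.015) -> float:
        """Rough reading-cycle length: per-cohort reads times (overhead + transfer)."""
        assert u_max % G == 0, "u_max has to be a multiple of G (footnote 6)"
        reads = u_max // G                                     # reads per cohort per cycle
        transfer = segment_bytes / (per_disk_rate * (D // G))  # seconds per read
        # Total transfer work per cycle is independent of G; only the per-read
        # seek/rotation overhead term shrinks as G grows.
        return reads * (overhead_s + transfer)

    for groups in (1, 2, 3, 4, 6, 8, 12, 24):
        print(f"G = {groups:2d}: reading cycle ~ {cycle_time(groups):.2f} s")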
[Figure 4: Case Study Results (G = 1) — buffer size per stream (MB) versus number of streams for the four combinations {OAC-, IAL-}, {OAC-, IAL+}, {OAC+, IAL-}, and {OAC+, IAL+}.]
References
[1] Berson, S., et al. Staggered Striping in Multimedia Information Systems. In Proc. ACM SIGMOD, pp. 79-90, 1994.
[2] Cohen, A., W.A. Burkhard, and P.V. Rangan. Pipelined Disk Arrays for Digital Movie Retrieval. Tech. Report CS-95-409, CSE Dept., University of California, San Diego, 1995.
[3] Cohen, A., C.W. Padgett, and W.A. Burkhard. A High-Performance Circuit-Switched Network for Distributed Video Servers. Tech. Report CS-95-412, CSE Dept., University of California, San Diego, 1995.
[4] Gemmell, J. Multimedia Network File Servers: Multichannel Delay Sensitive Data Retrieval. In Proc. ACM Multimedia '93, pp. 243-250, 1993.
[5] Little, T.D.C., and D. Venkatesh. Probabilistic Assignment of Movies to Storage Devices in a Video-On-Demand System. In Proc. 4th Int'l Workshop on Network and Operating System Support for Digital Audio and Video, pp. 213-224, 1993.
[6] Pancha, P., and M. El Zarki. MPEG Coding for Variable Bit Rate Video Transmission. IEEE Communications Magazine, 32(5), pp. 54-66, 1994.
[7] Reddy, A.L.N., and J. Wyllie. Disk Scheduling in a Multimedia I/O System. In Proc. ACM Multimedia '93, pp. 225-233, 1993.
[8] Ruemmler, C., and J. Wilkes. Modelling Disks. HP Labs Tech. Report HPL-93-68, 1993.
[9] Yu, P.S., M.-S. Chen, and D.D. Kandlur. Design and Analysis of a Grouped Sweeping Scheme for Multimedia Storage Management. In Proc. 3rd Int'l Workshop on Network and Operating System Support for Digital Audio and Video, pp. 38-49, 1992.