Oct 3, 1994 - and Wilkes[15], a linear model could lead to a wide range of deviation. ..... [12] Jim Gemmell and Stavros Christodoulakis. Prin- ciples of ...
A Multimedia Storage System for On-Demand Playback Yen-Jen Oyang , Chun-Hung Wen, Chih-Yuan Cheng, Meng-Huang Lee, and Jian-Tian Li Department of Computer Science and Information Engineering National Taiwan University Taipei, Taiwan, R.O.C. October 3, 1994
Abstract
important applications of multimedia storage systems is on-demand playback of video or high-quality audio programs [5, 6, 8, 9, 10, 11, 12, 13, 14]. In on-demand playback applications, the storage system supports concurrent retrieval of continuous video or audio programs requested by a large number of clients. Such applications impose two major challenges to mass storage system design:
This paper presents the design of a multimedia storage system for on-demand playback. The design stresses effective utilization of disk bandwidth with minimal data buffer to minimize overall system costs. The design procedure is most distinctive in the following two aspects: 1. It bases on a tight upper bound of the lumped disk seek time for the Scan disk scheduling algorithm to achieve effective utilization of disk bandwidth.
1. High data retrieval bandwidth – The data retrieval bandwidth required by such applications is particularly high due to a large number of clients.
2. It starts with a general two-level hierarchical disk array structure to derive the optimal configuration for specific requirements.
2. Real-time, continuous transfer of data – The system must guarantee uninterrupted service to each client.
Key terms: multimedia, storage system, on-demand playback, disk bandwidth, disk array
In order to meet these two challenges with minimal system costs, the designer must develop appropriate data placement and retrieval strategies so that I/O bandwidth of the storage devices is effectively utilized with minimal amount of data buffer. Here, effective utilization of storage device bandwidth means that the number of storage devices required to provide sufficient I/O bandwidth is minimized. In addition to the number of storage devices, another factor that contributes to overall system costs is the size of data
1 Introduction In recent years, the design of mass storage systems for multimedia applications has become an active research topic [1, 2, 3, 4, 5, 6, 7]. One of the most This research was sponsored in part by the National Science Council of R.O.C. under grant NSC 83-0408-E-002-002.
1
buffer. In on-demand playback applications, certain amount of data buffer is needed to guarantee realtime, uninterrupted transfer of data to each client. It is the designer’s desire to minimize the amount of data buffer required. The studies reported in [6, 11] represent the very first efforts to tackle the disk scheduling problem in multimedia storage design. The main deficiency of the early efforts is that utilization of disk bandwidth is extremely ineffective. The disk head needs to sweep across entire disk surface in order to read just one file block of data, or in another term, one retrieval unit of data. Yu, Chen and Kandlur[13], then, proposed the Grouped Sweeping Scheme(GSS) to improve disk bandwidth utilization. However, the GSS uses a linear model for disk seek time. According to Ruemmler and Wilkes[15], a linear model could lead to a wide range of deviation. In multimedia storage system design, this means that disk bandwidth would not be effectively utilized since the designer would need to include a large margin to guarantee real-time constraints. Another important issue that has not been thoroughly studied in previous literature is how disk arrays can be exploited in multimedia storage system design. This paper presents a comprehensive procedure to design multimedia storage systems for on-demand playback. The design stresses effective utilization of disk bandwidth with minimal amount of data buffer. The design procedure is most distinctive in the following two aspects:
system specification. Finally, section 4 concludes the discussion of this paper.
2 General Organization and Operations 2.1 General Organization Figure 1 depicts the architecture of the multimedia storage system. The system comprises a host and multiple storage nodes. The host stores the directory of the entire file system and commands the operations of the storage nodes. Each storage node contains one or more disks for storing multimedia data. If the storage node contains more than one disks, then these disks form a fine-grain disk array [19]. The fine-grain disk arrays in the storage nodes are then grouped to form a coarse-grain disk array [19]. As a result, the entire system consists of two levels of disk arrays. Since playback operations invoke no writes to the disks, we will purposely omit the parity data when we refer to the disk array structure shown in Figure 1. The storage nodes receive commands from the host through the Ethernet and deliver the requested data to the hign-bandwidth video channel. In our implementation, each node, including the host and the storage nodes, is an Intel 80486 CPU based personal computer running the Mach 3.0 microkernel[20] with a BSD UNIX server. A SCSI bus is used as the highbandwidth video channel. Figure 2 shows the general rule to place the file blocks, or retrieval units in another term, of a multimedia file in the system. In Figure 2, each drawn disk represents the logical disk formed by the disks in a storage node. Since the discussion here is about how file blocks are interleaved in the high-level of the disk array hierarchy, the low-level fine-grain disk array can be treated logically as an individual disk in this regard. Figure 2 shows that each disk is evenly partitioned into several regions. Each region is composed of a number of physically consecutive tracks. Disk partition is not mandatory. If partition is not performed,
1. It bases on a tight upper bound of the lumped disk seek time for the Scan disk scheduling algorithm[16, 17, 18] to achieve effective utilization of disk bandwidth. 2. It starts with a general two-level hierarchical disk array structure to derive the optimal configuration for specific requirements. In the following part of this paper, section 2 discusses the general organization and operations of the proposed system. Section 3 elaborates the design process that leads to an optimal solution under a given 2
Ethernet
Host
Directory Disk
Storage Node
Storage Node
Storage Node
Fine-grain disk array
Fine-grain disk array
Fine-grain disk array
SCSI bus as the High Speed Video Channel
Figure 1: Architecture of the proposed multimedia storage system
Region 0
Disks are evenly partitioned into R regions
Disk 0
Disk 1
Disk M-1
File blocks 2RMI, and 2RMI+2R-1
File blocks 2RMI+2R, and 2RMI+4R-1
File blocks 2RMI+2RM-2R, and 2RMI+2RM-1
File blocks 2RMI+2R+1, and 2RMI+4R-2
File blocks 2RMI+2RM-2R+1, and 2RMI+2RM-2
Region 1
File blocks 2RMI+1, and 2RMI+2R-2
Region R-1
File blocks 2RMI+R-1, and 2RMI+R
File blocks 2RMI+3R-1, and 2RMI+3R
Coarse-grain array with hardware parallelism = M
Figure 2: File block placement strategy
3
File blocks 2RMI+2RM-R-1, and 2RMI+2RM-R
then file blocks are placed in the disk system just like the normal case of a coarse-grain disk array. If disk partition is performed, then file blocks from each individual multimedia file are then interleaved in the disk system according to the following rule: file blocks with indices 2RMI + 2jR + k and 2RMI + 2jR + (2R ? k ? 1) are placed in region k of disk j , where M is the number of storage nodes in the system, R is the number of regions into which the disk is partitioned into, I is any integer number larger than or equal to 0, j is the index of the disk and runs from 0 to M ? 1, and k is the index of the region and runs from 0 to R ? 1. The idea behind the development of the 2-level disk array architecture is to provide various design alternatives. As will be shown later in the paper, two systems with the same number of disks but different organizations will have different characteristics. The main distinctions are the amount of data buffer required and the maximum start-up latency, which is the maximum amount of time a new client needs to wait before the service begins. It is up to the designer’s decision to select one design alternative that best fits his/her needs. The reason behind performing disk partition is to minimize average seek time of disk accesses during the Scan operation. The reason to round file block placement at one end of the disks is to optimize disk head movement. As will be shown later, when carrying out data retrieval, the disk head iteratively scans across disk surface in both directions to read file blocks from the active video/audio programs. Therefore, it makes sense to round file block placement at one end of the disk so that the data to be read next are immediately available when the disk head turns around.
denotes the number of storage nodes in the system. Then, the system divides the streams into M groups with no group exceeding a predetermined ceiling of number of streams. The ceiling is imposed to guarantee uninterrupted service to each client, i.e. to meet real time requirements. If there are more clients than the system can serve at one time, then the late comers must wait until some slots in these groups become vacant, i.e. some clients terminate their accesses.
The system divides the streams into M groups according to their starting times so that the M groups of streams access the M storage nodes in an interleaved and rotatory manner. Two streams in the same group were admitted to the system either at the same time or at different times but with overlapped shifts. In the later case, these two streams access file blocks with an index offset equal to a multiple of (2 R M ) simultaneously, where R is the number of regions into which the disk is partitioned into. On the other hand, two groups of streams never access the same disk at any given time. The M groups of streams access disks in a synchronized manner. That is, the M groups of streams access the same partition region in the disks at the same time. When the disk head scans across one partition region, the system retrieves one file block for each stream. Once the disk head has made a round trip across the disk surface, these M groups of streams rotate and start a new round of access.
Figure 3 demonstrates a simple case of the disk access operation. In Figure 3, there are 4 streams and 2 disks. Each disk has two partition regions. According to the placement policy illustrated in Figure 2, file blocks with indices 8I; 8I + 1; 8I + 2; 8I + 3; are stored in Disk 0 and file blocks with indices 8I + 4; 8I + 5; 8I + 6; 8I + 7; are stored in Disk 1, where I is an integer larger than 2.2 Disk Operations or equal to 0. The 4 streams are divided into 2 groups The multimedia storage system proposed in this paper so that while streams W and X are accessing Disk 0, performs service by dividing the streams into a num- streams Y and Z are accessing Disk 1, and vice versa. ber of groups and making these groups access disks Furthermore, accesses performed by these two groups in an interleaved and synchronized manner. Let M are synchronized. While stream W is accessing file 4
Stream W
Stream Y
Stream X
Stream Z
At time T0 At time T1
Region 0
Region 1
At time T0
File blocks 0, 3, 8, 11, 16, 19, ...
File blocks 4, 7, 12, 15, 20, 23, ...
File blocks 1, 2, 9, 10, 17, 18, ...
File blocks 5, 6, 13, 14, 21, 22, ...
Disk 0
Disk 1
Figure 3: An illustration of the disk access operation block 8i + m from one program and stream X is accessing file block 8j + m from another program, streams Y and Z are accessing file blocks (8k + m + 4) mod 8 and (8l + m + 4) mod 8 from the other two programs, respectively. During the access operation, the system retrieves and buffers one file block for each stream when the disk head scans across one partition region. Later in time, when the disk head moves to next region, the system transmits the buffered data to the clients and buffers the incoming data in another area of the data buffer. With this practice, it is quite straightforward to figure out that the data buffer must be of size
a new client can not start its access until a group with vacant slots has rotated to the storage node that contains file block 0.
3 System Design This section elaborates a comprehensive procedure to determine the optimal disk system configuration for meeting a given specification. The primary optimization criteria are the effectiveness of disk bandwidth utilization and size of data buffer required.
3.1 The Design Process
2 (the size of a file block) (the maximum number of streams allowed )
Here, let us use the symbols listed below in the subsequent discussion.
N : denotes the number of streams that the sys-
The admission control mechanism in the proposed multimedia storage system is quite straightforward. A new client can be admitted only when vacant slots are available in one or more groups of streams. That is, some groups contain less streams than the permitted ceiling. However, even if vacant slots are available,
tem can admit. Assume N is given by the system specification.
R: denotes the number of regions into which the disk is partitioned into. 5
L:
denotes the number of disks that each of access overhead, T o defined above, accounts for. In the low-level, fine-grain disk arrays in the disk other words, we should make the disk system spend system hierarchy contains. most of time in retrieving data rather than moving disk heads around and waiting for disk rotation. M : denotes the number of lower-level, fine-grain The first issue is to how the disks should be condisk arrays or single disks that the high-level, figured and organized. This issue concerns (1) how coarse-grain disk array structure contains. many regions into which the disk is to be partitioned G: denotes the maximum number of streams in into, i.e. R defined above, and (2) how these disks should be grouped to form a disk array structure as each group. G is equal to N divided by M . depicted in Figure 1. The design procedure presented S : denotes the number of bytes that the system in the following proceeds with a pre-determined value retrieves from one disk in each retrieval opera- of R. The designer may need to go through the design tion. S times L is the size of a file block. procedure a few times with various values of R in or: denotes the ratio of disk bandwidth utilization der to determine the optimal configuration that meets the system requirements, i.e. simultaneously serving that the designer wants to achieve. N streams with each requiring B s data bandwidth. T o : denotes the worst-case overhead when the With the value of R pre-determined, the designer disk head scans across one disk partition region can derive a formula for computing T o defined above. and makes G disk accesses. T o contains three T o contains two components: (1) the maximum components. The first component is the worst- lumped sum of seek time; (2) the lumped worst-case case lumped sum of seek time. The second rotational latency; (3) the overhead due to other syscomponent is the lumped worst-case rotational tem operations such as command issuing, bus initiallatency. The third component is due to other ization, and data transfer. The overhead due to other system overheads for making disk accesses such system operations for making one disk access can be as command issuing, bus initialization, and data modeled by a fixed worst-case lumped sum and is detransfer. If the system uses a disk track as the noted by T 1 here. The maximum lumped sum of seek basic storage unit and the disk features on-arrival time is a function of the number of cylinders in one read-ahead[15], then disk accesses would not in- disk partition and the term G defined above. cur rotational latency. Otherwise, each disk acAs shown in Figure 4, the seek time of a disk can cess may incur a rotational latency as much as be conservatively modeled by piecewise linear equathe rotation time of the disk. tions. In the case of Figure 4, the modeling function comprises two linear segment as follows: T r : denotes the time of one disk revolution.
B d : denotes the sustained bandwidth of the disk.
(
seek time =
It is the bandwidth observed when the disk continues to read data out from consecutive tracks.
c 1 + c2 d c 3 + c4 d
if d C if d > C;
(1)
where C is a constant defining the boundary of the two formulas above and d is the distance of disk head movement in number of cylinders. Mathematically, it can be proved that, if c2 > c4 , then the maximum lumped seek time occurs when the G stops are evenly apart (See Appendix A for a complete proof). Accordingly, the designer derives the following formula
B s : denotes the data bandwidth required by each stream.
The system design is based on the architecture depicted in Figure 1. The basic idea behind the system design is to achieve effective utilization of disk bandwidth by minimizing the percentage of time the 6
The seek time measured The piecewise linear equations model
80
70
Time(milliseconds)
60
50
40
30
20
10 0
2000
4000 6000 Distance in number of cylinders
8000
10000
Figure 4: Measured disk seek time vs distance. for computing T o :
Inequality (3) guarantees the desired level of disk bandwidth utilization is achieved. Meanwhile, inequality (4) guarantees uninterrupted service of video/audio programs. Altogether, there are three variables G, L, and S in equation (2) and inequalities (3) and (4). The designer can write a computer program to figure out possible combinations of G, L, and S values that satisfy inequalities (3) and (4) and select the one that best matches his/her needs. A note here is that the program should automatically exclude some combinations with unreasonable large values of L and S since the size of data buffer required is
T o8=
G + 1) (c1 + c2 d0) + G (T r + T 1 ) if d0 C; (2) > (G + 1) (c3 + c4 d0 ) + G (T r + T 1 ) > > : if d0 > C; where d0 is the number of cylinders in a partition region divided by G + 1. As mentioned above, if > ( > >
2500;
1. It bases on a tight upper bound of the lumped disk seek time for the Scan disk scheduling algorithm to achieve effective utilization of disk bandwidth.
(5) where d is the distance of disk head movement in num2. It starts with a general two-level hierarchical disk ber of tracks and the unit of time is millisecond. The array structure to derive the optimal configuradesign goal is to achieve disk bandwidth utilization of tion for specific requirements. 50%. Table 2 summaries the resources required and max- The design procedure is simple and effective in the imum start-up latency for various design alternatives. sense that the designer can easily work out a few
8
Rotation speed No. of tracks No. of sectors per track No. of bytes per sector Maximum sustained disk bandwidth System overhead
3600rmp 9953 25 512 Bytes 553 KBytes per second 1.7 milliseconds
Table 1: Disk system specification. Alternative Values of R and L R=1, L=1 R=2, L=1 R=4, L=1 R=1, L=2 R=2, L=2 R=4, L=2 R=1, L=4 R=2, L=4 R=4, L=4
Value of G 2 2 2 5 5 5 10 10 10
Value of S (KBytes) 49.5 39 34 33.5 29.5 27 28 26 25
No. of disks required 10 10 10 8 8 8 8 8 8
Size of data buffer required (KBytes) 1,980 1,560 1,360 2,680 2,360 2,160 4,480 4,160 4,000
Maximum start-up latency 7.15 seconds 11.28 seconds 19.59 seconds 4.83 seconds 8.47 seconds 15.61 seconds 4.04 seconds 7.48 seconds 14.38 seconds
Table 2: Design alternatives of the design example. design alternatives and then select the one that best and m1 ; m2 ; : : : ; mq ,we have fits his/her requirements according to the hardware mj p X q X li X X resources required and other concerns such as maxih(n0 ? k) h(n0 + k ? 1) mum start-up latency.
i=1 k=1
Appendix A
if
Here, we will present a tight upper bound of lumped disk seek time for the Scan disk scheduling algorithm. In the subsequent discussion, N + denotes the set of all positive integers and R denotes the set of all real numbers.
j =1 k=1
p X i=1
li =
Proof: Since i < j , we have
q X j =1
mj and
9
for all 1 i p:
h(i) h(j ) for all i; j 2 N + and p X li X
Lemma 1 Let h : N + ! R be a function with h(i) h(j ) for all i; j 2 N + and i < j . Then, for an n 0 2 N + and two series of positive integers l1 ; l2 ; : : : ; lp
n 0 > li
h(n0 ? k)
i=1 k=1 p X li h(n0 ) i=1
=
=
=
h(n0 ) h(n0 ) q X
p X i=1
q X j =1
li
=
j =1 k=1
mj
s X i=1
h(n0 + k ? 1)
0
f (n0 + k ? 1) 0
0
sf (n0)
2
f
di = 0 and n0 + di 1 for all di , we have sf (n0)
j =1 k=1
f (n0 ? k) +
(* by Lemma 1 *)
Therefore, we have
: N + ! R be a function and f 0 (n) = f (n + 1) ? f (n) for all n 2 N +. If f 0 (i) f 0(j ) for all i < j , then for a positive integer n0 and a series of integers d1; d2; : : : ; ds with Lemma 2 Let
i=1 k=1 m
q Xj X
mj h(n0 )
j =1 mj q X X
?
p X li X
s X i=1
f (n0 + di)
s X i=1
Theorem 1 Let f : N + ! R, with f 0 (i) f 0 (j ) for all i < j , be a monotonically increasing function that models the seek time of a disk with respect to the seek distance in number of cylinders, where f 0 (n) = f (n + 1) ? f (n) for all n 2 N +. If the head of the disk scans across a region of C cylinders and makes (s ? 1) stops, then the lumped seek time is maximized when the (s ? 1) stops are evenly apart by C=s. Here, it is assumed that C is a multiple of s. Proof: Let n0 = C=s and d1 ; d2; : : : ; ds be a series
s X
Without loss of generality, assume of integer with di = 0 and n0 1 i = d1; d2 ; : : : ; dp < 0 and dp+1; dp+2; : : : ; ds 0. 1 i s. By Lemma 2, we have Let Proof:
q = s ? p; li = ?di mj = dp+j
for 1 i p; for 1 j q
Then, we have
p X i=1
li =
q X j =1
i=1
=
f (n 0 + di ):
References
( 0) +
f (n0 ? li ) ? f n
j =1
i=1
f (n0 + di ) ? f (n0)
i=1 q X
s X
di 1 for all
In Theorem 1 above, it is assumed that C is a multiple of t. If it is not the case, then the designer can use dC=te instead to calculate the maximum lumped seek time.
and
p X
sf (n 0)
+
This means the lumped seek time is maximized when the (s ? 1) stops are evenly apart by C=s. 2
mj
s X
2
f (n0 + di)
[1] C. Yu, W. Sun, and D. Bitton. Efficient placement of audio on optical disks for real-time applications. Communication ACM, Vol.32, No.7, July 1989.
f (n0 + mj ) ? f (n0)
10
[2] Y. Mori. Multimedia real-time file system. Technical report, Matshushita Electric Industrial Co., February 1990. [3]
[4]
[5]
[6]
[7]
[8]
[9]
[10]
[11]
and retrieval. ACM Transactions on Information Systems, Vol.10, No.1, January 1992.
[13] Philip S. Yu, Mon-Song Chen, and Dilip D. Kandlur. Design and analysis of a grouped A. Hopper. Pandora - an experimental system for sweeping scheme for multimedia storage manmultimedia application. ACM Operating System agement. In Proceedings of Third International Review, Vol.24, No.2, April 1990. Workshop on Network and Operating System D. D. Kandlur, M. S. Chen, and Z. Y. Shae. Support for Audio and Video, November 1992. Design of a multimedia storage server. In IBM [14] Fouad A. Tobagi, Joseph Pang, Randall Baird, Research Report, June 1991. and Mark Gang. Streaming RAID - a disk arShahram Ghandeharizadeh and Luis Ramos. ray management system for video files. First Continuous retrieval of multimedia data using ACM International Conference on Multimedia, parallelism. IEEE Transactions on Knowledge August 1993. and Data Engineering, Vol.5, No.4,4 August [15] Chris Ruemmler and John Wilkes. An introduc1993. tion to disk drive modeling. IEEE Computer, P. Lougher and D. Shepherd. The design of a Vol.27, No.3, March 1994. storage server for continuous media. The Com[16] Coffman, E.G., Klimko, L.A., and Ryan, B. puter Journal, Vol.36, No.1, 1993. Analysis of scanning policies for reducing disk A.L. Narasimha Reddy and James C. Wyllie. I/O seek times. SIAM Journal of Computing, Vol.1, issues in a multimedia system. IEEE Computer, No.3, Sep 1972. Vol.27, No.3, March 1994. [17] Teorey, T.J. and Pinkerton, T.B. A comparative Sun Microsystem. Multimedia file system. analysis of disk scheduling policies. CommuniTechnical report, Sun Microsystem, August, cations of the ACM, Vol.15, No.3, Mar 1972. 1989. Software Release. [18] Geist, R. and Daniel, S. A continuum of disk David P. Anderson and Goerge Homsy. A conscheduling algorithms. ACM Transactions on tinuous media I/O server and its synchronization Computer Systems, Vol.5, No.1, Feb 1987. mechanism. IEEE Computer, Vol.24, No.10, [19] Robert Y. Hou Gregory R. Ganger, Bruce October 1991. L. Worthington and Yale N. Patt. Disk arrays: D. Anderson, Y. Osawa, and R. Govindan. RealHigh-performance, high reliability storage subtime disk storage and retrieval of digital audio systems. IEEE Computer, Vol.27, No.3, March and video. Technical report, U.C. Berkeley, 1994. September 1991. UCB Technical Report. [20] K. Loepere. Mach 3 Kernel Principles. Open P. Venkat Rangan and Harrick M. Vin. DesignSoftware Foundation and CMU, 1991. ing file systems for digital video and audio. In Proc. of 13th ACM Symp. on Operation Syst. Principles, October 1991.
[12] Jim Gemmell and Stavros Christodoulakis. Principles of delay-sensitive multimedia data storage 11