Multimedia Tools and Applications Journal, 18(2), pp. 137-158, 2002.
Fundamentals of Scheduling and Performance of Video Tape Libraries

Costas Georgiadis, Peter Triantafillou, Christos Faloutsos

February 15, 2000
Abstract

Robotic tape libraries are popular for applications with very high storage requirements, such as video servers. Here, we study the throughput of a tape library system, design a new scheduling algorithm, the so-called Relief, and compare it against some older/straightforward ones, like FCFS, Maximum Queue Length (MQL) and an unfair one (Bypass), roughly equivalent to Shortest Job First. The proposed algorithm incorporates an aging mechanism in order to attain fairness and we prove that, under certain assumptions, it minimizes the average start-up latency. Extensive simulation experiments show that Relief outperforms its competitors (fair and unfair alike), with up to 203% improvement in throughput, for the same rejection ratio.
Author affiliations: Costas Georgiadis and Peter Triantafillou are with the Department of Electronics and Computer Engineering, Technical University of Crete, Crete, Greece (e-mail: [email protected]); Peter Triantafillou is the contact person (FAX: +30 821 64846, TEL: +30 821 37230). Christos Faloutsos is with the Department of Computer Science, Carnegie Mellon University, Pittsburgh, USA (e-mail: [email protected]).

Funding: This work has been partially supported by the ESPRIT Long Term Research Project HERMES (project number 9141). This material is based upon work supported by the National Science Foundation under Grants No. IRI-9625428, DMS-9873442, IIS-9817496, and IIS-9910606, and by the Defense Advanced Research Projects Agency under Contract No. N66001-97-C-8517. Additional funding was provided by donations from NEC and Intel. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation, DARPA, or other funding parties.

1 Introduction

Even though secondary storage devices (based on magnetic disks) have become cheaper and have increased their storage capacity at remarkable rates, they still cannot satisfy economically the storage requirements of video-based databases for many demanding applications, such as video-on-demand, digital libraries, tele-teaching, video broadcasting, etc. Tertiary storage and, in particular, tape libraries, offer an economical solution. For example, tape storage, even when considering high-end tape library products (such as the Ampex DST812 robotic library), costs less than 3 cents/MB, while for high-end disk array products (such as Maximum Strategy's Gen-5 product) the cost is about 30 cents/MB. Furthermore, the above figures cover only the "capital" costs and do not include the high maintenance costs of disk storage, which is reported to be $7/MB ([17]). Considering that (i) the storage for a 12 TB video database containing a few thousand videos costs less than $400,000 if a high-end tape library is chosen (e.g., the price for the Ampex DST812 with four drives and 256 50 GB tapes) and over $4,000,000 if a high-end disk array is chosen (e.g., the price for 8 Maximum Strategy Gen-5 arrays, each storing 1.7 TB), and that (ii) medium-sized companies are reported to be willing to spend up to $125,000 annually for mass storage systems, one can see why robotic, near-line tape libraries are very desirable. For reasons such as these, the tape library market is growing at an annual pace of 34% ([17]).

However, robotic tape library access times remain about three orders of magnitude slower than those of disk-based storage, due to costly robotic exchange and tape positioning operations. For this reason, many demanding applications combine all the above-mentioned storage levels into hierarchical storage servers, in which the highest and fastest level is the disk storage, which acts as a cache, and the lowest and slowest is the robotic tape library. In fact, several real-world products for applications such as digital TV broadcasting and digital video effects rely upon such a storage technology infrastructure ([8]). Since the tape library level is the slowest, any performance improvement in transferring data from this level considerably improves the overall storage hierarchy performance. Our goal in this paper is twofold: first, to study the central issue of scheduling retrieval requests for tertiary-storage-resident video objects, and second, to study the performance fundamentals of such a tertiary-storage robotic tape library in video databases.
1.1 Tape Library Technology Overview
Tape libraries typically consist of robot arms, tape drives, and a large number of tape cartridges. The robotic mechanism loads/unloads the tapes from/to the shelves to/from the drives. The tape technology employed is either linear or helical scan recording ([1, 9]). Linear tapes have their tracks parallel to the tape's axis, and can be read in batches either in one direction or in both (e.g., serpentine drives). Helical scan tapes have their tracks at an angle to the tape's axis; a rotating drum can read the tracks while the tape moves in one direction. In addition, there is great variability in the storage capacity of each tape cartridge, with typical values ranging from a few GBs up to 330 GB. The total access time in tape libraries is considerably high. Searching within a tape proceeds at best at a pace of 1.6 GB/s, plus a startup cost of a few seconds. Many vendors report robotic delays of less than 10 seconds, which, at least for smaller-capacity tapes, can bring the total access cost below a few tens of seconds. Currently, there are tape drive products offering transfer rates up to 20 MB/s (with no compression).
1.2 The Problem
Application Requirements: Video databases have high storage requirements. For example, a single MPEG-2 90-minute video typically requires at least 2-3 GB of storage. Reading an MPEG-2 video from a tape can take from about 100 seconds to about 20 minutes, depending on the tape drive's performance capabilities. Video accesses follow a skewed distribution. Thus, the multicasting (or batching) of a single reading of a video for all its requests may prove very beneficial. However, due to VCR-type interactions, requests may be for different parts of a video, prohibiting such multicasting. Thus, both workload types should be considered. Finally, requests should be served within a certain time threshold (e.g., in (near-)video-on-demand applications) or the users "drop out". Algorithms that ensure low start-up latencies (for "lucky" requests) at the expense of a high drop-out/rejection ratio will have no real-world usefulness. An efficient scheduling algorithm must achieve high throughput while respecting rejection-ratio constraints.
System-Level Issues: Our library has three different types of resources for which requests compete: the tapes, the tape drives, and the robots. To service requests, simultaneous resource allocation is necessary: a request has to possess the tape it wants (or a copy of it, if we have replication), as well as a tape drive, and, possibly, a robot arm, if the tape is not already loaded. Simultaneous resource allocation makes efficient scheduling a formidable task.

The problem at hand, therefore, is twofold: first, to devise efficient scheduling algorithms that are appropriate for such a complex environment and examine their performance; and second, to study which resources form the bottlenecks, under which circumstances, and what can be done to alleviate these bottlenecks. The second problem will be addressed by examining the impact of the length of the service times, of the number of tape drives, and of the number of robots employed. The settings of these parameters create bottlenecks at either the robot or the drive resources.

Our approach is both experimental and analytical. We first develop an optimal scheduling algorithm for a simplified problem setting. Then, we adapt that algorithm to our environment. The experimental study compares the performance of this algorithm to that of others found in the literature and studies the general performance characteristics of the video tape library.

In section 2, we overview related work. In section 3, a description of the tape library model is provided. The detailed description of three scheduling algorithms is presented in section 4. In section 5, we contribute a formal problem formulation and an optimal scheduling algorithm called Relief. In section 6, we explain how tape replication can be incorporated into the scheduling algorithms. In section 7, we present the results derived from our experiments, and in section 8 we study the impact of tape replication. Finally, an overview of our work is presented in section 9.
2 Related Work

Despite the facts that tape library storage (i) has been recognized as the most economical mass storage medium for applications such as video servers, (ii) suffers from very high access times, (iii) is currently employed by many real-world video delivery/manipulation products, and (iv) enjoys high market growth rates, to our knowledge there is no reported study of efficient scheduling of video tape library accesses, nor an experimental investigation of the performance issues in video tape library environments under efficient scheduling algorithms.
Related work has mostly concentrated on modeling the performance characteristics of tape drive and tape library products ([1, 9, 13, 14]), on comparative studies of the use of tertiary storage for multimedia ([1, 2, 7]), on storing and elevating video blocks from tertiary for playback ([5, 18]), on caching digital library documents in secondary storage ([15]), on striping and analytical modeling of tape libraries under FCFS scheduling ([6, 12]), on algorithms for optimal data placement in tertiary libraries ([3]), and on scheduling random accesses to traditional data on single (serpentine and modified serpentine) tapes ([10, 11]). In [16], the authors contributed a scheduling algorithm for video tertiary libraries which is useful when several drives compete for one robotic arm, the queuing delays for which could otherwise cause hiccups.
3 Tape Library Simulation Model

One can distinguish the operations of a tape library's components into robot arm operations and tape drive operations. A robot arm performs three fundamental tasks: the load, the unload, and the move operations. The load operation consists of grabbing a tape cartridge from a shelf and putting it into a drive. The unload operation consists of grabbing a tape from the drive and putting it back on the shelf. Loading and unloading require arm movement to and from a shelf. A drive performs four operations, namely the medium load, the medium eject, the tape search/rewind, and the tape read (playback).

Tape libraries consist of three key resources: tape drives, robot arms, and tapes. Contention for these three resources causes queuing delays. A queuing delay for a tape occurs when a request for a tape cannot be served (even though there may be available drives and robots) because the desired tape is already in use serving another request. A robot queuing delay occurs when there is no available robot arm while there is a request that requires one. Finally, queuing delays for drives occur when there are no available drives to be used by waiting requests. Notice that we have the case of simultaneous resource occupation: a request may have to wait for its tape to be available, for an empty tape drive, and, possibly, for a free robot arm, if the desired tape is not already loaded.
Our model simulates a closed queuing network for a tape library. Our library consists of D tape drives, R robot arms, and T tapes. The robots are considered identical, meaning that any operation one robot can perform can also be performed by any other robot in the same amount of time. The tape drives are also considered identical.

In our model each tape contains a single video object. This was done for simplicity, since it allows us to bypass issues such as the efficient placement of video objects within tapes, declustering, etc., which are research issues in their own right. Every time an object is accessed, the time spent by the drive transferring data depends on the object's size and on the drive's transfer rate. After a video object is accessed, a tape rewind is performed. We denote the sum of the transfer and the rewind time with the term access time (t_access) and we measure it in seconds. In multicasting environments, for simplicity, we assume that all requests have the same transfer time (i.e., video size). In unicasting environments this is not the case, since we wish to model random accesses within a video.
In the case where a robot invocation is required, the following operations take place: a 'drive unload' of the old tape in the idle drive with simultaneous movement of the robot arm towards the drive, a 'robot unload' of the tape, robot movement and placement of the old tape on the shelf, robot movement and pickup of the new tape from the shelf, and, finally, robot movement and drive load with the new tape. The whole procedure just described takes practically constant time (it has been found to be independent of the relative location of the tapes ([1])); it is termed the mount time (t_mount) and is also measured in seconds. We implicitly assume that when a drive becomes idle (i.e., there is no request that requires an access on it), its tape remains in the drive and is placed off-line (i.e., moved away from the drive) only if there is a request for another tape and the host drive must be used. (Some tape drives actually unload a tape that has been idle for a long period of time; however, this is a minor issue, since scheduling is of primary concern in systems with heavy loads.)

We model the user behavior, i.e., the way in which the user issues requests to the library, by associating with each object an access probability (p_i, i = 1, ..., T). A request for a tape playback is made according to the access probabilities of the objects. We assume, as is standard practice in related work, that the videos' probability distribution follows a Zipf distribution, which has been found to closely reflect user preferences:

    p_i = \frac{c}{i^{1-\theta}}, \quad 1 \le i \le T, \quad \text{where } c = \frac{1}{\sum_{j=1}^{T} 1/j^{1-\theta}}

where the exponent (1 - θ) controls the skew of the distribution: the greater the exponent, the more skewed the distribution. Typical exponent values found in the literature range from 0.73 to 2 ([1, 4]). Note that in a hierarchical storage video server, with primary and secondary storage acting as caches for the tertiary, the secondary storage cache will absorb requests for the most popular videos. This will affect the skew of the Zipf distribution. However, our experiments have shown that reasonably-sized secondary storage caches (with a storage capacity of 5% of the tertiary backing store) do not have the required bandwidth to service all requests for the most popular videos. Thus, the requests received by the tape library continue to follow a quite skewed distribution, and can be a significant percentage of the total requests submitted to the video server (see Appendix A).

In the unicasting environment a request, in the best case, will have to wait t_access seconds in order to be serviced. If the requested tape must first be brought on-line, then the total service time cannot be less than t_access + t_mount seconds. Naturally, these times do not include any queuing delays for tapes, drives, and robots. In multicasting environments, a request may 'piggyback' on another, since with a single (possible) robot access and a single drive access, all requests for the same video are served. Thus, the total cost is amortized over all piggybacking requests.

Our queuing system model is depicted in figure 1. It uses a queue where incoming requests wait for service, termed the wait-for-idle-drive queue (WIDQ). When a drive becomes idle, the scheduling algorithm picks a request (group) from this queue to serve. Since robot involvement may be required, there is also a queue where the selected requests wait for a robot arm to become available, termed the wait-for-idle-robot queue (WIRQ). The WIRQ is only required when the number of drives is greater than the number of robots.
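To make the Zipf workload model just described concrete, here is a minimal sketch (ours, not the authors' code; the function names are illustrative only) of how the access probabilities could be computed and sampled:

```python
import random

def zipf_probabilities(T, theta):
    """Access probabilities p_i = c / i^(1 - theta), i = 1..T, normalized
    to sum to 1. theta = 0.271 gives exponent ~0.73 (mildly skewed);
    theta = -0.271 gives exponent ~1.27 (more skewed)."""
    weights = [1.0 / (i ** (1.0 - theta)) for i in range(1, T + 1)]
    c = 1.0 / sum(weights)
    return [c * w for w in weights]

def draw_request(probs, rng=random):
    """Pick a tape index (0-based) according to the access probabilities."""
    r, acc = rng.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

probs = zipf_probabilities(T=300, theta=0.271)
print(probs[0], probs[-1])   # most vs. least popular video
```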
[Figure 1: Queuing system for the library model. Incoming requests enter the Idle Drive Queue and are dispatched to the drives (drive 1, ..., drive D); selected requests needing a tape exchange pass through the Idle Robot Queue to the robots (robot 1, ..., robot R); timed-out requests leave the system as rejections.]

As mentioned, the requests to the library require simultaneous resource allocation. Each request requires first a tape resource, second a drive resource, and, when both have been assigned to it, possibly a robot resource (if the desired tape is not already loaded onto the assigned drive). There are two main types of events in the library: the arrival of a new request, and the completion of the service of a request. The essence of the system's operation is as follows (a minimal sketch of the dispatch step is given after the list):

On arrival of a new request:
- the request enters the WIDQ.

On completion of a request (or while there is an idle drive):
- the drive and its tape (if any) are declared 'available' for locking;
- the scheduling algorithm is applied to the requests in the WIDQ; the selected request must be for an available tape;
- the idle drive and the selected tape are locked;
- the request is queued at the robot, if necessary, in which case the playback starts upon the completion of the robot service (taking t_mount time).

Requests waiting past a certain timeout exit the system (i.e., the system rejects them). Also, for simplicity, in this paper we assume that requests in the WIRQ are served on a FCFS basis.
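The dispatch step above could be sketched as follows. This is our illustration of the model, with hypothetical attribute and method names (tape.in_use, drive.loaded_tape, scheduler.select), not code from the paper:

```python
def dispatch(widq, idle_drive, scheduler, wirq):
    """One scheduling step of the simulated library (sketch).

    widq      : list of waiting requests (each knows its .tape)
    idle_drive: drive object with .loaded_tape
    scheduler : policy picking a request among candidates (FCFS, MQL, ...)
    wirq      : FCFS queue of (drive, tape) pairs waiting for a robot arm
    """
    # Only requests whose tape is not busy on another drive are candidates.
    candidates = [r for r in widq if not r.tape.in_use]
    if not candidates:
        return
    req = scheduler.select(candidates)          # policy-specific choice
    widq.remove(req)
    req.tape.in_use = True                      # lock tape and drive together
    idle_drive.busy = True
    if idle_drive.loaded_tape is req.tape:
        idle_drive.start_playback(req)          # no robot exchange needed
    else:
        wirq.append((idle_drive, req.tape))     # wait t_mount for a robot arm
```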
Table 1 shows the system parameters and their typical values.

Table 1: System and Problem Parameters

    Parameter   Explanation                                                Values
    D           Number of drives                                           5
    R           Number of robots                                           1 or 2
    T           Number of tapes                                            300
    t_access    Drive access time (see note below)                         60, ..., 230 secs
    t_mount     Mount time: robot + drive load/unload                      25 secs
    RO          Replication overhead                                       0 or 20% storage space
    p_i         The access probability of the i-th video                   Zipf or uniform
    θ           Skew parameter for Zipf                                    +0.271, -0.271
    t_out       Timeout before requests are rejected                       5, 20 mins
    λ_i         Mean number of requests for the i-th video per time unit   --
    TP          Throughput in requests/hour achievable by an algorithm     --
    MPL         Multiprogramming level (number of users)                   20, ..., 200 requests

Note (on t_access): an MPEG-2 video with a bandwidth of 4 Mb/s and a duration of 80 minutes (4800 seconds) needs 4800 x 4 Mbits = 2.4 GB of storage. A tape drive that transfers data at a rate of 12 MB/s will retrieve the video in 2.4 GB / (12 MB/s) = 200 seconds.
4 Scheduling Algorithms for Requests in the Video Tape Library

Next, we outline three basic scheduling algorithms.

FCFS Scheduling: This scheduling algorithm picks the request closest to the head of the queue (i.e., the oldest one) whose tape is available (i.e., not used by another request). This algorithm serves as a reference point for comparison purposes.

Bypass Scheduling: This is similar to 'Shortest Job First': a younger request is allowed to bypass older requests if it does not need a (costly) robot access. Bypass selects from the queue the oldest request for a tape that is already loaded in the idle drive. If no such request exists, Bypass behaves as FCFS. Bypass tries to minimize the costly robotic operations. In a multicasting environment, Bypass scheduling allows all requests in the queue waiting for the same video as the selected request to piggyback onto the selected request. Bypass is an unfair algorithm, suffering from starvation. Thus, to become useful, it has to incorporate some aging mechanism to avoid starvation. In this paper, we study Bypass without aging, noting that its performance is an upper bound for the performance of any useful Bypass-based algorithm with aging!

Maximum Queue Length (MQL) Scheduling: MQL is suitable for environments where multicasting is allowed. The algorithm partitions the waiting queue into a number of queues, one for each tape. At scheduling time, it selects to serve the queue with the greatest length. The motivation is to increase as much as possible the amortization benefits gained from multicasting. This algorithm has been studied in a similar context, namely that of efficient batching policies for video requests, where it has been found to perform very well ([4]). MQL also suffers from starvation. A sketch of the three selection rules follows.
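For concreteness, the three selection rules could be sketched as below. This is our illustration with hypothetical request/drive attributes; the MQL variant returns a whole request group, reflecting multicasting:

```python
def fcfs_select(widq, idle_drive, available):
    """Oldest request whose tape is available."""
    for req in sorted(widq, key=lambda r: r.arrival_time):
        if available(req.tape):
            return req
    return None

def bypass_select(widq, idle_drive, available):
    """Oldest request for the tape already loaded in the idle drive;
    otherwise fall back to FCFS. Requests for never-loaded tapes can
    starve, which is why Bypass needs an aging mechanism in practice."""
    loaded = [r for r in widq if r.tape is idle_drive.loaded_tape]
    if loaded:
        return min(loaded, key=lambda r: r.arrival_time)
    return fcfs_select(widq, idle_drive, available)

def mql_select(widq, idle_drive, available):
    """Available tape with the longest queue of waiting requests;
    returns the whole group so all its members can piggyback."""
    queues = {}
    for r in widq:
        queues.setdefault(r.tape, []).append(r)
    best = max((t for t in queues if available(t)),
               key=lambda t: len(queues[t]), default=None)
    return queues[best] if best is not None else None
```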
5 A Formal Approach

5.1 Formal Problem Formulation
Our goal is to find out which tape request to serve when a drive becomes available. Our answer is to pick the request that maximizes the 'relief ratio'. Next we give the definitions and the justification.

First, we study a more structured problem. Consider a queueing setting where we are given (i) one server (e.g., an I/O channel or a tape drive), able to broadcast/multicast data items (e.g., video movies), (ii) requests for T distinct video items, (iii) the i-th video item has service length L_i time units, (iv) the i-th video item is accessed with probability p_i, and (v) the requests for item i arrive with a mean rate λ_i. Assuming that the access probabilities p_i remain constant over time, we want to find how often we should schedule the multicast of item i (i = 1, ..., T). That is, we want to find the cycle time C_i for each item, so that we minimize the average waiting time (i.e., start-up latency) of the requests. Notice that the cycle time C_i is an integer multiple of the unit time, and that it measures the time units from the beginning of one broadcasting of item i to the beginning of the next.

Theorem 1 For a single multicasting server with T items and access probability p_i for the i-th item, the optimal cycle time C_i for the i-th item is given by

    C_i = \left( \sum_{j=1}^{T} \sqrt{L_j p_j} \right) \sqrt{L_i / p_i}

Proof: The average waiting time W is given by

    W = \sum_{i} p_i C_i / 2                                               (1)

The fraction of time units during which the server is engaged with video i is L_i / C_i. Assuming that our server is never idle (i.e., utilization = 1), these fractions should sum to 1:

    \sum_{i=1}^{T} L_i / C_i = 1                                           (2)

Thus, we want to minimize Eq. 1 subject to the constraint of Eq. 2. Using the theory of Lagrange multipliers, we have

    F(h, C_1, ..., C_T) = \sum_{i=1}^{T} \tfrac{1}{2} p_i C_i + h \sum_{i=1}^{T} L_i / C_i,
    \quad \partial F / \partial C_i = 0, \quad i = 1, ..., T

or

    \tfrac{1}{2} p_i + h \left( -L_i / C_i^2 \right) = 0 \;\Rightarrow\; C_i = \sqrt{2 h L_i / p_i}   (3)

and, using Eq. 2, we get \sqrt{2h} = \sum_{i=1}^{T} \sqrt{p_i L_i}, to give finally C_i = \left( \sum_{j=1}^{T} \sqrt{L_j p_j} \right) \sqrt{L_i / p_i}. QED
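As a quick numeric sanity check of Theorem 1 (our addition, with arbitrary example values), the closed-form cycle times do satisfy the utilization constraint of Eq. 2:

```python
from math import sqrt

def optimal_cycles(L, p):
    """C_i = (sum_j sqrt(L_j * p_j)) * sqrt(L_i / p_i), from Theorem 1."""
    s = sum(sqrt(Lj * pj) for Lj, pj in zip(L, p))
    return [s * sqrt(Li / pi) for Li, pi in zip(L, p)]

L = [200.0, 200.0, 100.0]   # service lengths in seconds (example values)
p = [0.6, 0.3, 0.1]         # access probabilities, summing to 1
C = optimal_cycles(L, p)
print(sum(Li / Ci for Li, Ci in zip(L, C)))     # ~1.0: never idle (Eq. 2)
print(sum(pi * Ci / 2 for pi, Ci in zip(p, C))) # average waiting time (Eq. 1)
```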
5.2 The Proposed Scheduling Algorithm: 'Relief'
For our initial scheduling problem, the crucial observation is that Eq. 3 implies that

    \tfrac{1}{2} p_i C_i^2 / L_i = h = \text{constant}, \quad i = 1, ..., T       (4)

Under the assumptions that (i) the overall mean arrival rate λ is constant, and (ii) the mean arrival rate for item i is λ_i, we get that

    p_i = λ_i / λ                                                                 (5)

and, since λ is constant, Eq. 4 becomes

    \tfrac{1}{2} λ_i C_i^2 / L_i = \text{constant}, \quad i = 1, ..., T           (6)

Intuitively, λ_i C_i accounts for the average number of requests we serve in every broadcasting of the i-th item; λ_i C_i · C_i / 2 represents their cumulative waiting time (or the cumulative 'relief' enjoyed when broadcasting the i-th item!). When divided by L_i, it gives us the amount of 'relief' enjoyed per unit time of broadcasting of item i. Let's call it the 'relief ratio' for the i-th item/tape:

    \text{relief\_ratio}_i \equiv (\text{cumulative waiting time}_i) / L_i        (7)

'Relief' scheduling: when the access probabilities p_i are unknown, the single multicasting server should choose the object i with the maximum 'relief ratio'. The justification is as follows. Eq. 6 implies that, under an optimal choice of the cycles C_i (i = 1, ..., T), the relief ratio of item i at the start of its broadcasting is the same constant for all items. This means that the relief ratio of item i is higher than the relief ratio of every other item at the moment we start broadcasting item i: if some item j had a higher relief ratio, its relief ratio would only increase further while waiting to be broadcast, and this would further increase the inequality between the relief ratios of items i and j, which goes against the principle of Eq. 6. This way we are assured that we did our best to achieve the maximum possible relief during the next L_i service time units. In fact, if the access probabilities are constant over time, 'Relief' is indeed optimal, automatically leading to optimal-length cycles.

Corollary 1 Our 'Relief' heuristic is optimal when the access probabilities p_i are constant over time (even if they are unknown to us!).

Proof (Sketch): Our server will automatically pick the i-th object that results in the optimal cycle times of Theorem 1. QED

Our proposed scheduling algorithm is based on the previous proof that the relief ratio is the key for choosing the next request to serve. The Relief algorithm attempts to improve performance by minimizing the average start-up latency of video requests. This is achieved by calculating, for every request i in the queue, (a) its total wait time and (b) its service time, and computing its 'relief ratio' as wait_time_i / service_time_i. Intuitively, the selected requests have either a long waiting time and/or require a small service time. Thus, in essence, the algorithm picks the request that yields the greatest wait-time relief for the smallest resource-occupation time.

Under multicasting, all requests for the same video form request groups, and the algorithm then selects the group with the highest group relief ratio. The group relief ratio is computed by summing up the waiting times of all members of the group and dividing by the service time (counted once for the whole group). In addition to the novelty of the Relief algorithm, we note that aging mechanisms in scheduling algorithms are typically devised in an ad-hoc manner, while ours rests on sound formal arguments.
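A minimal sketch of Relief's selection step (our illustration; tape-availability filtering, handled as in section 4, is omitted here, and the request attributes are hypothetical):

```python
import time

def relief_select(widq, service_time, now=None):
    """Pick the request group with the maximum relief ratio:
    (sum of waiting times of the group's requests) / (service time,
    counted once for the whole group). Under unicasting every group
    has a single member and this reduces to wait_time / service_time."""
    now = now if now is not None else time.time()
    groups = {}
    for r in widq:                      # group requests by requested video
        groups.setdefault(r.video, []).append(r)

    def relief_ratio(video):
        cumulative_wait = sum(now - r.arrival_time for r in groups[video])
        return cumulative_wait / service_time(video)

    best = max(groups, key=relief_ratio)
    return groups[best]                 # all members piggyback on one read
```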
6 Video Tape Replication

Having a single copy of each object might result neither in maximum drive utilization nor in satisfactory start-up delays. To see this, consider the following scenario: in a library with ten tape drives, in which the Bypass, Relief, or MQL scheduling discipline is used, suppose there is a video with an access probability of 0.9. This means that around 90% of the requests are pending for that "hot" object, a few drives have a small number of requests pending on them, and probably some drives are idle. The following observation arises naturally: if we had another replica of that "hot" object, then one of the idle drives could be utilized and the performance of the system would improve. This leads us to the notion of object replication: each of the T distinct objects might have a number of replicas. Furthermore, it is logical to maintain as many replicas of an object as its access probability induces. This way, for any two objects whose access probabilities have ratio a, the ratio of the numbers of their replicas also equals a. We denote by T' the total number of tapes used to store replicas, and we refer to T', expressed as a fraction of T, as the replication overhead. For example, a library with T = 100 and 20% replication overhead has T' = 20 tapes used for replication and a total of T + T' = 120 tapes in the system.

Replication can be applied with any of the basic scheduling algorithms, regardless of multicasting. The algorithms already described need to be tuned further. So far, the algorithms were described with a single copy of each tape: if an algorithm selected a particular (group) request, that request had to be for an available tape. That is, if the desired tape was already in use on an active drive, then the algorithm had to select another request and repeat the process until a request with an available tape was selected. When there are replicas, each algorithm selects a (group) request to serve as long as at least one replica of the desired tape is available. A sketch of the proportional replica assignment follows.
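The proportional replica assignment just described could be sketched as follows. The largest-remainder rounding is our assumption, chosen so that the replica counts sum exactly to T':

```python
def assign_replicas(probs, extra_tapes):
    """Distribute T' = extra_tapes replica tapes so that each object's
    replica count is (roughly) proportional to its access probability.
    Largest-remainder rounding keeps the total exactly at extra_tapes."""
    ideal = [p * extra_tapes for p in probs]
    counts = [int(x) for x in ideal]
    leftovers = sorted(range(len(probs)),
                       key=lambda i: ideal[i] - counts[i], reverse=True)
    for i in leftovers[:extra_tapes - sum(counts)]:
        counts[i] += 1
    return counts   # counts[i] = extra replicas of object i (primary excluded)

# Example: T = 100 objects with 20% replication overhead => T' = 20 tapes.
```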
7 Performance Results

For simplicity, when investigating how the system's resources (drives and robots) affect performance, our primary performance metric is the system throughput with rejections "turned off". The section that focuses on the performance comparison of the scheduling algorithms uses as primary performance metrics the rejection ratio and the system throughput under rejection-ratio constraints, since these are the meaningful metrics in real-world applications.

The number of distinct tapes (and, therefore, objects) is set to T = 300 and a single robot arm is considered in the system (R = 1, unless otherwise stated). The robot mount time (t_mount) is set to 25 seconds and the drive access time (t_access) is set to 200 seconds (unless otherwise stated). We use the notation RO for "Replication Overhead"; an RO of 0.2 means that there are T × 0.2 = 60 additional tape cartridges which store replicas of the objects. We use the Zipf distribution (θ = 0.271, unless otherwise stated) to model the access probability distribution of the objects.
7.1 Impact of the Drive Access Time and Number of Robots
[Figure 2: Drive access time impact. Two panels: (a) R=1, D=5, Unicast, RO=0, θ=0.271 and (b) R=1, D=5, Multicast, RO=0, θ=0.271; curves for Relief with tAccess = 60, 80, 100, 120 s; x-axis: MPL (N); y-axes: Throughput (reqs/h) and Average Startup Latency (min).]
Figures 2a and 2b show the throughput as a function of the MPL for Relief, in unicasting and multicasting environments respectively. The first thing to notice is that the throughput under unicasting drops as the access time increases. This is easily explained, since each request takes more time to be serviced, forcing the waiting ones to stay longer in the queue. Interestingly enough, under multicasting, the Relief algorithm is not affected by the drive access time if the latter is less than about 100 seconds. This happens because, under multicasting, the throughput of the system is much greater than under unicasting, making the robot resource an even more severe bottleneck. This situation ceases to occur when, in the average case, at most one drive is waiting to get loaded at any time, i.e., when the time elapsed between two consecutive media exchanges for the same drive (i.e., the drive access time) is long enough for the robot arms to load the rest of the drives:

    t_access \ge (D - 1) \cdot t_mount / R                                        (8)

According to the above formula, the minimum access time for which the robots are not the bottleneck is (D - 1) · t_mount / R, which equals 100 seconds in our graphs. This last observation suggests a criterion for when an extra robot arm is required and/or when no more drives are needed, and for which t_access values the system's resources are well utilized; a direct coded check of Eq. 8 is sketched below.
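Eq. 8 reads directly as a configuration check. A minimal sketch (ours), using the paper's default t_mount = 25 secs:

```python
def robot_is_bottleneck(t_access, D, R, t_mount=25):
    """Eq. 8: the robots keep up iff t_access >= (D - 1) * t_mount / R."""
    return t_access < (D - 1) * t_mount / R

print(robot_is_bottleneck(t_access=80, D=5, R=1))   # True: below the 100 s threshold
print(robot_is_bottleneck(t_access=120, D=5, R=1))  # False: one robot keeps up
```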
In figure 3, we show how increasing the number of robots by one affects the above-mentioned graphs.
[Figure 3: Impact of the number of robots. D=5, tAccess=60s, Unicast, RO=0, θ=0.271, Relief; curves for R=1 and R=2; x-axis: MPL (N), y-axis: Throughput (reqs/h).]

For both unicasting and multicasting schemes, there is a considerable improvement in throughput. Additionally, note that the robot is no longer the bottleneck. When, for example, t_access = 60 s, we see a considerable performance improvement by adding a second robot arm.
7.2 Impact of the Number of Drives
Figures 4a, 4b, and 4c show that the throughput increases proportionally with the number of drives. This happens because the number of requests in the system is large enough that none of the drives is idle. Therefore, doubling the number of drives doubles the throughput and cuts the start-up latency in half. However, there is a limiting number of drives beyond which the performance of the system does not improve proportionally to the number of drives, or does not improve at all (see figures 4a and 4b, graphs for D = 10). This is due to the fact that the single robot arm has now become the bottleneck, which results in drive under-utilization. The same holds for each of the Bypass, FCFS, and MQL policies (although not shown).
[Figure 4: Effect of the number of drives on the system's performance. Three panels: (a) R=1, tAccess=80s, Zipf, Multicast, RO=0, θ=0.271, Relief; (b) R=1, tAccess=[50s-110s], Zipf, Unicast, RO=0, θ=0.271, Relief; (c) R=1, tAccess=[50s-110s], Uniform, Unicast, RO=0, θ=0.271, Relief; curves for D = 1, 3, 5, 10; x-axis: MPL (N), y-axis: Throughput (reqs/h).]
From figures 4b and 4c one can observe that the more skewed the access frequencies, the better the performance under the Relief discipline. This happens because in Relief (as well as in Bypass), under skewed distributions, the selected request has a higher probability of finding the desired video on-line. On the other hand, although not shown, for the FCFS and MQL algorithms almost no improvement is observed if a more skewed distribution is used. The FCFS policy benefits from the more skewed distribution only if the first request picked from the WIDQ, when an idle drive emerges, is for the video just released. The probability that this occurs increases only very slightly when the uniform distribution is replaced by the Zipf one.
7.3 Performance Comparison of Scheduling Algorithms Under Unicasting
We use t_access = 200 secs, to stay sufficiently away from the point where robot bottlenecks arise, as seen in section 7.1.

7.3.1 Rejection Ratio
The first performance metric of concern is the rejection ratio, defined as the fraction of rejected requests over all requests submitted to the system. For this reason, in our runs, we associate with each request a timeout, after which, if the request has not yet been scheduled, it is rejected. Figure 5 shows the results for two different timeout values. There are two main observations:

1. FCFS, MQL, and Relief have very similar performance. This is expected, since in a unicasting environment all three algorithms tend to choose requests from the head of the WIDQ. For this reason we only show in the graph the performance of Relief and Bypass.

2. Regarding the rejection ratios of Bypass and Relief: (a) Bypass benefits more when the number of rejections is high (either because of low timeout values or because of high MPL values). This happens because, in Bypass, rejected requests get another chance, when they re-enter the system, to be "lucky" and request a loaded tape. (b) Relief is, by nature, a fair algorithm: it tries to distribute the waiting times evenly among requests and establish the same waiting time for all of them. As the MPL increases, this common waiting time approaches and eventually surpasses the rejection timeout. This leads to smaller rejection ratios for Relief at smaller MPL values.

These two observations explain why Bypass is closer to Relief for a timeout value of 5 minutes, and for high MPL values when the timeout is 20 minutes. The most important thing to note, however, is that for small timeout values (t_out < 20 mins), and for high MPL values combined with bigger timeout values (e.g., a timeout of 20 minutes and an MPL greater than about 30 requests), the rejection ratios are unacceptable. The results in this subsection have significant utility in that they can help the video storage system establish an admission controller, so that requests are not admitted into the system if a given rejection-ratio constraint would be violated.
[Figure 5: Effect of the timeout value in Relief and Bypass. R=1, D=5, tAccess=[170s-230s], Unicast, RO=0, θ=-0.271; curves: Bypass and Relief for tOut = 20 mins and tOut = 5 mins; x-axis: MPL (N), y-axis: Rejection Ratio.]

7.3.2 Throughput with Rejection Ratio Constraints
Given that real video storage systems with very high rejection ratios would be intolerable, we now focus on the throughput of the various scheduling algorithms, given that the admission controller of a video server will only admit the maximum number of requests that can be serviced without exceeding a certain threshold on the rejection ratio. Table 2 shows these results, which were obtained as follows: first, we determined, from the graphs of the previous subsection and for each algorithm, the maximum MPL value for which the rejection ratio constraint (≤ 0.1) is satisfied; subsequently, we turned to the graph showing the throughput of the algorithms as a function of the MPL (not shown here for space reasons) and determined the corresponding throughput of each algorithm at its maximum MPL value (a small helper expressing this procedure is sketched after the table).

Table 2: D=5, t_access=200 secs, θ=-0.271, t_out=20 mins, RR ≤ 0.1, RO=0, unicast

    Algorithm   max MPL   TP (reqs/hour)
    Relief      25        96.32
    MQL         25        94.92
    FCFS        19        64.14
    Bypass      24        97.80
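The two-step procedure used to fill Table 2 can be expressed as a small helper (ours; rejection_ratio and throughput stand for the curves measured in the simulations):

```python
def throughput_under_constraint(mpls, rejection_ratio, throughput, rr_max=0.1):
    """For one algorithm: find the largest MPL whose measured rejection
    ratio satisfies the constraint, then report the throughput there."""
    feasible = [m for m in mpls if rejection_ratio(m) <= rr_max]
    if not feasible:
        return None, 0.0
    best_mpl = max(feasible)
    return best_mpl, throughput(best_mpl)
```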
We can see that Relief achieves the highest maximum MPL value and slightly better throughput compared to MQL and considerably better throughput compared with FCFS. It has, however, slightly worse throughput than Bypass. In particular, Relief achieves 1.47% higher throughput than MQL, 50.17% higher throughput than FCFS, and 1.51% lower throughput than Bypass.
7.4 Performance Comparison of Scheduling Algorithms Under Multicasting

7.4.1 Rejection Ratio
[Figure 6: Rejection ratios for two different Zipf access distributions. Two panels: R=1, D=5, tAccess=200s, tOut=20mins, Multicast, RO=0, with (a) θ=0.271 and (b) θ=-0.271; curves: Bypass, FCFS, Relief, MQL; x-axis: MPL (N), y-axis: Rejection Ratio.]

Figure 6 shows the rejection ratios of the algorithms, for a timeout equal to 20 minutes, for two different Zipf distributions. We can see from the figure that:

1. FCFS does not exploit the multicasting capability and hence has the highest average start-up latency, which naturally leads to greater rejection ratios.

2. Bypass, on the other hand, allows more multicasting than FCFS but does not specifically exploit it as MQL and Relief do. Bypass is mainly concerned with picking the request group that requires no robot exchange. That is why its performance lies between those of MQL and Relief.

3. The Relief algorithm has better rejection-ratio performance than MQL for MPL values greater than 40. The better performance is due to the fact that, by design, Relief aims to relieve the system of as much waiting misery as possible per service time unit.

4. For more skewed access distributions, (i) the rejection ratio values are smaller for all algorithms except FCFS, and (ii) the difference between the rejection ratios of Relief and MQL starts at smaller MPL values.
7.4.2 Throughput with Rejection Ratio Constraints
Table 3 shows these results which were obtained as in the case of Table 2.
Table 3: D=5, t_access=200 secs, θ=-0.271, t_out=20 mins, RR ≤ 0.1, RO=0, multicast

    Algorithm   max MPL   TP (reqs/hour) for max MPL   Relief's improvement over it
    Relief      48        221.84                       --
    MQL         31        161.28                       37.54%
    FCFS        27        73.2                         203.06%
    Bypass      42        174.04                       27.46%
We can see that Relief achieves the highest maximum MPL value and the highest throughput at that value.
8 Impact of Tape Replication

Figures 7a and 7b show the throughput of Relief as a function of the MPL under unicasting, for three different D-values, R = 2, and θ = -0.271. We chose R = 2 because we saw earlier that throughput is not improved when we increase the number of drives and stay with one robot. The conclusions drawn from this study follow. The first three conclusions also hold for the MQL and Bypass algorithms, and for both metrics: rejection ratio and throughput without timeouts. For this reason, and for space reasons, we only show the performance of Relief with no timeouts under unicasting, but we also report on our other important findings.

1. Although not shown, only a slight improvement was evidenced for less skewed access distributions (θ = 0.271). This improvement is due to the fact that replication increases the probability that the selected requests find an 'available' tape, and thus allows the algorithm's optimization to work to a greater extent.

2. For more skewed (θ = -0.271) distributions, the performance improvement is very significant (see figures 7a and 7b). The additional improvement is due to greater drive utilization. For example, for more skewed distributions, having more drives while there is only one replica per tape results in some drives being very underutilized.

3. For multicasting environments (not shown in the graphs), replication does not help. This is because the biggest contributor to improvement is, by far, the multicasting feature itself.

4. An interesting, subtle behavior is that the throughput of the system with D = 5 is better than that with D = 10 when there is no replication under unicasting. This is because of a lower aggregate drive utilization for D = 10 than for D = 5. The latter, in turn, is due to the fact that when D = 10, more requests have a chance to be serviced by drives; these requests typically induce more robot exchanges than the workload requests served when D = 5. This conclusion is also supported by the significantly lower rejection ratio we observed when D = 10.
[Figure 7: Impact of replication. Two panels: (a) R=2, tAccess=[170s-230s], Unicast, RO=0, θ=-0.271, Relief and (b) R=2, tAccess=[170s-230s], Unicast, RO=0.2, θ=-0.271, Relief; curves for D = 3, 5, 10; x-axis: MPL (N), y-axis: Throughput (reqs/h).]
9 Concluding Remarks

We have studied the performance behavior of a robotic video tape library under a variety of workloads (access distributions, access times, unicasting/multicasting) and under a variety of scheduling algorithms (FCFS, MQL, Bypass, and Relief). Our major contributions are:

- the problem definition, along with its nuances (skew of tape-access probabilities, unicasting/multicasting, simultaneous resource allocation, replication, performance metrics, etc.);
- the design of Relief, a novel, 'fair' scheduling algorithm, and its near-optimality proof (in contrast, other scheduling algorithms use ad-hoc mechanisms to achieve fairness);
- the extensive experimentation, showing that Relief outperforms its competitors by 203% (compared to FCFS), 27% (compared to Bypass), and 37% (compared to MQL) in throughput, for the same rejection ratio (note that Bypass and MQL are unfair and starvation-bound!);
- the conditions under which the robot-arm and drive resources become the bottlenecks (Eq. 8).

Ongoing and further work includes the study of tape-replication schemes and the study of tape library scheduling algorithms for other applications.
References

[1] A.L. Chervenak, "Tertiary Storage: An Evaluation of New Applications", Ph.D. Thesis, Department of Computer Science, University of California, Berkeley, 1994.
[2] A.L. Chervenak, "Challenges for tertiary storage in multimedia servers", Parallel Computing, Jan. 1998, pp. 157-176.
[3] S. Christodoulakis, P. Triantafillou, and F. Zioga, "Principles of Optimally Placing Data in Tertiary Storage Libraries", in Proc. of the 23rd Intern. Conf. on Very Large Data Bases (VLDB), August 1997.
[4] A. Dan, D. Sitaram, and P. Shahabuddin, "Dynamic Batching Policies for an On-Demand Video Server", Multimedia Systems, 1996, pp. 112-121.
[5] S. Ghandeharizadeh and C. Shahabi, "On Multimedia Repositories, Personal Computers, and Hierarchical Storage Systems", ACM Multimedia 94.
[6] L. Golubchik, R.R. Muntz, and R.W. Watson, "Analysis of Striping Techniques in Robotic Storage Libraries", 14th IEEE Symposium on Mass Storage Systems, 1995, pp. 225-238.
[7] L. Golubchik and R.K. Rajendran, "A study on the use of tertiary storage in multimedia systems", Proc. of the joint IEEE/NASA Symp. on Mass Storage Systems, March 1998.
[8] J. Henessy, "The role of data storage in broadcasting's future", www.tvbroadcast.com/archive/2897.4.htm, 1997.
[9] B.K. Hillyer and A. Silberschatz, "On the Modeling and Performance Characteristics of a Serpentine Tape Drive", in Proc. of the ACM SIGMETRICS Intern. Conf., 1996.
[10] B.K. Hillyer and A. Silberschatz, "Random I/O Scheduling in Online Tertiary Storage Systems", in Proc. of the ACM SIGMOD Conference on the Management of Data, pp. 195-204, 1996.
[11] B.K. Hillyer and A. Silberschatz, "Scheduling non-contiguous tape retrievals", Proc. of the joint IEEE/NASA Symp. on Mass Storage Systems, March 1998.
[12] T. Johnson, "An Analytical Performance Model of Robotic Storage Libraries", Performance Evaluation, 27&28, pp. 231-251, 1996.
[13] T. Johnson and E.L. Miller, "Performance Measurements of Tertiary Storage Devices", in Proc. of the 24th Intern. Conf. on Very Large Data Bases (VLDB), August 1998.
[14] T. Johnson and E.L. Miller, "Benchmarking tape system performance", Proc. of the joint IEEE/NASA Symp. on Mass Storage Systems, March 1998.
[15] A. Kraiss and G. Weikum, "Vertical Data Migration in Large Near-Line Document Archives Based on Markov Chain Predictions", 23rd Intern. Conf. on Very Large Data Bases (VLDB), August 1997.
[16] S.W. Lau, J.C.S. Lui, and P.C. Wong, "A Cost-Effective Near-Line Storage Server for Multimedia Systems", IEEE Intern. Conf. on Data Engineering, 1995.
[17] D. Simpson, "Untangle your tape storage costs", Datamation, June 1997.
[18] P. Triantafillou and T. Papadakis, "On-Demand Data Elevation in Hierarchical Multimedia Storage Servers", in Proc. of the 23rd Intern. Conf. on Very Large Data Bases (VLDB), August 1997.
Appendix A

In this appendix, we show that, in a hierarchical storage server (HSS), the access frequency distribution of the tape requests that miss the disk-based cache and reach the tertiary storage level is skewed enough to be approximated by a (skewed) Zipf distribution. We considered a tape library able to store 750 GB of information in 300 tapes (2.5 GB per tape). Considering that a disk cache pool is about 5% of the tertiary backing store, and that a hard disk is able to store 2.5 GB of data, we need 15 disks, each storing a single movie. We then used a θ value for the Zipf distribution equal to -1 (which results in the most skewed access distributions suggested in [1]). For such a skewed distribution, and for 1000 requests in the overall system, the most frequent movie is requested 610 times. Provided that the bandwidth of each disk can support up to 20 MPEG-2 streams simultaneously, the 15 disks can support 300 requests for the most frequent movie (this means that the movie is replicated on each of the 15 disks). The remaining 310 requests for the most popular movie, along with all the requests for the rest of the movies, miss the cache and head towards the tertiary storage level. We studied the missed requests and compared their probability distribution to the one derived from the Zipf formula by setting θ to -0.271. We compared the skew of the two vectors (using vector majorization and Schur function theory) and observed that the requests that reach the tape library form a more skewed distribution than the one resulting from the Zipf formula with θ = -0.271. Thus, the θ values selected in this paper are appropriate.
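The appendix arithmetic can be reproduced with a few lines (our sketch; the small discrepancy with the reported 610 comes from rounding):

```python
def zipf(T, theta):
    """Same Zipf model as in section 3: p_i = c / i^(1 - theta)."""
    w = [1.0 / (i ** (1.0 - theta)) for i in range(1, T + 1)]
    c = 1.0 / sum(w)
    return [c * x for x in w]

p = zipf(T=300, theta=-1.0)      # exponent 1 - theta = 2: very skewed
total_requests = 1000
top = p[0] * total_requests       # ~609 requests for the hottest movie
                                  # (the paper reports 610)
disk_capacity = 15 * 20           # 15 disks x 20 MPEG-2 streams = 300
print(round(top), max(0, round(top) - disk_capacity))  # ~610 hit, ~310 missed
```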