Hierarchical Scheduling Algorithms for Near-Line Tape Libraries

Peter Triantafillou, Dept. of Electronics and Computer Engineering, Technical University of Crete, Chania 73100, Greece, [email protected]

Ioannis Georgiadis, Department of Computing, Imperial College, 180 Queen's Gate, London SW7 2BZ, UK, [email protected]

Abstract

Robotic tape libraries (RTLs) currently enjoy a prominent place in the storage market, with a reported average annual growth rate approaching 34%, primarily due to their low cost per MB. Given the ever-increasing storage requirements of many modern applications, this trend will continue. However, despite this, and even though RTL access times remain very slow (e.g., tens of seconds), the central issue of efficiently scheduling accesses to RTLs has not received the attention it deserves. This paper presents a study of efficient scheduling algorithms for tape-based robotic storage libraries. It contributes novel algorithms and experimentally evaluates their performance against that of well-known algorithms found in other environments. The paper’s main contribution, hierarchical scheduling algorithms, can offer significant performance improvements, such as approximately 50% shorter average service times, while at the same time providing fair schedules.

1. Introduction

Tertiary storage devices have become a very attractive choice for applications with large storage requirements. Complex scientific projects and new applications, such as digital libraries and video-based servers, handle large amounts of data, which can only be stored economically on tapes. The tertiary storage devices considered in this paper consist of a group of media drives, a group of robots, and a storage rack containing the media, which are magnetic tapes/cartridges. These devices are called Robotic Storage Libraries (RSLs). RSLs provide unique cost per MB advantages over magnetic disks, but they also have features that limit their integration with most applications. RSLs can store orders of magnitude more data at lower cost than magnetic disk arrays, and they can achieve high data rates. However, even high-end products exhibit access times of tens of seconds with large variance.

2. Scheduling in RSLs

Due to the high cost of media switches and the high latency caused by uncorrelated accesses to an RSL, clever scheduling of I/O requests could result in significant performance gains. A good scheduling scheme should try to (i) reduce the average service time of a request and (ii) at the same time provide a fair schedule. In this paper we consider scheduling algorithms for requests that involve only retrieval of data, without striping and caching at the level of the tape library.

2.1. Related work

A significant amount of earlier work provided the basis for this research. Some of it dealt with detailed modeling of the behavior of various models of RSLs. Other work derived scheduling schemes for serving requests to a given tape. Chervenak [1] initiated efforts to accurately model different technologies of RSLs. She measured the performance of low-end, medium, and high-end tape libraries under three different workloads. Johnson and Miller [7] studied the behavior of six different robotic libraries belonging to the two most common tape recording technologies, namely helical scan and serpentine. Hillyer and Silberschatz [5, 6] developed a detailed model of the seek operation for both a serpentine and a modified serpentine tape drive through measurement of commercial devices. They derived a number of scheduling algorithms for single-tape accesses. Georgiadis, Faloutsos, and Triantafillou [3] developed and measured the performance of four scheduling schemes in a video server environment. Lau, Lui, and Wong [9], Triantafillou and Papadakis [11], Chervenak [2], and Golubchik and Rajendran [4] all studied issues related to the use of tertiary storage in multimedia systems. To our knowledge, the issues pertinent to the efficient scheduling of requests in RSLs have not been adequately studied. In this paper we contribute new scheduling algorithms and experimentally evaluate their performance under different workload types, using a popular tape library as a case study.

3. Scheduling algorithms

A request has at least the following fields: the ID of the desired tape, the position on the tape where the desired data blocks reside, and the size of the data to be transferred. Thus, the service of a request by a drive involves a drive seek and a data retrieval. Translation of the logical seek position and data size fields of the request depends on the particular tape drive technology used and does not affect the structure of the request. Requests arrive at a given mean rate, following a particular distribution. They are placed in the scheduler’s queue of requests. The queue is scanned, and a request (or a batch of requests) is extracted, possibly reordered, and sent to the server for execution.

3.1. The First In First Out (FIFO) algorithm

The simplest scheduling scheme is FIFO. The requests are executed in the order in which they arrive at the RSL. This scheme is expected to perform poorly under heavy workloads, because it leads to a large number of costly tape switches. Furthermore, the drive will remain idle for long periods of time, waiting for tape switches to complete, or will spend much of its time mounting and unmounting tapes. Essentially, FIFO is not a scheduling scheme; it is used only for comparison purposes.

3.2. The Shortest Access Time First (SATF) algorithm

Access time is defined as the time it takes for requested data to start flowing, i.e., the time to load the desired tape, seek to the specified position on the tape, and start reading. SATF scans the scheduler’s entire queue of requests, extracts the request that needs the shortest access time, and sends it to the server for execution. Under a heavy workload, when the scheduler’s queue of requests has an adequately large average length, the algorithm is expected to minimize tape switches and positioning overhead. However, the algorithm is greedy and, as a result, it may exhibit request ‘starvation’. SATF is therefore expected to produce a large variance in the service time of a request. SATF is the first of the algorithms presented here that needs to estimate the time of some part of the execution of a request: it needs an estimate of the tape switch and seek times (see footnote 1).
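The selection rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a single drive, a `Request` record holding the three fields listed in Section 3, and a toy access-time estimate whose constants (`switch_s`, `seek_s_per_mb`) are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Request:
    tape_id: int        # ID of the desired tape
    position_mb: float  # position of the data on the tape
    size_mb: float      # amount of data to transfer


def access_time(req: Request, loaded_tape: int, head_mb: float,
                switch_s: float = 30.0, seek_s_per_mb: float = 0.001) -> float:
    """Toy estimate of the time until data starts flowing (assumed model)."""
    if req.tape_id != loaded_tape:
        # Assumption: a tape switch, then a seek from the start of the new tape.
        return switch_s + seek_s_per_mb * req.position_mb
    return seek_s_per_mb * abs(req.position_mb - head_mb)


def satf_pick(queue: List[Request], loaded_tape: int, head_mb: float) -> Request:
    """Scan the whole queue and remove the request with the shortest access time."""
    best = min(queue, key=lambda r: access_time(r, loaded_tape, head_mb))
    queue.remove(best)
    return best
```

Because every call rescans the whole queue and always takes the cheapest request, a request for a tape that is not loaded can be passed over repeatedly while cheaper requests keep arriving, which is the starvation risk noted above.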

3.3. The OPT(N,K) Algorithm

The optimal scheduling algorithm enumerates all permutations of the requests in the scheduler’s queue, estimates the execution cost of each permutation, and chooses the one with the minimum execution cost. Obviously, the optimal scheduler has algorithmic complexity O(n!), where n is the number of requests in the scheduler’s queue. To be practical, the algorithm is applied only to the first N requests of the queue. That is, N requests are extracted from the scheduler’s queue, the execution cost is calculated for all permutations of these requests, and the permutation with the minimum cost is selected. The parameter K controls how often the optimal schedule for N requests is recomputed: when K requests from the batch of N requests created in the previous stage have been sent to the server for execution, the process is repeated. Hence the name OPT(N,K). (Similar versions of the OPT algorithm have been suggested in [8] for disk scheduling.) In order to find the best permutation, an estimate of the execution cost of each permutation must be calculated. Calculating this is an elaborate process, especially for large N.
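A minimal Python sketch of the OPT(N,K) loop follows. It makes several assumptions that the text does not specify: requests are reduced to positions on a single tape, the execution cost of a permutation is the total head travel (the hypothetical helper `travel`), and the N-K requests of a batch that were not dispatched return to the front of the queue for the next round.

```python
from itertools import permutations
from typing import List, Sequence


def travel(head: float, order: Sequence[float]) -> float:
    """Toy execution cost: total head movement when visiting `order` in sequence."""
    total = 0.0
    for pos in order:
        total += abs(pos - head)
        head = pos
    return total


def opt_n_k(positions: Sequence[float], n: int, k: int, head: float = 0.0) -> List[float]:
    """Return the dispatch order produced by OPT(N,K) under the assumptions above."""
    assert 1 <= k <= n
    pending = list(positions)
    dispatched: List[float] = []
    while pending:
        batch, rest = pending[:n], pending[n:]
        # Exhaustive search over all permutations of the batch: O(N!) per round.
        best = min(permutations(batch), key=lambda p: travel(head, p))
        sent, kept = list(best[:k]), list(best[k:])
        dispatched.extend(sent)
        head = sent[-1]
        # Assumption: undispatched requests rejoin the front of the queue.
        pending = kept + rest
    return dispatched
```

For the queue `[50, 10, 40, 20]` with the head at 0, `opt_n_k(..., n=3, k=3)` dispatches `[10, 40, 50, 20]`, whereas `opt_n_k(..., n=3, k=1)` recomputes after every request and dispatches `[10, 20, 40, 50]`. A smaller K lets newly visible requests influence the schedule, at the price of more frequent O(N!) searches.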

Footnote 1: In most cases, only the average tape switch time, a simplified model of the seek time, and a constant transfer rate are provided. The alternative is to run a number of experiments on the RSL and construct a model based on the measurements. The model can then be employed by all scheduling algorithms that need to estimate the cost of execution of a request. This method has been applied by Hillyer and Silberschatz [5, 6] and Johnson and Miller [7].

3.4. Clustering and Hierarchical Scheduling (CLUST)

According to this scheme: (i) requests coming to the scheduler are grouped according to some criterion, such as the tape they need to access; (ii) a group is chosen based on some particular criteria; and (iii) a scheduling algorithm is used to schedule the requests in the group. Thus, the algorithm can be viewed as a hierarchical scheduler with two levels [10]. At the upper level, a group of requests all asking for the same tape is selected. At the lower level, the group’s requests are scheduled. The major objective of the algorithm is to minimize the overhead, but not at the cost of starvation. That is, it must perform better than SATF and OPT(N,K) without suffering from a large variance in average service time.

At the upper level, CLUST employs a modified version of the round-robin policy (M-RR). One queue per tape is maintained. Every queue has a field that is either marked or unmarked. Initially all queues are unmarked. The first time M-RR is called, it chooses the longest queue of requests and places a mark on it. Then it sends the selected queue to the low-level algorithm. The subsequent calls to M-RR proceed as follows:

1. If all queues have been marked, all marks are removed. The longest unmarked queue is selected.
2. The longest marked queue is found. If its length is greater than the length of the queue selected in step 1, then that marked queue is unmarked.
3. The queue that was selected in step 1 is marked.

M-RR, theoretically, suffers from starvation: it is possible to constantly alternate between the queues of requests for the two most popular tapes. Thus, for highly skewed tape accesses, the criteria for unmarking recently marked queues (step 2) should be stricter. In general, however, this scheme can achieve both relative fairness, by slowly moving to queues of decreasing length, and a performance improvement over traditional round-robin, as it avoids removing popular tapes from drives. There are several other policies that could have been used at the upper level: the Maximum Queue Length (MQL), the queue of requests with the greatest data size, the queue with the oldest requests, etc. Later, some of these alternatives are experimentally evaluated.

At the lower level, OPT(N,N) is employed. The first N requests of the queue selected in the previous stage are extracted and reordered to give the best execution sequence. Thus, the algorithm is parameterized on the number of requests N, and CLUST(N) denotes this algorithm. Obviously, instead of OPT, any of the algorithms that appear in the literature for single-tape scheduling [5], such as FIFO, shortest seek time first, etc., could be used.

Once a batch of N requests has been selected and reordered, it is pushed into a queue associated with a drive. All requests of that queue need the same tape and will be served by the drive associated with the queue. When the last request of such a queue is extracted, the algorithm is called again to produce a new batch of requests for the corresponding drive. The procedure is repeated until all these queues are filled with requests.
The scheduler then collects the first request from each of these queues and sends it to the server.

3.4.1. An example.

Figures 1 through 3 show the stages of CLUST(3) using M-RR at the upper level and OPT(3,3) at the lower level. For simplicity, the library has only four tapes and three drives. Initially (Figure 1):
• Tape 1 is needed by five requests and its queue is unmarked. Tape 2 is needed by four requests and its queue is unmarked. Tape 3 is needed by six requests and its queue is marked. Tape 4 is needed by one request and its queue is marked (Figure 1-1).
• Drives 1 and 2 do not have any requests in their queues. Drive 3 has requests from a previous call to the algorithm (Figure 1-2).
• We assume that drives 1 and 3 are idle, so the requests at the head of their respective queues will be sent immediately to the server.

[Figure 1. Initial stage of scheduler. Panel 1: queues of requests for the same tape (tapes 1 to 4). Panel 2: queues of requests assigned to the same drive (drives 1 and 2 are empty; drive 3 holds requests 41, 42, and 43).]
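The M-RR rules of Section 3.4 can be sketched in Python and applied to the initial state of this example. The sketch rests on assumptions the text leaves open: only non-empty queues are considered, ties in length are broken in favor of the lowest tape ID, and the function name `m_rr_select` and its data layout (a length per tape plus a set of marked tape IDs) are hypothetical.

```python
from typing import Dict, Optional, Set


def m_rr_select(lengths: Dict[int, int], marked: Set[int]) -> Optional[int]:
    """Pick the next tape queue with modified round-robin; updates `marked` in place."""
    # Assumption: empty queues are never candidates.
    nonempty = {tape: n for tape, n in lengths.items() if n > 0}
    if not nonempty:
        return None

    def rank(tape: int):
        # Longer queues first; lowest tape ID wins ties (assumed tie-break).
        return (nonempty[tape], -tape)

    # Step 1: if every queue is marked, clear all marks; take the longest unmarked queue.
    if all(tape in marked for tape in nonempty):
        marked.clear()
    selected = max((t for t in nonempty if t not in marked), key=rank)

    # Step 2: unmark the longest marked queue if it is longer than the selected one.
    marked_queues = [t for t in nonempty if t in marked]
    if marked_queues:
        longest_marked = max(marked_queues, key=rank)
        if nonempty[longest_marked] > nonempty[selected]:
            marked.discard(longest_marked)

    # Step 3: mark the selected queue.
    marked.add(selected)
    return selected
```

With the initial state of the example (`lengths = {1: 5, 2: 4, 3: 6, 4: 1}`, `marked = {3, 4}`), the first call returns tape 1 and removes the mark from tape 3. Once three requests have been extracted from the queue of tape 1 (leaving `lengths[1] = 2`), the second call returns tape 3 and unmarks nothing, which matches Stages 1 and 2 of the example.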

[Figure 2. M-RR and OPT(3,3) applied once. Panel 1: queues of requests for the same tape. Panel 2: queues of requests assigned to the same drive.]

Stage 1 (Figure 2):
1. M-RR finds that the queue of tape 1 is the longest unmarked queue and that the queue of tape 3 is the longest marked queue and is also longer than the queue of tape 1. So it marks queue 1 and removes the mark from queue 3 (Figure 2-1).
2. The first three requests are extracted from queue 1 and reordered to form an optimal sequence.
3. They are placed in the first empty drive queue found, that of drive 1 (Figure 2-2).

Stage 2 (Figure 3): M-RR is applied once more for the second empty drive queue (the queue of drive 2):
1. M-RR finds that the queue of tape 3 is the longest unmarked queue and that there is no marked queue longer than the queue of tape 3. Therefore, it just marks queue 3 (Figure 3, part 1).

2. The first three requests are extracted from queue 3 and reordered to form an optimal sequence. They are placed in the first empty queue found, that of drive 2 (Figure 3-2).

[Figure 3. M-RR and OPT(3,3) applied for a 2nd time. Panel 1: queues of requests for the same tape. Panel 2: queues of requests assigned to the same drive.]

Stage 3: The procedure now stops since no drive queue has been left empty. Drives 1 and 3 are idle, so the first requests in the queues of drives 1 and 3 are extracted and sent to the server.

4. Performance evaluation

We present our performance results, based on a simulation testbed consisting of approximately 20,000 lines of commented C++ code. The above algorithms have been evaluated under three different workloads, using the Ampex DST812 library as a case study. The primary performance metrics are: (i) the average total delay (waiting and service time) and (ii) the relative deviation of the average total delay (defined as the standard deviation over the average delay), which gives an indication of the predictability of the algorithms’ performance and of their fairness.

Three distinct workloads are employed in an effort to investigate the behavior of an RSL and the performance of the scheduling algorithms under different access patterns.
• Random images workload: The objects are images of constant size 100KB. All tapes are equally popular and all objects on each tape are equally popular.
• Digital library workload: The objects are multimedia documents containing short video clips (MPEG-II video of duration less than 30 sec) of constant size 10MB. The tapes’ popularity follows Zipf’s law, and all objects of the same tape are equally popular.
• Video-server workload: The objects are MPEG-II videos of duration 100 min. and total size 2.7 GB. The tapes’ popularity follows Zipf’s law and all objects of a tape are equally popular.

Request arrival intervals follow an exponential distribution. The mean arrival interval is gradually decreased (i.e., the load is increased) until the system becomes unstable.

4.2. Case study: The Ampex DST812

The Ampex DST 812 tape library consists of 256 19mm tapes with a capacity of 50GB each, 4 DST 312 helical scan drives, and a single robot. Tapes have multiple load/eject zones. Drives have a transfer rate of 20MB/sec and a search rate of 1.6GB/sec. For simplicity, the seek model is a linear function of the seek distance, with a startup cost of 8.5 sec (this has been found to closely approximate long seeks in helical scan drives). Also for simplicity, no distinction is made among long seeks, short seeks, and the first seek after a mount. The minimum transfer size is 1024 KB. Mount time is 10.1 sec, eject time is 12.24 sec with relative deviation of 3.1 sec [7]. A tape switch takes 7 sec.
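To make the simplified cost model concrete, the following Python sketch plugs in the drive parameters quoted above. It is an illustration only: the unit conversions (1 GB = 1000 MB, 1024 KB = 1 MB), the choice to charge the startup cost on every seek, and the function names are assumptions, not details given in the text.

```python
SEEK_STARTUP_S = 8.5        # startup cost of the linear seek model (sec)
SEARCH_RATE_MB_S = 1600.0   # 1.6 GB/sec search rate, assuming 1 GB = 1000 MB
TRANSFER_RATE_MB_S = 20.0   # 20 MB/sec transfer rate
MIN_TRANSFER_MB = 1.0       # 1024 KB minimum transfer, assuming 1 MB = 1024 KB


def seek_time(distance_mb: float) -> float:
    """Linear seek model: fixed startup cost plus distance over search rate."""
    # Assumption: the startup cost applies to every seek, whatever its length.
    return SEEK_STARTUP_S + distance_mb / SEARCH_RATE_MB_S


def transfer_time(size_mb: float) -> float:
    """Transfer at a constant rate, never moving less than the minimum size."""
    return max(size_mb, MIN_TRANSFER_MB) / TRANSFER_RATE_MB_S


def service_time(distance_mb: float, size_mb: float) -> float:
    """Seek plus transfer for a request on an already mounted tape."""
    return seek_time(distance_mb) + transfer_time(size_mb)
```

Under these assumptions a 10MB document transfers in 0.5 sec and a 100KB image in 0.05 sec, both small next to the 8.5 sec seek startup, whereas a 2.7 GB video takes 135 sec to transfer. Positioning overhead therefore dominates the first two workloads and transfer time dominates the video-server workload.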

4.3. Discussion

FIFO, as expected, is the worst scheduling algorithm for all workloads. OPT performs surprisingly badly for all workload types under heavier loads. This is because it considers only the first few requests in the queue, while the average queue length can be 30-100 requests. SATF is a greedy algorithm, as can be seen from the large deviation in the average total delay it exhibits. Still, it is the second-best algorithm. It outperforms OPT because it scans the entire queue of requests.

[Figure 4. Digital library workload (total delay). Plot title: Ampex DST812 (10MB object size). Y-axis: total delay (sec). X-axis: arrival interval (sec), from 30.2 down to 7.5. Series: FIFO, SATF, OPT(6,1), CLUST(7).]

CLUST(7) with M-RR is the best algorithm both in terms of response time (average total delay) and in terms of relative deviation. It performs well even in the random image retrieval workload, where the tapes’ access pattern is uniform. CLUST(7) with modified round-robin (M-RR) at the upper level gives better results than CLUST(7) with the maximum queue length policy (MQL) or traditional round-robin (RR) at the upper level. In the video server workload (see footnote 2), since the dominant part of the service time is the transfer time for the large objects, the improvements due to our hierarchical algorithms are smaller.

[Figure 5. Alt. versions of CLUST(7) (total delay). Plot title: Ampex DST812 (10MB object size). Y-axis: total delay (sec). X-axis: arrival interval (sec), from 30.2 down to 7.5. Series: CLUST(7), MQL; CLUST(7), RR; CLUST(7), M-RR; SATF.]

[Figure 6. Alt. versions of CLUST(7) (delay deviation). Plot title: Ampex DST812 (10MB object size). Y-axis: relative deviation of total delay. X-axis: arrival interval (sec), from 30.2 down to 7.5. Series: CLUST(7), MQL; CLUST(7), RR; CLUST(7), M-RR; SATF.]

Footnote 2: Random image retrieval and video workload results are only available in the extended version of this paper, which is available through e-mail.

5. Conclusions

The paper’s main contribution, hierarchical scheduling, represents a framework for developing promising scheduling algorithms. It permits the reconciliation of seemingly conflicting goals, such as reduced access times and fair schedules. The high-level algorithm can ensure a fair schedule, while low-level scheduling algorithms allow for localized optimizations of access times. We offered a particular instance of hierarchical scheduling, which employs a modified round-robin algorithm for selecting groups of requests and the OPT(N,N) algorithm for scheduling the group’s requests. Our experimental results show that the hierarchical scheduling algorithm enjoys up to 50% shorter total delays than the second-best algorithm (SATF), while ensuring fairer schedules.

This work is part of an ongoing project that includes the expansion of our simulation testbed to (i) include detailed access-cost models for serpentine and helical-scan tape drives, already developed by other researchers, and (ii) allow the easy incorporation of new scheduling algorithms. We plan to make our completed testbed available to the community. Other work includes the development and evaluation of new scheduling algorithms, including relevant TSP heuristics, and the study of fairness issues in the higher-level scheduling algorithm (M-RR) as they depend on the skew of tape accesses.

6. References

[1] A. L. Chervenak. Tertiary Storage: An evaluation of new applications. Ph.D. dissertation, Dept. of Comp. Science, Univ. of California-Berkeley, 1994.
[2] A. L. Chervenak. Challenges for tertiary storage in multimedia servers. Parallel Computing, (24) 157-176, Jan. 1998.
[3] C. Georgiadis, P. Triantafillou and C. Faloutsos. Scheduling and performance of robotic tape libraries in video server environments. Technical report, Dept. of EE&CS, Tech. Univ. of Crete, 1997.
[4] L. Golubchik and R. Rajendran. A study on the use of tertiary storage in multimedia systems. NASA/IEEE joint Mass Storage Systems Symposium, March 1998.
[5] B. Hillyer and A. Silberschatz. Random I/O scheduling in online tertiary storage systems. In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, 1996.
[6] B. Hillyer and A. Silberschatz. Scheduling non-contiguous tape retrievals. In joint IEEE/NASA Symposium on Mass Storage Systems, 1998.
[7] T. Johnson and E. Miller. Performance measurements of tertiary storage devices. Proceedings of the 24th VLDB Conference, New York, USA, 1998.
[8] D. Jacobson and J. Wilkes. Disk scheduling based on rotational position. Tech. Report HPL-CSP-91-7, Feb. 1991.
[9] S.W. Lau, C.S. Lui, and T.C. Wong. A cost-effective nearline storage for multimedia systems. IEEE Int. Conf. on Data Engineering, 1995.
[10] Y. Roboyannakis, P. Muth, G. Nerjes, M. Paterakis, P. Triantafillou and G. Weikum. Disk scheduling for mixed-media workloads in a multimedia server. ACM Multimedia 98, Sep. 1998.
[11] P. Triantafillou and T. Papadakis. On-demand data elevation in a hierarchical multimedia storage server. Proceedings of the 23rd VLDB Conference, Athens, Greece, 1997.