A Prediction-based and Cost-based Replica Replacement Algorithm: Research and Simulation
Ma Teng, Luo Junzhou
Department of Computer Science and Engineering, Southeast University, Nanjing, China

Abstract. Because of the high latencies of the Internet, it is a big challenge to access such large and widely distributed data quickly on data grids. To address this challenge, large amounts of data need to be replicated in multiple copies at several distributed sites. However, the number and size of storage resources are limited, so a good replacement algorithm is important to the performance and efficiency of replication technologies. In this paper, we introduce a new replica replacement algorithm that combines prediction factors and replacement cost factors. By predicting the popularity of replicas in future time windows, hot spot replicas are kept to improve mean job time. Cost factors capture the cost of replacing a replica, such as network latency and bandwidth, replica size and system reliability. The resulting replacement algorithm achieves a good balance between mean job time and bandwidth resource consumption. Using the OptorSim simulator to compare our PC-based algorithm with traditional replacement algorithms, we show that the PC-based algorithm improves data access within the overall data grid.

1. Introduction
A data grid enables thousands of scientists at various universities and research centers to collaborate and share their data and resources. It connects a collection of hundreds of geographically distributed computers and storage resources located in different parts of the world to facilitate the sharing of data. The main challenge is supporting fast data access on the Internet; the major barrier is the high latencies of Wide Area Networks. To address this challenge, large amounts of data need to be replicated in multiple copies at several worldwide distributed sites. However, the number and size of storage resources are limited. So a good replication strategy is needed to anticipate and analyze users' data requests and to replace unpopular replicas accordingly. Current replication technologies use only static or manually driven replication services to manually replicate and

replace data files. The Grid environment, however, is highly dynamic: resource availability and network performance change constantly, and data access requests also vary with each application and each user. To simplify the mathematical model, we make an assumption about the replica access pattern in section 3.2. The predictions for replication decisions are derived from accumulated historical access statistics, while the costs of the replication decisions are calculated from many factors, such as network latency and bandwidth, replica size and system reliability. An algorithm combining prediction with cost reduces the mean job time, lowers network bandwidth consumption and access latency, and improves the performance of the overall data grid. In the remainder of the paper, we first introduce some traditional replica replacement algorithms and then present our replica replacement strategy, which consists of two algorithms: a prediction-based replica replacement algorithm and a cost-based replica replacement algorithm. By simulating our Prediction-based and Cost-based (PC-based) algorithm and traditional replica replacement algorithms (LRU, LFU) in the Data Grid simulator OptorSim in section 4, we argue that the replica replacement algorithm we propose offers better performance than traditional replica replacement algorithms.

2. Related work
Various replica replacement algorithms and data grid simulations have been studied in recent years, among them the popularity-based model [4], the economic model [5][6] and OptorSim [7]. In [4] the authors show a decentralized architecture for adaptive media dissemination. They define the popularity of the media as a Zipf distribution and give a formula based on levels of replica popularity. Similarly, our prediction algorithm also makes an assumption about replica popularity, but the mathematical model we assume is not the same as the model in [4]. The economic model [5] is used for optimizing file replication in a Data Grid. In this model, there are two main classes of actors: Computing Elements and

Proceedings of the 19th International Conference on Advanced Information Networking and Applications (AINA’05) 1550-445X/05 $20.00 © 2005 IEEE

Storage Brokers. Based on a mapping from file space to file-ID space, content similarity is defined; that definition is also used in this paper. In [6] the authors argue that replication optimization is obtained via the interaction of the actors in the model, whose goals are maximizing the profits and minimizing the costs of data resource management. The authors provide four classes of actors: Computing Element, Access Mediator, Storage Broker and Storage Element. Marketplace protocols for the interaction of the four actors are also provided. OptorSim [7] is a data grid simulator built mainly for the economic model. The main advantage of OptorSim is that it performs two-stage optimization: scheduling decisions are based on both the location of data and the status of network links between grid sites, while optimization during the run-time of a job takes into account dynamic variations in the distribution of data and in the behavior of network resources. The disadvantage of OptorSim is that all replicas are assumed to have the same size, so its economic model does not take the replacement cost into account. This differs from real computing environments.

3. Replica replacement strategy
This section first introduces four traditional replica replacement algorithms, LRU, LFU, Random and FIFO, and then presents the prediction-based and cost-based replacement algorithms.

3.1. Traditional replica replacement algorithms
Because the storage of an SE is limited, the performance of the SE drops dramatically when its storage level reaches some threshold. So when a new replica arrives and the storage level of the SE is above the threshold, the replica replacement algorithm must be called to delete one of the existing replicas to make room for the new one. The traditional replacement algorithms are the Least Frequently Used algorithm (LFU), the Least Recently Used algorithm (LRU), First In First Out (FIFO), and the Random replacement algorithm.

The LFU algorithm replaces the least accessed replica in a time window with the new replica. Because it is based on the access status of the recent time window, it cannot strictly reflect long-term access status.

The LRU algorithm replaces the replica that has not been used for the longest time, which favors the most recently accessed replicas. However, just like the LFU algorithm, it cannot predict the hot spot replicas of the next time window.

The FIFO algorithm discriminates against the earlier arriving replicas, discarding the oldest one first. Like the two algorithms above, it cannot predict hot spot replicas either.

The Random replacement algorithm randomly selects one of the replicas to abandon. Its advantage is the ease of implementation; its disadvantage is that a hot spot replica may be replaced out, causing falling throughput across the overall data grid and larger network bandwidth consumption. Sometimes it causes performance jolts.

Given the advantages and disadvantages of these four algorithms, we introduce a prediction-based and cost-based replica replacement strategy.
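The four traditional policies above can be illustrated with a minimal sketch (our own illustration, not the paper's implementation; the class and method names are assumed for the example):

```python
import random
from collections import OrderedDict


class ReplicaStore:
    """Toy storage element holding replica IDs up to a fixed capacity."""

    def __init__(self, capacity, policy="lru"):
        self.capacity = capacity
        self.policy = policy           # "lru", "lfu", "fifo", or "random"
        self.entries = OrderedDict()   # replica_id -> access count, insertion-ordered

    def access(self, replica_id):
        if replica_id in self.entries:
            self.entries[replica_id] += 1
            if self.policy == "lru":   # a hit refreshes recency; FIFO keeps arrival order
                self.entries.move_to_end(replica_id)
            return
        if len(self.entries) >= self.capacity:
            self._evict()
        self.entries[replica_id] = 1

    def _evict(self):
        if self.policy in ("lru", "fifo"):   # oldest by recency / by arrival
            victim = next(iter(self.entries))
        elif self.policy == "lfu":           # least accessed in the window
            victim = min(self.entries, key=self.entries.get)
        else:                                # random replacement
            victim = random.choice(list(self.entries))
        del self.entries[victim]
```

For example, with capacity 2 under LFU, the access sequence 1, 1, 2, 3 evicts replica 2, since it was accessed less often than replica 1.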

3.2. The Prediction-based replica replacement algorithm
First, we introduce the concept of content similarity. The definition of content similarity in a Data Grid is still an open problem; here we just introduce a definition that can be used in the prediction functions we define. We define the replica space {R} as the set of all replicas and the replica-ID space {I} as the set of replica identifiers, which are assumed to be positive integers. Content similarity can then be defined through a mapping between {R} and {I}: the smaller the difference between two identifiers, the bigger the content similarity between the corresponding replicas. Based on the assumptions above, a history of replica requests can be seen as a random walk in the space of replica identifiers. Assuming replica access starts from the replica with identifier $i_0$, the access events can be seen as a series of random steps producing identifiers $i_1, i_2, \ldots$, in which $i_{t+1} = i_t + s_{t+1}$. According to spatial and temporal locality, files close in file space are more likely to be requested close together in time. We only consider the situation where a single user submits a batch of jobs onto the data grid, so the replica requests of a user are correlated with each other. We assume the steps are as follows:

[Figure 1. An example of a replica access pattern in replica-identifier space: the accessed identifier performs a random walk over time t.]

According to the access pattern above, each step $s_i$ is 1 or -1;
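This access model can be sketched in a few lines (a hedged illustration of the assumed pattern; the function name and fixed up-probability p are our own):

```python
import random


def access_pattern(start_id, steps, p=0.5, seed=None):
    """Random walk in replica-identifier space: each request moves the
    current identifier by +1 with probability p, else by -1."""
    rng = random.Random(seed)
    ids = [start_id]
    for _ in range(steps):
        step = 1 if rng.random() < p else -1
        ids.append(ids[-1] + step)
    return ids
```

Each generated identifier differs from its predecessor by exactly 1, matching the unit-step assumption above.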

Proceedings of the 19th International Conference on Advanced Information Networking and Applications (AINA’05) 1550-445X/05 $20.00 © 2005 IEEE

We assume the probability of $s_i = 1$ is $p$ and the probability of $s_i = -1$ is $q = 1 - p$, so the displacement after $n$ steps is

$$S_n = \sum_{i=1}^{n} s_i$$

Each $s_i$ can be seen as an experiment whose output is 1 or -1. Repeating the experiment $n$ times independently gives an $n$-Bernoulli experiment. Let the number of events $\{s_i = 1\}$ be $X$; the number of events $\{s_i = -1\}$ is then $n - X$, and

$$S_n = X - (n - X) = 2X - n$$
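The distribution of the displacement $S_n$ can be checked numerically; a small sketch (our own, using Python's `math.comb`; the function name is assumed):

```python
from math import comb


def prob_S_n(n, k, p):
    """P(S_n = k) for a +/-1 random walk of n independent steps,
    where each step is +1 with probability p. Nonzero only when
    n + k is even and |k| <= n."""
    if (n + k) % 2 or abs(k) > n:
        return 0.0
    up = (n + k) // 2                        # number of +1 steps
    return comb(n, up) * p**up * (1 - p)**(n - up)


# sanity check: probabilities over all reachable displacements sum to 1
total = sum(prob_S_n(6, k, 0.3) for k in range(-6, 7))
```

As a quick consistency check, a displacement of 0 after 3 steps has probability 0 (parity), while after 2 fair steps it has probability 1/2.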

The prediction function returns the most probable number of times a file will be requested within a time window $T'$ in the future, based on the requests (for that file or similar files) within a past time window $T$. Let $f(k, n)$ denote the predicted number of accesses of the replica whose identifier is the positive integer $k$ during the next $n$ replica accesses; this number is predicted from the historical record of replica accesses.

Since reaching displacement $k$ after $t$ steps requires $\frac{t+k}{2}$ upward steps and $\frac{t-k}{2}$ downward steps, the probability is

$$P(S_t = k) = \binom{t}{\frac{t+k}{2}} p^{\frac{t+k}{2}} q^{\frac{t-k}{2}}$$

when $t + k$ is even and $|k| \le t$, and 0 otherwise, in which $t$ is the overall number of steps so far.

We assume:
$i_0$: the identifier of the replica accessed at time $t_0$ (the measured starting value)
$T$: the historical time window
$T'$: the predicted time window
$M$: the number of events $\{s_i = 1\}$ in the historical time window
$N$: the number of events $\{s_i = -1\}$ in the historical time window

The step probabilities are estimated from history as $p = \frac{M}{M+N}$ and $q = \frac{N}{M+N}$, and the number of accesses expected in the future window is scaled by the window ratio:

$$n = (M + N) \cdot \frac{T'}{T}$$

$f(k, n)$ is then computed as follows:
1. Initialize $c_k = 0$, in which $c_k$ is the counter for the replica whose identifier is $k$.
2. For $t = 1$ to $n$: add $P(S_t = k - i_0)$ to $c_k$, computed according to the formula above.
3. The resulting $f(k, n) = c_k$ represents the popularity of replica $k$ in the predicted time window.
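The prediction steps above can be put together in a short sketch (our own reading of the procedure; function and parameter names are assumed, and n is rounded to an integer step count):

```python
from math import comb


def predict_popularity(start_id, M, N, T, T_future, target_id):
    """Predict how often replica `target_id` is reached within the future
    window T_future, given M up-steps and N down-steps observed in the
    historical window T. Returns the accumulated counter c_k."""
    p = M / (M + N)                      # estimated P(step = +1)
    q = N / (M + N)                      # estimated P(step = -1)
    n = round((M + N) * T_future / T)    # expected number of future requests
    offset = target_id - start_id        # displacement k - i_0
    count = 0.0
    for t in range(1, n + 1):            # accumulate P(walk is at target after t steps)
        if (t + offset) % 2 or abs(offset) > t:
            continue                     # unreachable by parity or distance
        up = (t + offset) // 2
        count += comb(t, up) * p**up * q**(t - up)
    return count
```

As expected from the content-similarity assumption, identifiers close to the starting replica receive higher predicted popularity than distant ones.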

3.3. Cost-based replica replacement algorithm

