A Round-based Data Replication Strategy

Mohammad Bsoul, Alaa E. Abdallah, Khaled Almakadmeh, and Nedal Tahat

Abstract—A Data Grid allows many organizations to share data across a large geographical area. The idea behind data replication is to store copies of the same file at different locations, so that if a copy at one location is lost or unavailable, the file can be brought from another location. Data replication also reduces access time and bandwidth consumption, because a file can be fetched from a closer location. However, the files to be replicated have to be selected wisely. In this paper, a round-based data replication strategy is proposed that selects the most appropriate files for replication at the end of each round based on a number of factors. The proposed strategy builds on the Popular File Replicate First (PFRF) strategy and overcomes its drawbacks. The simulation results show that the proposed strategy yields better performance in terms of average file delay per request, average file bandwidth consumption per request, and percentage of files found.

Index Terms—Data Grid, replication strategy, PFRF, round-based, simulation
1 INTRODUCTION
A Data Grid is a collection of storage and computational resources distributed over a wide area network, providing remote access to data and other resources [1], [2], [3], [4]. In a Data Grid, the users are spread across a large geographical area. These users need access to a large volume of data, which might reside on a distant node, and such access consumes a large amount of time and bandwidth. Hence, data replication is needed to keep more than one copy of the same file at different nodes, which lets a user fetch a file from its own storage or from the storage of a close node. As a result, both the consumed time and the consumed bandwidth are reduced [5]. In the literature, many researchers have proposed strategies [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17] to select the best nodes to which files are replicated. Nevertheless, none of these strategies is round-based, and they do not take into account the variation that might occur in user behavior. Dividing the time into rounds usually leads to a better decision on which files to replicate, because the decision is made after a large number of file requests and therefore reflects more accurately which files need to be kept in the nodes' storages. In real situations, user behavior might change, since users sometimes change their interests in files. The PFRF strategy [18] fills those gaps by dividing the time into rounds and, at the end of each round, calculating the popularity of the different files. It then replicates only a percentage of the most popular files to the different clusters.
M. Bsoul and A. E. Abdallah are with the Department of Computer Science and Applications, Hashemite University, Zarqa 13115, Jordan. E-mail: {mbsoul, aabdallah}@hu.edu.jo.
K. Almakadmeh is with the Department of Software Engineering, Hashemite University, Zarqa 13115, Jordan. E-mail: [email protected].
N. Tahat is with the Department of Mathematics, Hashemite University, Zarqa 13115, Jordan. E-mail: [email protected].
Manuscript received 6 Apr. 2014; revised 30 Nov. 2014; accepted 1 Dec. 2014. Date of publication 6 Jan. 2015; date of current version 16 Dec. 2015. Recommended for acceptance by S. Olariu.
Digital Object Identifier no. 10.1109/TPDS.2015.2388449
On the other hand, the PFRF strategy has a number of drawbacks. First, it does not determine to which cluster node a file is replicated; a number of factors should be used in that determination, such as the number of requests, the free storage space, and the node centrality. Second, it considers only the number of requests to determine file popularity (importance), whereas other important factors exist, such as how many times the file was requested in the last round and the file size. Third, in PFRF, the average popularity of a file is defined as the sum of the popularities of the file in the clusters having it, divided by the number of clusters having it. However, some clusters might not have the file yet have a high request rate for it. Therefore, the average popularity of a file is better defined as the sum of the popularities of the file in all clusters divided by the number of clusters in the Data Grid. Finally, when a file is to be replicated to a cluster node without enough free storage space, PFRF compares the popularity of the new file to the popularity of each stored file separately, while in our proposed strategy the popularity of the new file is compared at once to the sum of the popularities of a group of files (one or more files, depending on the space still needed to store the new file).

In this paper, a data replication strategy named Improved PFRF (IPFRF) is proposed. This strategy is based on PFRF but overcomes its drawbacks.

This paper is structured as follows. Related works are presented in Section 2. Section 3 describes the system framework. Section 4 presents the proposed strategy, IPFRF. Section 5 describes the three metrics used to measure the performance of the strategies. Section 6 explains how the simulation is configured. The simulation results are discussed in Section 7. Finally, Section 8 concludes the paper and describes future work.
2 RELATED WORKS
This section presents the works that have studied the problem of dynamic replication in Data Grids. In [8], the authors proposed a dynamic data replication algorithm named Latest Access Largest Weight (LALW).
Their algorithm associates a different weight with each file access record: the more recent the record, the larger its weight. The first disadvantage of this algorithm is that it does not determine to which node within the cluster each file has to be replicated. The second disadvantage is that it considers only the number of requests to determine the importance of files.

The authors of [10] proposed two replication algorithms. The first is for load balancing: it replicates files to proper server locations so that the workload on the servers is balanced. The second determines the minimum number of replicas required when the maximum capacity of each server is known. The drawback of the two algorithms is that they assume user behavior is fixed.

In [6], two dynamic replication algorithms named Simple Bottom-Up (SBU) and Aggregate Bottom-Up (ABU) were proposed to reduce the average response time of data access. These algorithms place files close to the clients whose number of requests exceeds a predetermined threshold. Their disadvantage is that they take only the number of requests into consideration when determining the locations of files, although other important factors should be considered, such as a node's free storage space and its centrality.

The authors of [7] introduced an approach named FairShare Replication (FSR) to balance the load and storage usage of the servers in a Data Grid. It takes the number of requests and the load on the nodes into consideration before deciding whether to store a file. The first drawback of this approach is that it does not take node centrality into account, which is important for reducing the overall time and bandwidth consumed to obtain requested non-existing files. The second drawback is that it does not take the variation in user behavior into account.

In [12], the authors presented a strategy that is based on Fast Spread but superior to it. The idea behind this strategy is to use a threshold to determine whether a requested file needs to be copied to the requesting node. In this strategy, the only factor used in determining the importance of a file is the number of requests. In [11], another strategy based on Fast Spread was presented. It considers more factors to determine the importance of a file, such as the number and frequency of requests, the size of the file, and the last time the file was requested; however, it does not consider the scenario where user behavior might vary.

In [15], the authors implemented a strategy based on historical data access and proactive deletion. The strategy uses the historical popularity of a file to decide whether to replicate it, and the request time affects the weight of the access records. Its disadvantages are that it does not consider the size of the file as an important factor in the replication decision and that it does not consider that user behavior might vary.

The authors of [14] proposed a replication algorithm called Dynamic Hierarchical Replication (DHR). This algorithm replicates files to the sites that have the highest number of requests for them. The first drawback of this algorithm is that it considers only the number of requests to determine whether
to replicate the file or not. The second drawback is that it assumes user behavior is always fixed.

A replication algorithm named Popularity Based Replica Placement (PBRP) was proposed in [16]. This algorithm reduces data access time by creating replicas of files with a high number of requests, placing the replicas close to the nodes that requested them frequently. It has the same drawbacks as the previous algorithm.

The authors of [17] presented a category-based dynamic replication strategy for Data Grids that takes into consideration that the files existing on a node belong to different categories. Each category is given a value that determines its importance for the node, so when the node's storage is full, the node stores only the files that belong to the category with the highest value. In this strategy, the only factor used in determining the importance of a category is the number of requests.

In [9], the authors proposed a strategy called Bandwidth Hierarchy Replication (BHR), which reduces access time by avoiding congestion in the Data Grid. In BHR, the Data Grid is divided into regions, and the bandwidth between nodes within the same region is higher than the bandwidth between nodes in different regions. As a result, if the requested file is located in the same region, less time is consumed in bringing it to the requesting node. The requested file is placed on a node that has a large amount of bandwidth between it and the node where the job will be executed. A replication strategy based on BHR was introduced in [13]. There, the Data Grid is divided into regions, each with its own region header. The strategy replicates files to the nodes that requested them frequently, and it additionally increases data availability by replicating files to the region headers, so that when a node requests a file, the file can be brought from the region header of its region. The first disadvantage of BHR and this strategy is that they consider only the number of requests, and not the size, to determine the importance of each file. The second disadvantage is that, when deletion is required to provide space for the requested file, they compare the number of requests of the requested file to the number of requests of each stored file separately rather than to that of a group of stored files.

In [18], the authors proposed a new strategy named PFRF. In PFRF, the time is divided into rounds, and at the end of each round the popularity of each file is calculated. Then, only a percentage of the most popular files is replicated to the different clusters.

In addition to the drawbacks mentioned above, all these algorithms except PFRF are not round-based. As mentioned in Section 1, dividing the time into rounds usually leads to a better decision on which files to replicate, because the decision is made after a large number of file requests and therefore reflects more accurately which files need to be kept in the nodes' storages. The PFRF strategy itself has a number of drawbacks that are discussed in detail in Section 4. In this paper, we propose a new replication strategy that is based on PFRF but overcomes its drawbacks. Additionally, the proposed strategy overcomes the drawbacks of the other existing replication algorithms mentioned above by having unique features, which are discussed in detail in Section 4.
Fig. 1. System framework.
3 SYSTEM FRAMEWORK

In the system framework, there are a number of clusters. Every cluster comprises a number of nodes that are located in a close geographical area. Moreover, there is a Master site that holds all the files in the Data Grid. The storage of each cluster node is small and therefore cannot accommodate all the files in the Data Grid. For this reason, non-existing files need to be brought from other nodes. A node starts searching for a non-existing file at the closest node; if the closest node does not have the file, it searches the next closest node, and so on. Fig. 1 shows the system framework, where each rectangle represents a cluster and CN is an abbreviation for cluster node.
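To make this lookup order concrete, the following minimal Java sketch (our own illustration, not the authors' simulator code; the Node and FileLookup names are hypothetical) selects the closest node that holds a requested file and falls back to the Master site otherwise:

```java
import java.util.*;

class Node {
    final String id;
    final double x, y;
    final Set<String> storedFiles = new HashSet<>();

    Node(String id, double x, double y) { this.id = id; this.x = x; this.y = y; }

    double distanceTo(Node other) { return Math.hypot(x - other.x, y - other.y); }
}

class FileLookup {
    /** Returns the closest node that stores fileId, or null if only the Master site has it. */
    static Node findClosestHolder(Node requester, List<Node> allNodes, String fileId) {
        if (requester.storedFiles.contains(fileId)) return requester; // local hit: zero delay
        return allNodes.stream()
                .filter(n -> n != requester && n.storedFiles.contains(fileId))
                .min(Comparator.comparingDouble(requester::distanceTo))
                .orElse(null); // caller falls back to the Master site at (0, 0)
    }
}
```

A local hit is treated as zero delay, which matches the delay definition given later in Section 5.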
4 IPFRF STRATEGY

PFRF is a replication strategy that divides the time into rounds of fixed length. At the end of each round, it determines the candidate files for replication based on their order in a set named S. The algorithm consists of four phases: file access aggregation, file popularity calculation, file selection, and file replication. In the first phase, PFRF computes the number of requests for each file i in cluster c at round n, sorts these numbers of requests in decreasing order, and stores them in a list named S. Next, PFRF computes the number of files that were requested by all nodes within cluster c at round n, NOFRAN(c, n). In the second phase, PFRF computes a file popularity FP_{i,c,n} for each file i in each cluster c at round n as follows:

FP_{i,c,n} = \begin{cases} FP_{i,c,n-1} + NOR_{i,c,n} \times a, & \text{if } NOR_{i,c,n} > 0 \quad (1a) \\ FP_{i,c,n-1} \times b, & \text{if } NOR_{i,c,n} = 0 \quad (1b) \end{cases}

where NOR_{i,c,n} is the number of requests for file i in cluster c at round n, and a and b are constants with a < b; the reason why a must be less than b is described in [18]. Then, PFRF computes an average file popularity for each file at the current round as follows:

AVGPOP_i = \frac{\sum_{c=1}^{NOCHF} FP_{i,c,n}}{NOCHF}, \quad (2)

where NOCHF is the number of clusters having the file. In the third phase, PFRF sorts the set S of each cluster in decreasing order of average file popularity. Next, PFRF computes the number of candidate files for replication, CFFR_{c,n}, in each cluster c at round n as follows:

CFFR_{c,n} = NOFRAN(c, n) \times (1 - X), \quad (3)

where NOFRAN(c, n) is the number of files that were requested by all nodes within cluster c at round n, and X is a constant between 0 and 1. PFRF then chooses the first CFFR_{c,n} files from set S as cluster c's replication candidates at round n. In their study, X is set to 0.8; as a result, PFRF replicates the top 20 percent of the frequently accessed files. In the last phase, PFRF checks whether each candidate file for replication already exists on the cluster. If it exists, PFRF takes
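For illustration, the round-end computations of equations (1)-(3) can be sketched in Java as follows; the method names and flat parameter lists are our own simplification, not the PFRF implementation:

```java
import java.util.List;

class PfrfRound {
    // Equation (1): update the popularity of one file in one cluster at round n.
    static double updatePopularity(double prevFP, int requests, double a, double b) {
        return requests > 0 ? prevFP + requests * a   // (1a): requested this round
                            : prevFP * b;             // (1b): decay when not requested
    }

    // Equation (2): average popularity over the clusters that hold the file (NOCHF of them).
    static double averagePopularity(List<Double> popularitiesInHoldingClusters) {
        return popularitiesInHoldingClusters.stream()
                .mapToDouble(Double::doubleValue).average().orElse(0.0);
    }

    // Equation (3): number of candidate files for replication in cluster c at round n.
    static int candidateCount(int nofran, double x) {
        return (int) (nofran * (1.0 - x)); // with X = 0.8, the top 20% are replicated
    }
}
```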
no action. Otherwise, it checks whether there is a node in the cluster with enough free storage space to store the file. If there is such a node, PFRF replicates the file to it from the closest cluster node having the file. Otherwise, PFRF deletes a number of files that are less popular than the candidate file from a node in the cluster in order to provide enough free space.

IPFRF is based on PFRF but overcomes its drawbacks. The first drawback is that PFRF does not specify the cluster node to which a file is replicated, although the nodes within the same cluster might be interested in different files and have different statuses. In the IPFRF strategy, for each candidate file for replication i, we calculate the suitability S_{n,i} of each node n in the cluster as follows:

S_{n,i} = A \times \frac{NOR_{n,i}}{HNOR_i} + B \times \frac{FSS_n}{TSS} + C \times \left(1 - \frac{SOD_n}{HSOD}\right), \quad (4)

where S_{n,i} is the suitability of cluster node n for file i, NOR_{n,i} is the number of requests from cluster node n for file i, HNOR_i is the highest number of requests for file i among the cluster nodes, FSS_n is the free storage space of cluster node n, TSS is the total storage space, SOD_n is the sum of the distances between cluster node n and the other nodes in the cluster, and HSOD is the highest SOD among the cluster nodes. A, B, and C are constants that assign weights to the three factors, with A + B + C = 1. If the three factors are weighted equally, all three constants are set to 1/3; they can also be assigned different weights based on user preference. A user who is more interested in selecting the cluster node with the highest number of requests can increase A and decrease B and C; a user who is more interested in balancing the load on the nodes of the cluster can increase B and decrease A and C; and a user who is more interested in selecting the most central node can increase C and decrease A and B.

SOD_n is calculated by summing the distances from cluster node n to the other nodes in the same cluster, and the maximum SOD within the cluster is taken as HSOD. The number of requests reflects the cluster node's interest in the file, the free storage space is taken into account to balance the load on the nodes of the cluster, and the sum of distances represents the centrality of the node in its cluster; taking centrality into consideration reduces the overall time and bandwidth consumed to obtain non-existing files. The cluster node chosen to host the file is the node with the highest suitability value. This improvement is called Sub-strategy1 in Section 7.
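A direct transcription of equation (4) might look as follows in Java (a sketch with hypothetical names; the weights are assumed to satisfy A + B + C = 1 as stated above):

```java
class Suitability {
    /**
     * Equation (4): suitability of cluster node n for file i.
     * A, B, and C weight the request, free-space, and centrality factors.
     */
    static double suitability(double A, double B, double C,
                              int norNI, int hnorI,       // requests from node n for file i; per-node maximum
                              double fssN, double tss,    // free and total storage space of node n
                              double sodN, double hsod) { // sum of distances of node n; cluster maximum
        return A * ((double) norNI / hnorI)
             + B * (fssN / tss)
             + C * (1.0 - sodN / hsod); // central nodes have a small SOD, hence a large third term
    }
}
```

The candidate file is then replicated to the node maximizing this value.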
The second drawback of PFRF is that it considers only the number of requests to determine the popularity of files. In IPFRF, the file popularity FP_{i,c,n} for file i in cluster c at round n is instead calculated as follows:

FP_{i,c,n} = \begin{cases} \dfrac{FP_{i,c,n-1} + (NOR_{i,c,n} \times a)}{FS} \times \dfrac{NOR_{i,c,n}}{TNOR_n}, & \text{if } NOR_{i,c,n} > 0 \quad (5a) \\ 0, & \text{if } NOR_{i,c,n} = 0 \quad (5b) \end{cases}
where NOR_{i,c,n} is the number of requests from cluster c at round n for file i, FS is the file size, TNOR_n is the total number of requests at round n, and a is a constant. As in the PFRF strategy, we assume that all the files in the first round follow the binomial distribution, which means that the initial value of the file popularity FP_{i,c,0} is 0.5. Considering the file size is important because the cluster nodes have limited storage space, and small files occupy less of it. If file i was not requested at all in cluster c at round n (NOR_{i,c,n} = 0), the file size has no effect on the calculation of FP_{i,c,n}, whose value is zero whatever the file size is. As a result, the files that were requested in cluster c at round n (FP_{i,c,n} > 0) have a better chance of being replicated than the files that were not requested at all (FP_{i,c,n} = 0), even if the requested ones are large. The factor NOR_{i,c,n}/TNOR_n is the ratio of the number of requests from cluster c at round n for file i to the total number of requests in this round. This improvement is called Sub-strategy2 in Section 7.

The third drawback of PFRF is how it calculates the average popularity of a file: the sum of the popularities of the file in the clusters having it, divided by the number of clusters having it (refer to equation (2)). The IPFRF strategy instead considers all the clusters in the Data Grid in this calculation, even those that do not have the file, because a cluster that does not have the file might still have a high request rate for it and thus needs to be considered. In IPFRF, the average popularity AVGPOP_i of file i at round n is calculated as follows:

AVGPOP_i = \frac{\sum_{c=1}^{NOC} FP_{i,c,n}}{NOC}, \quad (6)

where NOC is the number of all clusters in the Data Grid. This improvement is called Sub-strategy3 in Section 7.
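The two IPFRF calculations, equations (5) and (6), can be sketched in Java as follows (our own illustration; parameter names are hypothetical):

```java
class IpfrfPopularity {
    // Equation (5): popularity of file i in cluster c at round n.
    static double filePopularity(double prevFP, int norICN, double a,
                                 double fileSize, long tnorN) {
        if (norICN == 0) return 0.0;               // (5b): unrequested files drop to zero
        return ((prevFP + norICN * a) / fileSize)  // smaller files score higher
             * ((double) norICN / tnorN);          // (5a): share of this round's requests
    }

    // Equation (6): average over ALL clusters, not only those holding the file.
    static double averagePopularity(double[] fpPerCluster) {
        double sum = 0.0;
        for (double fp : fpPerCluster) sum += fp;
        return sum / fpPerCluster.length; // fpPerCluster.length plays the role of NOC
    }
}
```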
TABLE 1
Definitions of Pseudocode Variables of the IPFRF Strategy

CList: The list that contains the clusters in the Data Grid.
NORList: The list that contains how many times each file has been requested by the specified cluster.
FileList: The list that contains the existing files on the Data Grid.
NOFRAN: The number of files that were requested by all nodes in the specified cluster.
X: A constant between 0 and 1.
CNHS: The cluster node with the highest suitability.
CNHSFSS: The CNHS's free storage space.
SFileList: The list that contains the files of the corresponding values in S.
SOS: The variable that contains the sum of the sizes of a group of files on the CNHS.
PopList: The list that contains the popularities of the existing files on the CNHS, sorted in increasing order.
GoFs: A group of files.
SizeList: The list that contains the sizes of the corresponding files in PopList.
The last drawback of PFRF is that, when deletion is required to provide space for the candidate file for replication, it compares the candidate file to each stored file separately rather than to a group. IPFRF instead compares the popularity of the candidate file to the sum of the popularities of a group of existing files at once, where the number of files in the group depends on the space still needed to store the candidate file. The reason for comparing against a group is that the candidate file might be more important than each file in the group individually, but not more important than the group as a whole. In IPFRF, after the cluster node with the highest suitability has been selected, the popularities of the files existing on this node are sorted in increasing order in a list. The group then contains the files corresponding to the first n popularities in the list whose total size is greater than or equal to the space still needed to store the candidate file, where n >= 1. If the popularity of the candidate file is greater than the sum of the popularities of the files in that group, the file is replicated to the node and the files in the group are deleted; otherwise, the file is not replicated. This improvement is called Sub-strategy4 in Section 7.
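A minimal sketch of this group-based check, assuming the stored files are already sorted by increasing popularity, might look as follows in Java (hypothetical names, not the simulator code):

```java
class GroupEviction {
    /**
     * Sub-strategy4 sketch: starting from the least popular stored files, collect the
     * smallest group whose removal frees enough space, then replicate only if the
     * candidate's popularity exceeds the group's summed popularity.
     * popsAscending and sizes describe the stored files sorted by increasing popularity.
     */
    static boolean shouldReplicate(double candidatePop, double candidateSize, double freeSpace,
                                   double[] popsAscending, double[] sizes) {
        double groupPop = 0.0;
        double freed = freeSpace; // corresponds to SOS + CNHSFSS in Pseudocode 1
        for (int i = 0; i < popsAscending.length && freed < candidateSize; i++) {
            groupPop += popsAscending[i];
            freed += sizes[i];
        }
        return freed >= candidateSize && candidatePop > groupPop;
    }
}
```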
Pseudocode 1. IPFRF Strategy at the End of Round n
1  for c = 1 to CList.size do
2    Store the values of CList(c).NORList in CList(c).S, then sort them in decreasing order;
3    for i = 1 to FileList.size do
4      if CList(c).NORList(i) > 0 then
5        FP_{i,c,n} = ((FP_{i,c,n-1} + (NOR_{i,c,n} × a)) / FS) × (NOR_{i,c,n} / TNOR_n); /*Equation (5a)*/
6      if CList(c).NORList(i) = 0 then
7        FP_{i,c,n} = 0; /*Equation (5b)*/
8  for i = 1 to FileList.size do
9    AVGPOP_i = (Σ_{c=1..NOC} FP_{i,c,n}) / NOC; /*Equation (6)*/
10 for c = 1 to CList.size do
11   Sort the values in CList(c).S in decreasing order based on the average popularities (AVGPOP);
12   for x = 1 to (int)(CList(c).NOFRAN × (1 − X)) do
13     Select the CNHS for CList(c).SFileList(x);
14     if CNHS.hasit(CList(c).SFileList(x)) = false then
15       if CNHSFSS ≥ CList(c).SFileList(x).Size then
16         Replicate CList(c).SFileList(x) to CNHS from the closest cluster node having it;
17       else
18         Initialize SOS to 0;
19         for y = 1 to CList(c).PopList.size do
20           if SOS + CNHSFSS < CList(c).SFileList(x).Size then
21             SOS = SOS + SizeList(y);
22           else
23             Break;
24         Popularity_GoFs = Σ_{z=1..y−1} CList(c).PopList(z); /*The sum of popularities of the group of files that need to be deleted in order to store the new file*/
25         Popularity_SFileList(x) = CList(c).PopList(index of CList(c).SFileList(x) in PopList); /*The popularity of the new file*/
26         if Popularity_GoFs < Popularity_SFileList(x) then
27           for z = 1 to y − 1 do
28             Delete PopList(z), SizeList(z);
29           Replicate CList(c).SFileList(x) to CNHS from the closest cluster node having it;
Pseudocode 1 shows the algorithm of this strategy at the end of round n. For definitions of variables used in the pseudocode, refer to Table 1.
5 COMPARISON METRICS
In the current work, there are n clusters C_1, C_2, ..., C_n in the Data Grid, m cluster nodes CN_1, CN_2, ..., CN_m in each cluster, and one Master site. Three metrics are used to measure the performance of the strategies: average file delay per request, average file bandwidth consumption per request, and percentage of files found. The first two metrics need to be minimized, while the third needs to be maximized. The average file delay per request is the sum of the delays resulting from transmitting files from the sending nodes to the receiving nodes, divided by the total number of requests in the simulation; minimizing it means that the nodes get the files they need in less time. If a node already has the file it needs, the delay is considered zero. The average file bandwidth consumption per request is the sum of the bandwidth consumed by file transfers between nodes, divided by the total number of requests in the simulation; the bandwidth consumed by file transfers must be reduced because bandwidth is limited and to decrease the possibility of congestion. The percentage of files found is the number of requested files that were found on the requesters' storages (locally), divided by the total number of requests in the simulation; this metric indicates how accurately a strategy can predict user behavior. Table 2 shows the formulas of the metrics used to measure the performance of the strategies.
6 SIMULATION CONFIGURATION
In this paper, an event-driven simulator written in Java was used to evaluate the PFRF strategy, the four sub-strategies that compose the IPFRF strategy, and the IPFRF strategy itself. The simulation was verified by ensuring that the output obtained for a small environment (with a small number of nodes, clusters, and files) matches the results obtained by hand calculation. A performance comparison between the strategies is made under two scenarios. In the first scenario, the uniform distribution is used to determine which files are requested by the cluster nodes.
TABLE 2
Metrics of Strategies

AFDPR = SOD / TNOR
AFBCPR = SOBC / TNOR
POFF = (NOFFL / TNOR) × 100

AFDPR = average file delay per request, SOD = sum of delays in the simulation, and TNOR = total number of requests during the simulation. AFBCPR = average file bandwidth consumption per request, and SOBC = sum of bandwidth consumptions in the simulation. POFF = percentage of files found, and NOFFL = number of files found locally in the simulation.
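As an illustration, the three metrics reduce to the following one-line computations (a Java sketch with hypothetical names):

```java
class Metrics {
    static double afdpr(double sumOfDelays, long totalRequests) {
        return sumOfDelays / totalRequests;               // average file delay per request
    }

    static double afbcpr(double sumOfBandwidth, long totalRequests) {
        return sumOfBandwidth / totalRequests;            // average bandwidth consumption per request
    }

    static double poff(long filesFoundLocally, long totalRequests) {
        return 100.0 * filesFoundLocally / totalRequests; // percentage of files found locally
    }
}
```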
In this scenario, the probability of requesting any of the files is the same. In the second scenario, the Zipf distribution is used to determine which files are requested by the cluster nodes. This distribution represents the case where the probability of requesting some files is higher than the probability of requesting the rest; it has been found that the popularity of requested files follows the Zipf distribution [19]. In each of the two scenarios, the comparison is made under four different round lengths (900,000, 450,000, 225,000, and 112,500).

The authors of PFRF did not mention which node in the cluster is selected to store the candidate file for replication. Therefore, we assume that when a file needs to be replicated to a cluster, the first node in the cluster is checked for enough free storage space; if there is not enough, the second node is checked, and so on. If none of the nodes in the cluster has enough free storage space, one of them is selected randomly, and a number of less popular files on the selected node are deleted to provide space for the new file. In the IPFRF strategy, on the other hand, the cluster node selected to store the new file is the node with the highest suitability (refer to equation (4)).

In each simulation, all the nodes are set to employ one of the strategies under a set of parameters. For each strategy, the simulation was run eight times: four for each of the two scenarios, since in each scenario the comparison is made under four different round lengths. Since six strategies are evaluated, the total number of simulation runs was 48 (8 × 6).

We assume in our simulations that the disk space is limited relative to the size needed to store all the files in the Data Grid. The storage space of each cluster node is 50,000 megabits and the number of cluster nodes in each cluster is 10, so the storage space of each cluster is 50,000 × 10 = 500,000 megabits. However, there are 200 different files in the Data Grid, the size of each file is between 1,000 and 20,000 megabits, and since the uniform distribution is used for the file sizes, the average file size is (1,000 + 20,000) / 2 = 10,500 megabits. Thus, the size needed to store the 200 different files is 200 × 10,500 = 2,100,000 megabits, which is much higher than the 500,000 megabits of storage space of a cluster.

In all simulations, the nodes within the same cluster are distributed in a 10,000 × 10,000 m² region. Moreover, the difference between the x-coordinates of the lower right corner and the upper left corner of two successive clusters is 100,000, and the same holds for the y-coordinates. The coordinates of the points that represent the locations of the cluster nodes are selected randomly, taking into consideration that no two points have the same coordinates. The Master site is located at (0, 0). At the beginning of the simulation, each cluster node holds five files or fewer, depending on the file sizes, so that their total size does not exceed 50,000 megabits. The inter-arrival times of the nodes' requests are equal to 3.

When the Zipf distribution is employed, there is a change in user behavior every 10 rounds. This is achieved using the following equation:

RFN = (GRFN + (i \times 4)) \bmod NAV, \quad (7)

where RFN is the requested file number, GRFN is the requested file number produced by the generator, i is a variable with an initial value of zero that is increased by 1 every 10 rounds, and NAV is the number of allowed values for the requested file number. For example, if GRFN is 40 and NAV is 200 (0 to 199), then after 600 rounds the value of i will be 60 and the value of RFN will be:

RFN = (40 + (60 \times 4)) \bmod 200 = 80. \quad (8)
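A direct Java transcription of equation (7), reproducing the worked example above, might be:

```java
class BehaviorShift {
    /** Equation (7): shift the requested file number every 10 rounds. */
    static int requestedFileNumber(int grfn, int i, int nav) {
        return (grfn + i * 4) % nav; // i grows by 1 every 10 rounds
    }

    public static void main(String[] args) {
        // Example from the text: GRFN = 40, NAV = 200, i = 60 after 600 rounds.
        System.out.println(requestedFileNumber(40, 60, 200)); // prints 80
    }
}
```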
There is no change in user behavior when the uniform distribution is used: since the probability of requesting any of the files is the same, there is no point in changing the user behavior.

The delay D is calculated as follows:

D = TT + PT = (NoIN_{out} + 1) \times \frac{FS}{TS_{out}} + NoIN_{in} \times \frac{FS}{TS_{in}} + \frac{Dist}{PS} = FS \times \left( \frac{NoIN_{out} + 1}{TS_{out}} + \frac{NoIN_{in}}{TS_{in}} \right) + \frac{Dist}{PS}, \quad (9)

where TT is the transmission time, PT is the propagation time, NoIN_{out} is the number of intermediate nodes (e.g., routers, switches) between the sending and receiving nodes that are outside their clusters, FS is the file size, TS_{out} is the transmission speed outside the clusters, NoIN_{in} is the number of intermediate nodes between the sending and receiving nodes that are inside their clusters, TS_{in} is the transmission speed inside the clusters, Dist is the distance between the sending and receiving nodes, and PS is the propagation speed. The bandwidth consumption BC, on the other hand, is calculated as follows:

BC = (NoIN + 1) \times FS, \quad (10)

where NoIN is the number of intermediate nodes between the sending and receiving nodes, and FS is the file size. Table 3 shows the simulation parameters and their values.
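Equations (9) and (10) can be transcribed as follows (a Java sketch; the parameter names are ours):

```java
class CostModel {
    /** Equation (9): transmission time plus propagation time. */
    static double delay(double fileSize, int noInOut, int noInIn,
                        double tsOut, double tsIn, double dist, double ps) {
        return fileSize * ((noInOut + 1) / tsOut + (double) noInIn / tsIn) // transmission
             + dist / ps;                                                  // propagation
    }

    /** Equation (10): bandwidth consumed by one transfer across NoIN intermediate nodes. */
    static double bandwidth(double fileSize, int noIn) {
        return (noIn + 1) * fileSize;
    }
}
```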
TABLE 3
Simulation Parameters

Number of nodes: 100, excluding the Master site
Number of clusters: 10
Number of nodes within the same cluster: 10
Number of different files: 200
Size of each file: Between 1,000 and 20,000 megabits
Number of generated requests: 30,000,000
Inter-arrival times of cluster node requests: 3
Simulation time: 90,000,000
Storage space for every cluster node: 50,000 megabits
Storage space for Master site: Large enough to accommodate all the files in the Data Grid
Number of intermediate nodes between two nodes in the same cluster: 1
Number of intermediate nodes between two successive clusters: 3
Transmission speed outside clusters: 100 Mbps
Transmission speed inside clusters: 1 Gbps
Propagation speed: 2 × 10^5 m/s
A: 1/3
B: 1/3
C: 1/3
7 SIMULATION RESULTS AND DISCUSSION
In this section, PFRF, the four sub-strategies that compose IPFRF, and IPFRF are compared. Tables 4, 5, and 6 show that the IPFRF strategy is the superior one in both scenarios. It can also be seen that all sub-strategies have similar or better performance than the PFRF strategy in these two scenarios. When the sub-strategies are compared to each other, none of them has the best performance at all round lengths. In general, all strategies achieve better performance under scenario2 (Zipf), especially when the round length is large. This is because some files are requested more frequently than the rest, so the probability of finding the requested files on the requesters' storages is relatively high compared with the probability in scenario1 (Uniform).
It can also be observed that the difference between the performance of the IPFRF and PFRF strategies is larger in scenario2 (Zipf), which means that the IPFRF strategy can adapt to changes in user behavior better than the PFRF strategy can.

Finally, it can be concluded that decreasing the round length has little or no influence on the performance of the strategies in scenario1 (Uniform). The reason is that there is no change in user behavior in this scenario, so the round length hardly affects the performance of the strategies. In scenario2 (Zipf), however, decreasing the round length usually has a negative effect on the performance of the strategies. This is because there is a change in user behavior every 10 rounds, as mentioned in the previous section, and decreasing the round length increases the number of rounds, since the simulation time of each run is fixed to 90,000,000; as a result, the user changes its behavior more frequently. This more frequent change in user behavior means that the popular files of each node also change more frequently, which has a negative effect on the performance of the employed strategy: the strategy must keep deleting files that are no longer popular and copying the newly popular files in their places each time the user behavior changes.

Tables 7, 8, and 9 show the percentage decrease in average file delay per request, the percentage decrease in average file bandwidth consumption per request, and the percentage increase in percentage of files found, respectively, achieved by the IPFRF strategy compared with the PFRF strategy.
TABLE 4
Average File Delay per Request Achieved by PFRF, the Four Sub-Strategies, and IPFRF under the Two Scenarios

Scenario1 (Uniform)
Round length    900,000    450,000    225,000    112,500
PFRF            1190.93    1184.82    1208.13    1202.95
Sub-strategy1   1142.94    1160.41    1160.75    1137.96
Sub-strategy2   1042.09    1124.70    1126.62    1061.02
Sub-strategy3   1083.07    1028.22    1055.29    1062.01
Sub-strategy4   1110.36    1143.40    1157.28    1155.52
IPFRF           979.52     989.51     989.01     969.84

Scenario2 (Zipf)
Round length    900,000    450,000    225,000    112,500
PFRF            951.34     968.63     1071.10    1103.17
Sub-strategy1   928.19     935.77     1035.46    1078.77
Sub-strategy2   897.48     826.68     869.21     944.03
Sub-strategy3   500.58     850.66     1042.41    1135.89
Sub-strategy4   909.71     952.67     1051.88    1099.19
IPFRF           373.52     398.83     437.58     494.81

TABLE 5
Average File Bandwidth Consumption per Request Achieved by PFRF, the Four Sub-Strategies, and IPFRF under the Two Scenarios

Scenario1 (Uniform)
Round length    900,000      450,000      225,000      112,500
PFRF            127821.58    127208.45    129533.24    129014.05
Sub-strategy1   123037.99    124776.79    124808.12    122531.00
Sub-strategy2   112996.04    121244.89    121442.24    114887.72
Sub-strategy3   117058.02    111585.39    114290.49    114955.62
Sub-strategy4   119799.63    123100.65    124482.27    124308.05
IPFRF           106758.60    107758.40    107711.17    105793.12

Scenario2 (Zipf)
Round length    900,000      450,000      225,000      112,500
PFRF            103803.15    105475.55    115815.11    119133.70
Sub-strategy1   101479.31    102202.85    112253.92    116691.59
Sub-strategy2   98399.92     91195.50     95470.62     103062.48
Sub-strategy3   58596.12     93667.85     112962.09    122416.46
Sub-strategy4   99641.79     103890.14    113904.27    118733.53
IPFRF           45835.43     48278.85     52178.22     58007.99
TABLE 6
Percentage of Files Found Achieved by PFRF, the Four Sub-Strategies, and IPFRF under the Two Scenarios

Scenario1 (Uniform)
Round length    900,000    450,000    225,000    112,500
PFRF            2.65%      2.62%      2.57%      2.56%
Sub-strategy1   2.71%      2.71%      2.66%      2.58%
Sub-strategy2   3.62%      3.70%      3.74%      3.65%
Sub-strategy3   2.58%      2.60%      2.73%      2.49%
Sub-strategy4   3.25%      3.36%      3.37%      3.32%
IPFRF           3.61%      3.79%      3.77%      3.53%

Scenario2 (Zipf)
Round length    900,000    450,000    225,000    112,500
PFRF            5.06%      3.87%      2.51%      2.19%
Sub-strategy1   5.13%      4.09%      2.64%      2.27%
Sub-strategy2   5.47%      5.71%      5.38%      5.15%
Sub-strategy3   7.04%      4.05%      2.42%      1.90%
Sub-strategy4   5.14%      4.08%      2.53%      2.28%
IPFRF           8.09%      7.65%      7.34%      6.96%
TABLE 7
Percentage Decrease in Average File Delay per Request Achieved by IPFRF Compared with PFRF

Scenario1 (Uniform)
Round length    900,000    450,000    225,000    112,500
                17.75%     16.48%     18.14%     19.38%

Scenario2 (Zipf)
Round length    900,000    450,000    225,000    112,500
                60.74%     58.83%     59.15%     55.15%

TABLE 8
Percentage Decrease in Average File Bandwidth Consumption per Request Achieved by IPFRF Compared with PFRF

Scenario1 (Uniform)
Round length    900,000    450,000    225,000    112,500
                16.48%     15.29%     16.85%     18.00%

Scenario2 (Zipf)
Round length    900,000    450,000    225,000    112,500
                55.84%     54.23%     54.95%     51.31%

TABLE 9
Percentage Increase in Percentage of Files Found Achieved by IPFRF Compared with PFRF

Scenario1 (Uniform)
Round length    900,000    450,000    225,000    112,500
                36.23%     44.66%     46.69%     37.89%

Scenario2 (Zipf)
Round length    900,000    450,000    225,000    112,500
                59.88%     97.67%     192.43%    217.81%
8 CONCLUSION

In this paper, a round-based data replication strategy called IPFRF has been proposed and implemented. IPFRF is based on PFRF but overcomes its shortcomings, and it is superior to PFRF in terms of average file delay per request, average file bandwidth consumption per request, and percentage of files found. The IPFRF strategy achieved a reduction in average file delay per request of up to 19.38 and 60.74 percent in scenarios 1 and 2, respectively, while it achieved a reduction in average file bandwidth consumption per request of up to 18.00 and 55.84 percent in the same scenarios. Additionally, the IPFRF strategy achieved an improvement in percentage of files found of up to 46.69 and 217.81 percent in scenarios 1 and 2, respectively. In future work, we plan to test the proposed strategy on a real Data Grid. Furthermore, we plan to consider a new scenario where nodes can enter and leave the Data Grid.

ACKNOWLEDGMENTS
Mohammad Bsoul is the corresponding author of the article.
REFERENCES
[1] S. Figueira and T. Trieu, Data Replication and the Storage Capacity of Data Grids. Berlin, Germany: Springer-Verlag, 2008, pp. 567-575.
[2] D. G. Cameron, A. P. Millar, C. Nicholson, R. Carvajal-Schiaffino, K. Stockinger, and F. Zini, "Analysis of scheduling and replica optimisation strategies for data grids using OptorSim," J. Grid Comput., vol. 2, no. 1, pp. 57-69, 2004.
[3] H. Lamehamedi, B. Szymanski, Z. Shentu, and E. Deelman, "Data replication strategies in grid environments," in Proc. 5th Int. Conf. Algorithms Architectures Parallel Process., 2002, pp. 378-383.
[4] M. Bsoul, "A framework for replication in data grid," in Proc. 8th IEEE Int. Conf. Netw. Sens. Control, Delft, The Netherlands, 2011, pp. 234-236.
[5] K. Ranganathan and I. Foster, "Identifying dynamic replication strategies for a high-performance data grid," in Proc. 2nd Int. Workshop Grid Comput., London, United Kingdom: Springer-Verlag, 2001, pp. 75-86.
[6] M. Tang, B. Lee, C. Yeo, and X. Tang, "Dynamic replication algorithms for the multi-tier data grid," Future Generation Comput. Syst., vol. 21, no. 5, pp. 775-790, 2005.
[7] Q. Rasool, J. Li, G. Oreku, and E. Munir, "Fair-share replication in data grid," Inform. Technol. J., vol. 7, no. 5, pp. 776-782, 2008.
[8] R. Chang and H. Chang, "A dynamic data replication strategy using access-weights in data grids," J. Supercomput., vol. 45, no. 3, pp. 277-295, 2008.
[9] S. Park, J. Kim, Y. Ko, and W. Yoon, "Dynamic data grid replication strategy based on internet hierarchy," in Proc. 2nd Int. Workshop Grid Cooperative Comput., 2003, pp. 838-846.
[10] J. Wu, Y. Lin, and P. Liu, "Optimal replica placement in hierarchical data grids with locality assurance," J. Parallel Distrib. Comput., vol. 68, no. 12, pp. 1517-1538, 2008.
[11] M. Bsoul, A. Al-Khasawneh, E. E. Abdallah, and Y. Kilani, "Enhanced fast spread replication strategy for data grid," J. Netw. Comput. Appl., vol. 34, no. 2, pp. 575-580, 2011.
[12] M. Bsoul, A. Al-Khasawneh, Y. Kilani, and I. Obeidat, "A threshold-based dynamic data replication strategy," J. Supercomput., vol. 60, no. 3, pp. 301-310, 2012.
[13] K. Sashi and A. Thanamani, "Dynamic replication in a data grid using a modified BHR region based algorithm," Future Generation Comput. Syst., vol. 27, no. 2, pp. 202-210, 2011.
[14] N. Mansouri and G. Dastghaibyfard, "A dynamic replica management strategy in data grid," J. Netw. Comput. Appl., vol. 35, no. 4, pp. 1297-1303, 2012.
[15] Z. Wang, T. Li, N. Xiong, and Y. Pan, "A novel dynamic network data replication scheme based on historical access record and proactive deletion," J. Supercomput., vol. 62, no. 1, pp. 227-250, 2012.
[16] M. Shorfuzzaman, P. Graham, and R. Eskicioglu, "Adaptive popularity-driven replica placement in hierarchical data grids," J. Supercomput., vol. 51, no. 3, pp. 374-392, 2010.
[17] M. Bsoul, A. Alsarhan, A. F. Otoom, M. Hammad, and A. Al-Khasawneh, "A dynamic replication strategy based on categorization for data grid," Multiagent Grid Syst., vol. 10, no. 2, pp. 109-118, 2014.
[18] M. Lee, F. Leu, and Y. Chen, "PFRF: An adaptive data replication algorithm based on star-topology data grids," Future Generation Comput. Syst., vol. 28, no. 7, pp. 1045-1057, 2012.
[19] L. Adamic and B. Huberman, "Zipf's law and the internet," Glottometrics, vol. 3, pp. 143-150, 2002.

Mohammad Bsoul received the BSc degree in computer science from Jordan University of Science and Technology, Irbid, Jordan, the Master's degree from the University of Western Sydney, Parramatta, NSW, Australia, and the PhD degree from Loughborough University, Loughborough, United Kingdom. He is currently an associate professor in the Department of Computer Science of the Hashemite University, Zarqa, Jordan. His research interests include wireless sensor networks, grid computing, cloud computing, distributed systems, and performance evaluation.

Alaa E. Abdallah received the BSc degree in computer science from Yarmouk University, Irbid, Jordan, in 2000, the MSc degree in computer science from the University of Jordan, Amman, Jordan, in 2003, and the PhD degree in computer science from Concordia University, Montreal, QC, Canada, in 2008. He has been an assistant professor in the Department of Computer Science of the Hashemite University, Zarqa, Jordan, since 2011. Prior to joining the Hashemite University, he was a network researcher at a private consulting company in Montreal from 2008 to 2011. His research interests include routing protocols for ad hoc networks, parallel and distributed systems, and multimedia security.
Khaled Almakadmeh received the doctorate degree in software engineering from the University of Quebec, Quebec, QC, Canada. He is currently an assistant professor in the Department of Software Engineering at the Hashemite University, Zarqa, Jordan. His research interests include software requirements, software effort estimation, and software quality.
Nedal Tahat received the BSc degree in mathematics from Yarmouk University, Irbid, Jordan, the MSc degree in mathematics from Al al-Bayt University, Mafraq, Jordan, and the PhD degree from Universiti Kebangsaan Malaysia, Malaysia. He is currently an assistant professor in the Department of Mathematics of the Hashemite University, Zarqa, Jordan. His research interests include cryptography, with a focus on both classical and function-based digital signatures. He is currently working on developing function-based signature systems using hybrid mode problems.