2012 International Conference on Information Technology and e-Services
Dynamic Horizontal Fragmentation, Replication and Allocation Model In DDBSs Hassan I. Abdalla Ali A. Amer
[email protected] [email protected] Department of Computer Science, College of Computer & Information Sciences, King Saud University Abstract - Distributed processing is an effective way to improve performance of database systems. Therefore, fragmentation and proper allocation of fragments across various sites of the network is considered as a key research area in distributed database environment. However, the issue of allocating fragments to the most appropriate sites is not an easy task to perform. In this paper, a synchronized horizontal fragmentation, replication and allocation model in the context of relational database is proposed. A heuristic technique to satisfy horizontal fragmentation and allocation using a cost model to minimize the total cost of distribution is developed. Experimental results are consistent with the hypothesis confirming that the proposed model can efficiently solve dynamic fragmentation and allocation problem in a distributed relational database systems.
Keywords: Distributed DBMS, Fragmentation, Replication, Allocation. I.
INTRODUCTION
Fragmentation and data allocation are the most important techniques in DDBSs design. However, they are often treated separately and rarely together. Both techniques have the same inputs, the key difference between them being the fact that fragmentation starts from global relationships without data allocation being considered, while the data allocation takes into account fragmentation results. Actually, four important issues should be taken into account in distributed database design. First, how global relation should be fragmented. Second, how many copies of a fragment should be replicated. Third, which technique will be used to allocate fragments to the different sites of the network and what information for fragmentation and allocation need to be known. To simplify the overall problem, in this paper we address horizontal fragmentation and fragments allocation issues only. While, the vertical fragmentation (VF) allows relation R to be partitioned into disjoint sets of columns or attributes except the primary key. Horizontal fragmentation (HF) allows relation R to be partitioned into disjoint tuples or instances. Previous works in the area HF or VF have one problem in common: that most of them concentrated only on fragmentation problem and
978-1-4673-1166-3/12/$31.00 ©2012 IEEE
overlooked allocation problem to reduce complexity. The complexity of horizontal fragmentation is O((K2) m) where m is the number of simple predicates and K is the network sites whereas the complexity of vertical fragmentation is O((Km)m) where k is the network sites and m is the number of non-primary key attributes. In the context of relational data model, the existing techniques for HF fall into following two classes: Min-termpredicates techniques [21] and affinity based techniques [6]. After fragmenting the global scheme of the database, the next step is the fragment allocation. There are four data allocation strategies applicable in a distributed relational database: these are centralized, fragmentation (partition), full replication, and partial replication (selective) [7]. Our technique adopts the fragmentation strategy and sometimes partial replication. However, allocation is the process of assigning the database fragments to DDBSs network sites. When data is allocated, it might either be replicated or maintained as a single copy. So, fragment allocation can be either non-redundant or redundant. Under a non-redundant allocation scheme, exactly one copy of each fragment will exist across all the sites, while under redundant allocation schema, more than one copy of each fragment will exist across all the sites [17]. A. RELATED WORK Usually fragmentation is performed without having allocation in mind. While allocation, on the other hand, is always dependant on fragmentation and is performed on the assumption that fragmentation is obtained a priori. In [18], a method was proposed to use min-term predicate to fragment relations. It did that to perform primary horizontal fragmentation, a set of disjoint and completed predicates P={P1,…, Pn} should be determined. In [21], the iterative algorithm called Com-Min was presented to generate a complete and minimal set of predicate Pr
modified according to our requirement of HF and named it Attribute Retrieval and Update Matrix (ARUM). We address horizontal fragmentation and allocation simultaneously in the context of the relational model. So, a heuristic technique to satisfy a synchronized horizontal fragmentation and allocation using cost model is a key contribution of this paper. However, based on the rule of thumb, if fragment is queried more frequently than it is modified, then replication is advisable as it is already done in this work. But, this replication benefit comes with the additional cost of keeping all copies identical. This cost, which could potentially be high, consists of total storage cost, cost of local processing, and communication cost [9].
from a given set of simple predicates. In [24], affinity-based vertical fragmentation method was adapted to horizontal fragmentation. [14] Proposed a graph-based algorithm for horizontal fragmentation such that the predicates clustering based on the predicates affinities. However, [15] presented a horizontal transaction-based partition algorithm which used predicates usage matrix as input, and [13] used predicate affinity for HF in data warehouse. Several researches, have already been submitted the data allocation technique, are restricted in their theoretical and implementation parts and other techniques are ignoring the optimization of the transaction response time. DDBSs design aims to enhance the applications performance by minimizing the amount of irrelevant data accessed by the applications [5], and by minimizing the amount of data transferred in processing the applications [8]. A mathematical modeling for data allocation in DDBSs is proposed using greedy approach [1]. It considered a network communication, a local processing and data storage cost to allocate fragment site by site. This method supports the best location of data fragmentation in DDBSs based on fragment access pattern and cost of moving fragments from one site to other with site constraints maintaining as site capacity (C) and the fragment limit (FL). In [16], formal model of data allocation had been constructed and had derived an algorithm to fragment and allocate the relations.
In this paper, a novel approach is proposed to perform fragmentation horizontally based on sites update and retrieval query frequencies. A properly gathered information about queries will enable the database designer to establish fragmentations that allows an efficient and synchronized (redundant and non-redundant) allocation. The rest of this paper is organized as follows: In section II, we will introduce our proposed model including the fragmentation model, the allocation model and cost functions. In section III the implementation results that justify the developed technique are given. The conclusions we draw from this work is presented in section IV, Finally, future work possibilities are discussed in section V.
Some authors have considered several aspects of the allocation problem into account in different contexts. [12] Combined security considerations into the fragment allocation process. [11] Proposed an algorithm for allocation and replication that adapts to the changing patterns of online requests. Grouping partitioning into an automatic design framework is considered in [3], and [4], considered incremental allocation and reallocation based on changes in workload. In [10], data allocation algorithms presented to achieve the minimum overall communication cost. SAGA approach is proposed in [2] where the optimal places are determined for data fragments individually. Data access pattern and transmission cost were used in this approach to help for specifying that places.
II. THE PROPOSED MODEL A. Problem Description and Notations In this work a new synchronized optimal fragmentation and allocation method have been proposed to perform horizontal fragmentation (HF) and allocation using a cost model. The HF divides the intended relation according to the Attribute Retrieval and Update Frequency (ARUF) matrix (table 6). ARUF is a table constructed by calculating a given retrieval and update frequencies for attribute predicates individually. The ARUF calculation is performed using ARUM (table 3), distance cost matrix (table 4), and cost function. An ARUM matrix is a table constructed by placing predicates of a relation attributes as columns and site queries as rows. Our system development phases are shown in Figure 1.
In [19], CRUD matrix is proposed. Its rows indicate relations attributes and columns indicate several applications locations. DDBSs analysts and designers used CRUD matrix in the requirement analysis phase of system development life cycle for making decision of data mapping to different locations [19, 23]. In this paper, CRUD has been
Relation
ARUM
Allocation Phase
2
ARUF
CA
Fragmentation Phase
FIGURE1: The Proposed System Phases
obtain the distance cost matrix which in its turn is used to calculate the average retrieval cost and the average update cost for candidate attribute predicates individually. Using the results of each predicate Pi, the replication decision for the related fragments will be made.
The proposed model will be performed in two states: in the first state site constraints will be relaxed, while in the second state site constraints will be maintained. B. The Heuristic Model This work introduces a simple cost model that collects retrieval and update information at the sites where queries execute. Then it evaluates the cost of fragmenting the intended relation/fragment and the cost of allocating the resulted fragments across different network sites. Finally, the heuristic model chooses the site to store the candidate fragment by selecting the site that sustains the highest query cost for the fragment. The fragmentation and allocation process is performed in a synchronized fashion. In summary, the model adopts the maximum update and retrieval cost heuristic to find the optimal fragmentation and allocation scenarios. C.
Since all queries accessing fragments are not expected to be for retrieval and update purpose, therefore, the query purpose has to be clearly identified. So, let RF(A) be the retrieval frequency and UF(A) be the update frequency for the attributes of relation R. Both RF(A) and UF(A) are assumed to be obtained by queries accessing fragments at each site. Based on the requirements above, the horizontal fragmentation will be performed following a particular mechanism using the predicates Pi as a “selection” predicate. Recall that ARUM has been built. Using ARUM, Sum of Attributes Frequency Table (SAFT) will be computed (table 5). Then, multiplying SAFT with DM (obtained by minimizing communication cost matrix [23]), ARUF matrix will be acquired as shown in (table 6). Using ARUF, the attribute which has the maximum value (V), Max(ARUF(i)), will be selected and used as a Candidate Attribute (CA) for partitioning a relation. Then, the algorithm will perform the following;
The Proposed Fragmentation Model
The proposed fragmentation model is based on a set of simple predicates Pr [P1,…, Pn] where n is the number of relation attributes. These predicates have been assigned to the relation attributes A[A1,….,An]. Attributes with a numeric value will comprise a predicate Pr that have one of three states: (Pri > V), (Pri < V) or (Pri = V). However, attributes with alphabetical value will comprise a predicate Pr that have only one state: (Pri = V).
First, calculates the Average Retrieval Cost (ARC) and the Average Update Cost (AUC) for candidate attribute predicates individually. Second, for each attribute predicate it checks if ARC > AUC. If this is the case then replication of the concerned fragments will be performed over the sites that require retrieval, update or both. However, if ARC < AUC, then the concerned fragments will be allocated to the site having the maximum value for update cost. In either case, site constraints have to be maintained.
For each predicate Pri there are Pri.RFi value and Pri.UFi value representing the predicate retrieval and update frequency values respectively for the corresponding attribute issued by the site queries Q[Q1,…,Qk] executed under a site query strategy where k is the number of site queries that are assumed to be the most frequently used queries in DDBS accounting for say more than 75% of the processing in DDBS. Each query Qk can be executed from any site with a certain frequency (FREQ) to access the intended fragment. The execution frequencies of k queries at m sites can be represented by m × k matrix, QFkm. The attribute predicate “Pr” is represented as follows;
D. The Proposed Allocation Model Assuming that there is a set of n fragments F = {F1, F2 ,..., Fn}exploited by a set of k queries Q = {Q1, Q2, ..., Qk} on a set of m sites S={S1, S2, …., Sm} in a fully connected network, our allocation model concentrates on finding the optimal distribution of F on S. The allocation problem may be expressed mathematically by a function from the set of fragments to the set of sites : F . S . An optimal allocation is the one which minimizes the cost function. The allocation technique must also take into account the imposed constraints such as site capacity.
Ai.value V, RFi.value1, UFi.value1 Where ∈ [, =], ∀ value1, 0 < value1< 100 and, V ∈ attribute domain (Ai.D), ∀ i, 1 AUC, then this fragment will be allocated to all sites requesting it (table 8). TABLE8: Second Fragment Name Jone Matth Adam Kevin
Birth-date 2/6/1984 2/2/1982 4/2/1980 12/12/1979
Job-Id MAN MAN MGR MAN
Salary 1400 1150 1300 1200
Location Jeddah Abha Abha Jeddah
Dept-Id 3 4 2 1
P3: ARC > AUC, then this fragment will be allocated to all sites requesting it (table 9). TABLE9: Third Fragment Name Jone Zamel
Birth-date 22/9/1990 22/9/1977
Job-Id MAN MAN
Salary 1500 1500
Location Riyadh Makah
Dept-Id 3 1
But regarding Site 2 there is constraints violation in site fragment limit so it could not replicate this fragment in Site 2. Table 10 shows the final fragments allocation for the Employee relation.
5
International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA'08), 2008.
TABLE10: Final Fragments Allocation Fragment/Site S1 S2 S3 S4 F1 0 0 0 1 F2 1 1 1 1 F3 1 0 1 1
1 Fi = ® ¯0
[2] H. Abdalla, et al.," An Efficient Approach for Data Placement in Distributed Systems", 2011 Fifth FTRA International Conference on Multimedia and Ubiquitous Engineering [3] S. Agrawal, V. Narasayya, and B. Yang, “Integrating Vertical and Horizontal Partitioning into Automated Physical Database Design,”Proc. ACM SIGMOD, 2004. [4] A. Chin, “Incremental Data Allocation and Reallocation in Distributed Database Systems,” Data Warehousing and Web Eng.,S. Becker, ed, 2002. [5] Ezeife, C.I. and Ken Barker. “Distributed Object Based Design: Vertical Fragmentation of classes”.International Journal of Distributed and Parallel Databases, Vol. 6, No. 4, Kluwer Academic Publishers, October 1998.
if fragement i is allocated to site j otherwise
IV. CONSLUSIONS In distributed database systems, relations are frequently partitioned, replicated and allocated either redundantly or non-redundantly over DDBS sites in order to reduce network communication costs. The issues of how to fragment relations, when to replicate resulted fragments and how to allocate fragments to network sites are challenging problems that have been previously tackled either by static fragmentation and allocation algorithms, or based on the analysis of a priori known queries.
[6] Bellatreche,L.,Karlapalem,K.,and Simonet, M. vertical fragmentation in distributed object database system with complex attributes and methods. (DEXA)(1996). [7] Nicoleta Iacob, “Data Replication In Distributed Environments”, Annals of the Constantin Brâncui” University of Târgu Jiu, Economy Series, Issue 4/2010. [8] Kamalaakar Karlapalem, Shamkant Navathe, and Magdi Morsi “Issues in Distribution Design of Object Oriented Databases”. Distribut ed Object Management. Morgan Kaufmann Publishers 1994.
However, a proper synchronized fragmentation and allocation of a distributed database has not been addressed extensively. In this paper we have proposed a horizontal fragmentation technique to concurrently divide relations of a distributed database by incorporating a prior knowledge of query frequencies and data access pattern information. Communication costs between sites and site constraints are considered. Using our fragmentation and allocation model no additional complexity is added to fragments allocation to the different sites of a distributed database system because of the synchronization. Thus, significantly improving the DDBSs performance by avoiding frequent remote accesses and high data transfer cost among the sites.
[9] Saeed K. Rahimi and Frank S. Haug, Distributed Database Managment Systems. A John Wiley & Sons, Inc., Publication, IEEE Computer Society, 2010. [10] X. Lin, M. Orlowska, and Y. Zhang, “On data allocation with the minimum overall communication costs in distributed database design,” in Proceedings of the 5th International Conference on Computing and Information, 1993. [11] W. Lin and B. Veeravalli, “An Adaptive Object Allocation and Replication Algorithm in Distributed Databases,” Proc. 23rd ICDCSW ’03, 2003. [12] A. Mei, L. Mancini, and S. Jajodia, “Secure Dynamic Fragment and Replica Allocation in Large-Scale Distributed File Systems,” IEEE Trans. Parallel and Distributed Systems, vol. 14, no. 9, Sept 2003. [13] H. Mahboubi and J. Darmont, “Enhancing XML Data Warehouse Query Performance by Fragmentation,” in Proc.ACM SAC09, 2009.
V. FUTURE WORK
[14] Ra, M. Horizontal partitioning for distributed database design. In advances in Database Research (1993),M. orlowska and M. Papazog lou, Eds., World Scientific Publishing.
Our study did not cover vertical partitioning which plays an important role in optimizing queries and maintenance overhead, therefore, we plan to extend our work to include vertical partitioning algorithms that help in obtaining better solutions to the DDBS performance problem.
[15] Khalil,N.,Eid,D.,and Khair,M. Availability and reliability Issues In Database and Expert system Applications (DEXA09). [16] Daudpota, NH, Five steps to construct a model of data allocation for distributed database systems. Journal of Intelligent Information Systems: Integrating Artificial Intelligence and DatabaseTechnologies. 1998.
VI. ACKNOWLEDGMENT The authors would like to thank and appreciate the help they received from the Research Centre of the College of Computer and information Sciences at King Saud University in providing the necessary facilities to accomplish this work.
[17] J. O. Hauglid , N. H. Ryeng, DYFRAM: dynamic fragmentation and replica management in distributed database systems, Distrib Parallel Databases (2010) 28: 157–185 , 2010. [18] Ceri,S., Negri,M.,and Pelagatti, Distributed Database Principles and System. McgraW-Hill, New York, 1984. [19] P.Surmsuk, “The integrated strategic information system planning Methodology,” IEEE Computer Society Press, 2007.
VII. REFERENCES
[20] T.Sto¨ hr, H. Ma¨rtens, and R. Rahm, “Multi-Dimensional Database Allocation for Parallel Data Warehouses,” Proc. 26th Int’l Conf.Very Large Data Bases, pp. 273-284, 2000.
[1] H. Abdalla, et al., “Using a Greedy-Based Approach for Solving Data Allocation Problem in a Distributed Environment”,
6
[21] M. T. ¨Ozsu and P. Valduriez, Principles of Distributed Database Systems. New Jersey: Alan Apt, 1999. [22] Toadere, Teodor . (2002). Graphs. Theory, Algorithms and Applications,Editura Albastra, Cluj-Napoca, Romania). [23] J. Whitten, L. Bentley, and K. Dittman, Systems Analysis and Design Methods, 6th Ed, McGraw-Hill, 2004. [24] Zhang,Y. on horizontal fragmentation of distributed database design. In advances in Database Research1993, M. orlowska and M. Papazog lou, Eds., World Scientific Publishing.
7