A Declarative Approach to Optimize Bulk Loading into Databases

SIHEM AMER-YAHIA, AT&T Labs–Research, USA
and
SOPHIE CLUET, INRIA, France

Applications such as warehouse maintenance need to load large data volumes regularly. The efficiency of loading depends on the resources that are available at the source and at the target systems. Our work aims to understand the performance criteria involved in bulk loading data into a database and to devise tailored optimization strategies. Unlike commercial systems and previous research on the same topic, our approach follows the fundamental database principle of physical-logical independence. A loading program is represented as a sequence of algebraic expressions. This abstraction enables the use of appropriate algebraic rewritings to optimize a loading program, and of a cost model that takes into consideration efficiency criteria such as the processing times at the source and target systems and the bandwidth between them. A slow loading program may be preferable if it does not slow down other applications by consuming too much memory. Thus, we view the problem of optimizing a loading program as finding a compromise between several efficiency criteria. The ability to represent loading programs in an algebra and performance criteria in a cost model has two very desirable properties: reusability and efficiency. Database programmers do not have to write loading programs by hand. In addition, tuning loading programs becomes easier since programmers have better control over the performance criteria specified in the cost model. The algebra captures data transformations that would otherwise have been hard-coded in loading programs. Consequently, richer optimizations can be explored. Finally, our optimization techniques are not specific to one particular system: they can be used for loading data from and to any structured store (e.g., relational, structured files). We implemented our ideas in a complete environment for migrating ODBC-compliant databases into the O2 object-oriented database system. This prototype provides a declarative view language to specify loading, an interface to specify directives such as the desired physical database organization and constraints on criteria such as resource and bandwidth consumption, an algebraic optimizer, a code generator and an execution environment to control failures and guarantee incremental loading. Our experiments show that a tailored optimization is necessary when loading large data volumes into a database.

Categories and Subject Descriptors: H.2.7 [Database Management]: Database Administration—Data warehouse and repository

General Terms: Design, Management, Performance, Reliability

Additional Key Words and Phrases: Declarative bulk loading, algebra, side-effects, recovery

Authors' addresses: S. Amer-Yahia, AT&T Labs–Research, 180 Park Ave, Florham Park, NJ, USA; S. Cluet, INRIA, Domaine de Voluceau, Le Chesnay Cedex, France.





1. INTRODUCTION

In many applications, large amounts of data need to be transferred from one system to another in order to create new data. This is the case when building and maintaining data warehouses [Garcia-Molina et al. 1998; Labio et al. 2000], replicating existing data, building mirror Internet sites [Roche and Philippot], and loading genome files into databases [Pearson 1991]. The major difficulty in bulk loading is to optimize the loading process, which may last hours or even days when large amounts of data are involved. According to [Garcia-Molina et al. 1998], Sagent Technology builds data warehouses containing more than 100 gigabytes of data and the Walmart warehouse contains 24 terabytes of data. Another important issue is the ability to resume a load after a failure in order to save time and effort. Finally, a good loading system should organize data on disk so as to optimize further applications. Existing commercial systems partly address these issues but none covers them all. In addition, their loading programs are written by hand and hard-code the choices made for better efficiency. Also, previous research on the topic has been targeted towards specific source and target systems [Oracle; Fishman et al. 1987; Albert et al. 1993; INC. 1993; Ontos Inc. 1996; Wiener and Naughton 1996; Labio et al. 2000]. Furthermore, all bulk loading tools are black boxes that generate programs that cannot be easily customized. In practice, system administrators might want to provide guidelines specifying the desired data organization on disk or constraints on memory capacity at the source and target systems. The optimization process should be able to take these criteria into consideration in order to generate loading programs that are tailored to the capabilities of the source and target systems.

In this paper, we develop a general optimization framework for bulk loading based on a declarative approach that follows the old database principle of logical and physical independence. Users specify their needs in a high-level language and the system takes them into consideration to produce efficient and flexible loading programs. The main contributions of this work are:

—We used the algebra developed in [Cluet and Moerkotte 1994] to express loading programs. This algebra abstracts these programs in order to capture the processing capabilities of the source and the target systems and enable the application of algebraic rewritings to optimize the loading process [Amer-Yahia et al. 1998].
—Since loading programs create new target values, we devised novel algebraic rewritings that take into consideration side-effect operations in order to find the optimal set of loading programs.
—We designed a cost model that captures processing times at the source and target systems, communication bandwidth and memory consumption.
—We developed a search strategy that splits the loading process into several programs in order to better control failures during loading and guarantee an incremental creation of the target database.

The use of an algebra to express loading programs and of a cost model to represent performance criteria creates two very important opportunities. First, database programmers do not have to write loading programs since they are generated by the system. Moreover, programmers can tune performance more easily because they have better control of the cost criteria. Second, the algebra captures rich data transformations that can be exploited during optimization and enable loading data between any two structured stores.


We developed a prototype for the migration of relational data into an object database [Amer-Yahia 1997]. We designed a declarative relational-to-object view language to define class and attribute creation from source relations. This language also provides primitives for the specification of physical data organization on the target system. For instance, one can specify the desired data clustering so as to avoid an expensive reorganization of the target database after its creation. Once loading programs are optimized according to user-given directives and to the available source and target resources, the system generates a complete migration environment. This environment takes care of scheduling these programs and maintaining appropriate checkpoints to resume loading in case of failures.

This paper is organized as follows. Section 2 motivates the problem and describes the architecture of a bulk loading system. It also isolates the module that represents our first contribution: the bulk loading optimization framework. Section 3 contains the algebra that is used to express a loading process, a set of examples that show the desired rewritings and, our second contribution, the rewriting rules. Section 4 presents the system we developed for relational-to-object loading. Section 5 describes our experiments and performance results. Section 6 contains a comparison of our solution with existing loading systems and prototypes. Finally, Section 7 concludes this work with a discussion.

2. MOTIVATION

Our work focuses on efficiency and flexibility when loading data from any structured support (structured files, relational, object-relational or object-oriented database) into a (relational, object-relational or object-oriented) database. In this section, we motivate our approach using an example of loading relational data into an object database. We also present the ideal architecture of a loading system and introduce optimization issues in this context.

2.1 A Loading Example

We are given a relational schema that contains information about employees (relation rEmployee), projects (relation rProject) and a many-to-many relationship between them (relation rEmpPrj).

    rEmployee[emp: integer, lname: string, fname: string, town: string,
              function: string, salary: real]
    rProject[name: string, topic: string, address: string, boss: integer]
    rEmpPrj[project: string, employee: integer]

Suppose that we want to populate an object database containing researchers (employees from the relational database who are researchers) and projects. We first create an object schema with two classes: Researcher and Project. An example schema is given below in an ODMG-like syntax [Cattel et al. 1997]:

    class Researcher
      name: [firstname: string, lastname: string]
      salary: real
      projects: {Project} inverse of Project.staff


    class Project
      name: string
      staff: {Researcher} inverse of Researcher.projects

There are several ways of populating this schema. We present two equivalent programs that create the same object database. We give an informal comparison of these programs in terms of communication time between the source relational system and the target object one, processing time and buffer space required for intermediate results. We also compare them in terms of "robustness" to failures, and in terms of the physical organization of target data on disk.

    Program 1
      Part 1: select researchers from rEmployee
              for each record in result
                create a Researcher object res
      Part 2: select projects from rProject
              for each record in result
                create a Project object proj
      Part 3: join (rEmployee, rProject, rEmpPrj)
              project on rEmpPrj attributes
              for each record in result
                retrieve project proj
                retrieve researcher res
                proj.staff += res
      Part 4: join (rEmployee, rProject, rEmpPrj)
              project on rEmpPrj attributes
              for each record in result
                retrieve researcher res
                retrieve project proj
                res.projects += proj

    Program 2
      outer-join (rEmployee, rProject, rEmpPrj)
      sort by project
      group by project
      for each record in result
        create a Project object proj
        for each employee in the group
          create or retrieve Researcher object res
          res.projects += proj
          researchers += res
        proj.staff = researchers

The first program starts by creating all researchers and projects (Parts 1 and 2). Then, it computes the complex attributes projects and staff between researchers and projects (Parts 3 and 4). The second program performs both actions in one step. Outer joins are used to select all relevant employees and projects regardless of whether or not they are involved in a relationship. Note that in order to compute attributes with complex values, either a += operator is used (for set insertion) or a = operator is used (for set assignment). If = is used, appropriate grouping has to be performed beforehand. The table below shows that neither of these two programs offers the best solution. In terms of communication time, the first program is clearly better (two relations plus the result of a join vs. the result of a large outer-join). However, in terms of relational processing time, the second program probably does better (fewer iterations on source relations). The object processing time measures the number of I/Os performed on the target object database while it is being created. For example, in the first program, the same project objects are accessed at least twice (once to create them and once to compute their complex attributes) whereas, in the second program, each project object is accessed only once: at creation time.


Buffer space is the amount of memory required for intermediate results and is related to the amount of data communicated at a time. Recovery measures the risk of losing time and effort in case of an involuntary interruption of the loading process. The first program decouples object and value creation and thus can be more easily decomposed into smaller transactions. Finally, the "target data organization" parameter captures the way data is likely to be organized on disk at the target system. Data organization on disk has an impact on further access to the target database. For example, if we want researchers to be clustered by project on disk, we would rather use the second program.

    Parameters                    Program 1                Program 2
    communication time            +
    relational processing time                             +
    object processing time                                 +
    buffer space                  +
    recovery                      +
    target data organization      researchers, projects    researchers clustered by project
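To make the comparison concrete, here is a minimal Python sketch of the two strategies. It is illustrative only: the in-memory relations, the class definitions and the helper names are our own stand-ins, not the paper's generated code, and Program 2's outer-join handling of unmatched researchers and projects is elided.

```python
from itertools import groupby

# Hypothetical in-memory stand-ins for the source relations.
rEmployee = [(1, "Doe", "Jane", "Researcher", 100.0),
             (2, "Roe", "Max", "Researcher", 90.0)]
rProject  = [("Verso", "Databases"), ("Rodin", "Databases")]
rEmpPrj   = [("Rodin", 1), ("Verso", 1), ("Verso", 2)]

class Researcher:
    def __init__(self, name, salary):
        self.name, self.salary, self.projects = name, salary, set()

class Project:
    def __init__(self, name):
        self.name, self.staff = name, set()

def program1():
    # Parts 1-2: create all objects first (correspondence tables kept).
    res = {emp: Researcher(ln + fn, sal)
           for emp, ln, fn, fct, sal in rEmployee if fct == "Researcher"}
    prj = {name: Project(name) for name, _ in rProject}
    # Parts 3-4: a second pass over the join result patches the complex
    # attributes, so every project object is accessed at least twice.
    for pname, emp in rEmpPrj:
        prj[pname].staff.add(res[emp])      # proj.staff += res
        res[emp].projects.add(prj[pname])   # res.projects += proj
    return res, prj

def program2():
    # One pass: sort and group the join result by project, create everything.
    res, prj = {}, {}
    emp_info = {e[0]: e for e in rEmployee}
    for pname, grp in groupby(sorted(rEmpPrj), key=lambda r: r[0]):
        proj = Project(pname)               # each project created exactly once
        prj[pname] = proj
        for _, emp in grp:
            _, ln, fn, _, sal = emp_info[emp]
            r = res.setdefault(emp, Researcher(ln + fn, sal))  # create or retrieve
            r.projects.add(proj)
            proj.staff.add(r)               # proj.staff = researchers of the group
    return res, prj
```

In this rendering, Program 1 ships two relations plus a join result and touches each project twice, while Program 2 ships one large (outer-)join result, touches each project once, and creates researchers clustered by project.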

All these performance criteria are very sensitive to the size of the data that is being loaded at a time. The above comparison shows that there is no unique best solution to the loading problem and that the best solution depends on several criteria such as source and target processing capabilities, the resources that are available during loading and the desired target data organization on disk. It is thus essential to provide a mechanism to specify these criteria and take them into consideration to create customized loading programs.

2.2 Architecture and Optimization

Figure 1 shows the ideal architecture of a loading system. Source-target mapping transformations may be specified as a default set of rules or may be given by the user (1). These rules define a target schema (2). Users might also provide the desired data organization on disk and cost directives such as the weight of communication and processing times (1). Once the target schema is created, the system generates a naive set of loading programs to populate the corresponding database (3). These programs are expressed in an algebra, enabling the use of algebraic rewritings to optimize them. In order to find the best set of loading programs, the optimizer uses a search strategy and a cost model that takes into consideration user-given directives (4). Once the optimal set of algebraic expressions is selected, it is used to generate loading programs in the source and target languages (5). The user does not have to write these programs by hand. These programs must be optimal (in terms of the cost model and the data organization directives), parameterizable, in order to have more control on the loading process (e.g., specify that we want to load data for 5 hours, stop and resume later), and incremental, to be able to stop and resume the load at any time. These requirements are enforced in a loading environment that controls program execution and scheduling (6). All of (1), (2), (3), (5) and (6) are specific to the source and target systems that are being used.


Fig. 1. Architecture of a Loading System. (The figure shows the pipeline: the user supplies source-target transformations, physical directives and cost-model weights (1); schema creation derives the target schema from the source schema (2); compilation produces a default algebraic representation (3); optimization produces an optimized algebraic representation (4); code generation emits the parameterized loading programs P1, P2, ..., Pn (5); and a migration control environment runs them against the source system to populate the target database (6).)

One of the contributions of this work is (4), the optimization framework, which is generic enough to serve as a basis for optimizing the process of loading any structured data (such as ASCII files, XML documents, relational data) to any structured store, provided a schema mapping between source and target data is given. The algebra that is used in (4) and the rewriting rules that we developed are based on a data model that captures several kinds of structured data. This will be illustrated in the following sections.
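As a rough sketch of stages (3) through (5), the skeleton below shows how naive algebraic blocks might be driven through a cost-based rewriting loop. All names, the cost formula and the greedy search are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass

@dataclass
class Block:
    operators: list          # one algebraic expression with side-effects

@dataclass
class CostModel:
    # Weights from user directives (1): communication vs. source/target time.
    w_comm: float = 1.0
    w_src: float = 1.0
    w_tgt: float = 1.0

    def cost(self, blocks):
        # Placeholder: a real model estimates processing time at each site,
        # bandwidth between them, and buffer space per block.
        return sum(self.w_comm + self.w_src * len(b.operators) for b in blocks)

def optimize(blocks, rewritings, cost_model, max_steps=100):
    """Stage (4): greedily apply DS-preserving rewritings while cost drops.

    Each rewriting is a callable taking a list of blocks and returning a
    rewritten list, or None when it does not apply.
    """
    best = blocks
    for _ in range(max_steps):
        candidates = [c for c in (rw(best) for rw in rewritings) if c is not None]
        if not candidates:
            break
        challenger = min(candidates, key=cost_model.cost)
        if cost_model.cost(challenger) >= cost_model.cost(best):
            break
        best = challenger
    return best   # handed to code generation, stage (5)
```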

3. BULK LOADING OPTIMIZATION

Algebras are the basis for querying and updating. In order to abstract the optimization from the specificities of the source and the target languages, we express loading programs in an algebra that enables the definition and use of appropriate algebraic rewritings to optimize the loading process. We use the algebra of [Cluet and Moerkotte 1994], which we extend with side-effect functions to express the creation and update of new data. This algebra contains a set of well-known operators similar to relational operators, such as selection, projection and join, as well as a few others such as Group and Map. The operators in the algebra are defined on collections of tuples whose attributes can be of any type: atomic, tuple or collection (set and list). Therefore, operators in this algebra can manipulate any source and target data that can be translated into a set of tuples whose attributes could contain an atomic value, a tuple, a collection of atomic values or a collection of tuples.


3.1 Algebra

A loading process can be seen as a three-step process on data: exporting source data, restructuring it and importing it into the target system. The first two steps are easily expressible in the algebra of [Cluet and Moerkotte 1994]. The last one requires extending this algebra by defining two functions to express side-effects on the target database. These side-effects are not random and make use of the result of the first two steps. This algebra is used as the basis for a general optimization framework to bulk load large volumes of structured data into a database. In the sequel, e is an algebraic expression that contains any sequence of operators; R is a set of tuples from source data; T is a set of tuples from target data; a, b are attributes; A is a list of attributes; t and u are tuple values; t.a is the value of attribute a in t and t.A is the value of the sub-tuple of t containing A; f is a function that will be defined later. The operators σ, π, ⋈ and Sort are defined in the same manner as in the relational algebra. We give the definition of Group and Map and refer the reader to [Cluet and Moerkotte 1994] for more details on the other operators.

    Group_{g; A; f}(e) = { t.A ∘ [g : f({ u | u ∈ e, u.A = t.A })] | t ∈ e }

Group: for each distinct value of A, creates a partition g that contains the result of evaluating f on the tuples of e that have the same value for A. For example, Group_{g; a1; a2}(e) creates one partition for each distinct value of a1; the partition contains the values of a2. To simplify, we write Group_{g; a1; a2}(e) for the version of Group whose function projects the tuples of each partition on a2. When the input tuples are sorted on the grouping attributes in A, we assume that the cost of a Group is negligible, since it is then equivalent to building the set that contains the group partitions associated with each unique value of the grouping attributes. As an example, given a relation R with integer attributes a, b and c:

    a  b  c
    1  3  53
    1  4  78
    2  6  5
    1  4  2
    1  3  34

Group_{g; a,b; c}(R) results in:

    a  b  g
    1  3  {53, 34}
    1  4  {78, 2}
    2  6  {5}

Map is an operation that iterates on a collection to apply any function to data in that collection. In our case, we use Map for two purposes: (i) to create new data from input data and (ii) to transform a set of arbitrary values into a set of tuples to be used by other operators (and vice versa). Map is defined as follows:

    Map_f(e) = { f(t) | t ∈ e }


This definition applies a function f to each tuple t in an expression e. In order to keep the result of applying f in the data flow of an algebraic expression, we need to assign it to a new attribute:

    Map_{a: f}(e) = { t ∘ [a : f(t)] | t ∈ e }
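To pin the semantics down, here is one possible Python reading of Group and Map over lists of dicts standing in for tuples. This is our illustration of the definitions above, not code from the paper.

```python
def Map(f, e, a=None):
    """Map_f(e) = { f(t) | t in e }; when a is given, implements
    Map_{a:f}(e), which keeps t in the flow and adds a = f(t)."""
    if a is None:
        return [f(t) for t in e]
    return [{**t, a: f(t)} for t in e]

def Group(g, A, f, e):
    """Group_{g;A;f}(e): one output tuple per distinct value of the
    grouping attributes A, extended with a partition g built by f."""
    out, seen = [], {}
    for t in e:
        key = tuple(t[a] for a in A)
        if key not in seen:
            seen[key] = {**{a: t[a] for a in A}, g: []}
            out.append(seen[key])
        seen[key][g].append(t)
    for row in out:
        row[g] = f(row[g])
    return out

R = [{"a": 1, "b": 3, "c": 53}, {"a": 1, "b": 4, "c": 78},
     {"a": 2, "b": 6, "c": 5},  {"a": 1, "b": 4, "c": 2},
     {"a": 1, "b": 3, "c": 34}]
# Reproduces the example above: partitions of c values per distinct (a, b).
print(Group("g", ["a", "b"], lambda ts: [t["c"] for t in ts], R))
# [{'a': 1, 'b': 3, 'g': [53, 34]}, {'a': 1, 'b': 4, 'g': [78, 2]},
#  {'a': 2, 'b': 6, 'g': [5]}]
```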
    <!ELEMENT article (%field;)*>
    <!ATTLIST article key      CDATA #REQUIRED
                      reviewid CDATA #IMPLIED
                      rating   CDATA #IMPLIED>

Fig. 2. Naive Algebraic Representation (XML-to-Relational):

    Block BArt (input: articles):
      Map  k:   getAttribute(key)
           rev: getAttribute(reviewid)
           rat: getAttribute(rating)
      Map  t: condnew(articles, rArticle, k)
           t.article_id = k
           t.reviewid = rev
           t.rating = rat

    Block BAut/parent (input: authors):
      Map  id:   getID()
           p:    getParent(id)
           name: getContent(id)
      Map  t: condnew(authors, rAuthor, id)
           t.author_id = id
           t.name = name
           t.parent = p

We are given a relational schema that maps article elements in the XML document to tuples in the rArticle relation and author elements in the XML document to tuples in the rAuthor relation.

    rArticle[article_id: integer, reviewid: string, rating: real]
    rAuthor[author_id: integer, name: string, parent: integer]


The fields article_id, reviewid and rating correspond to the article attributes key, reviewid and rating respectively. In rAuthor, the attribute author_id is a distinct integer value that is assigned to each author in the XML document. The attribute name corresponds to the content of the element author and the attribute parent points to the article_id of the article of which the author is a child in the original document. Several other XML-to-relational mappings are possible [Amer-Yahia and Fernandez 2002] but this is not the goal of this paper. Figure 2 shows the algebraic representation that creates a relational database from an XML document. The database conforms to the given relational schema. The algebraic expression is composed of two blocks. The first block, BArt, creates articles in the rArticle relation. One article is created from each distinct article element in the XML document. The value of each attribute is also computed. The second block, BAut/parent, creates authors and computes their attributes. Let us look more closely at BArt. This block takes as input a set of tuples articles that contain the identifiers of all article elements in the XML document. As an example, if a DOM interface is used to read the XML document, computing the articles set would correspond to using a function similar to getElementsByTagName('article') to retrieve pointers to article elements (see http://www.w3.org/TR/DOM-Level-3-XPath/ for more details on DOM-like interfaces). At the level of the algebra, we could define finer operators that characterize this or remain at the level given in the example. This choice depends on how fine we want the optimization to be. The first Map operator in BArt augments each tuple in articles with its key, reviewid and rating attributes. The function used for this purpose, getAttribute(), could be provided by the interface for reading the XML document (e.g., DOM). The second Map operator applies side-effect functions to each tuple in the data flow in order to create the corresponding relational tuples and compute their attributes. Note the use of condnew(t) in order to keep each created tuple t in the data flow and apply update operations to it. The algebraic representation provided in Figure 2 is not the only possible way to create the intended relational database from the XML document. Figure 3 is an example of an equivalent loading program where the XML document is read only once (in a SAX-like fashion) and the target data is created in a single block. Section 3.3 gives more rewriting examples and Section 3.4 describes them formally. The second example is the relational-to-object mapping given in Section 4.1. This example loads a relational database into an object one. The object schema contains classes Researcher and Project and their complex attributes: projects and staff. Figure 4 shows the algebraic representation corresponding to the creation of the target object database from the source relational one. This representation is derived from a given schema mapping. We refer to this representation as the naive representation. It contains four blocks. Block BRes (resp. BPrj) populates the class Researcher (resp. Project) while blocks Bprojects and Bstaff compute the values of the complex attributes projects and staff. This algebraic representation corresponds to the first program given in Section 2.1 with a variation on the use of a set assignment operation to compute the values of the complex attributes staff and projects. (Unlike the case where insertion into a set is used (+=), a grouping operation is needed for set assignment.)
Unlike the case where insertion into a set is used (+=), a grouping operation is needed for set assignment. ACM Transactions on Database Systems, Vol. , No. , 2004.


Fig. 3. Other Algebraic Representation (XML-to-Relational):

    Block BAll (input: articles):
      Map  k:   getAttribute(key)
           rev: getAttribute(reviewid)
           rat: getAttribute(rating)
           c:   getChildren('author')
      Map  t: condnew(articles, rArticle, k)
           t.article_id = k
           t.reviewid = rev
           t.rating = rat
      Map(c)  id:   getID()
              p:    getParent(id)
              name: getContent(id)
      Map(c)  t1: condnew(authors, rAuthor, id)
              t1.author_id = id
              t1.name = name
              t1.parent = p
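The creation functions new, condnew and getnew appear in the figures without definitions at this point; one plausible Python reading, with the correspondence table made explicit, is sketched below. The class and method signatures are our assumptions; the idempotent creation keyed on the source value follows the text.

```python
class TargetStore:
    def __init__(self):
        self.objects = []            # the target database state (DS)
        self.correspondence = {}     # (source, class) -> {key: object}

    def new(self, source, cls, key_value, **attrs):
        """Unconditional creation: usable when each object is reached once."""
        obj = {"class": cls, "key": key_value, **attrs}
        self.objects.append(obj)
        return obj

    def condnew(self, source, cls, key_value, **attrs):
        """Idempotent creation: return the object already recorded for this
        source key in the correspondence table, else create and record it."""
        table = self.correspondence.setdefault((source, cls), {})
        if key_value not in table:
            table[key_value] = self.new(source, cls, key_value, **attrs)
        return table[key_value]

    def getnew(self, source, cls, key_value):
        """Lookup of an object assumed to exist in the correspondence table."""
        return self.correspondence[(source, cls)][key_value]

# Usage as in block BArt: each article element yields at most one rArticle row.
db = TargetStore()
t = db.condnew("articles", "rArticle", 7, reviewid="r1", rating=4.5)
t2 = db.condnew("articles", "rArticle", 7)   # second access: same object
assert t is t2 and len(db.objects) == 1
```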

3.3 Rewriting a Loading Process

Several loading algorithms which proceed differently from each other can create the same database (see the two examples in Section 2.1). The algebraic representations of two equivalent programs can look very different. They might contain a different number of blocks. They might create and update data in a different order. We study the equivalence of two algebraic representations not in terms of the data flows that they carry, but in terms of the effect that each of them has on the target database. In order to do so, we define new algebraic equivalences based on side-effects. These equivalences are used to rewrite one algebraic representation (potentially containing several blocks) into another one. We call these new equivalence rules DS-equivalences; DS stands for "Database State". DS-equivalences are less restrictive than standard algebraic rewritings: two algebraic expressions might be DS-equivalent even if they do not carry the same data flow. If e1 and e2 are two algebraic expressions, e1 ≡ e2 implies e1 ≡_DS e2. DS represents the evolution of the state of the target database, which is the set of all created target values at any time of the loading process. Figure 5 shows the evolution of DS for block BRes. The only operation which modifies DS is Map (since it is the one which applies side-effects). At the end of a block, its algebraic result (data flow) is "thrown away" whereas DS is given as input to the following block.

Fig. 4. Naive Algebraic Representation (Relational-to-Object):

    Block BRes:
      rEmployee
      σ function = 'Researcher'
      Map  o: condnew(rEmployee, Researcher, emp)
           o.name = lname + fname
           o.salary = salary

    Block BPrj:
      rProject
      σ address = 'Rocquencourt'
      Map  o: condnew(rProject, Project, name)
           o.name = name

    Block Bstaff:
      rEmpPrj ⋈(name = project) rProject [σ address = 'Rocquencourt']
              ⋈(emp = employee) rEmployee [σ function = 'Researcher']
      Group g ; name ; emp
      Map  condnew(rProject, Project, name).staff =
             Map(g) condnew(rEmployee, Researcher, emp)

    Block Bprojects:
      rEmpPrj ⋈(name = project) rProject [σ address = 'Rocquencourt']
              ⋈(emp = employee) rEmployee [σ function = 'Researcher']
      Group g ; emp ; name
      Map  condnew(rEmployee, Researcher, emp).projects =
             Map(g) condnew(rProject, Project, name)

The figure shows the DS obtained after each block of the naive algebraic representation of our example. Note that because of the properties of our side-effect functions (idempotence and insensitivity to null values), two algebraic representations that contain the same blocks (with no new function) but in a different order are DS-equivalent. As an example, no matter in which order the four blocks of the naive representation in Figure 4 are executed, the resulting object database will be the same. We have extended the principle of algebraic optimization to the case of rewriting algebraic expressions that contain side-effect operations. These rewritings make use of the two properties of creation and update functions defined in Section 4: idempotence and insensitivity to null values. In fact, these rewritings rely on standard equivalences extended to allow the presence of redundant and null values in the data flow. We have classified them into two kinds: (i) inter-block equivalences that transform two separate algebraic expressions into a single one and (ii) intra-block equivalences that apply rewriting rules within a single expression. A detailed presentation of the inter-block equivalences is given in Section 3.4; a detailed presentation of the intra-block equivalences is given in Appendix A. For now, we illustrate them using three examples where we introduce block splitting, clustering and ordering.

3.3.1 Merging All Four Blocks. The first rewriting (see Figure 6) merges all the blocks in the naive algebraic representation presented in Figure 4 into a single one. The resulting block, Ball, is a new algebraic representation that creates the same database as the one created by the initial four blocks. Note that Ball corresponds to the second program in Section 2.1. Let us look more carefully at this merge. The join operations of blocks Bstaff and Bprojects have been factored out and transformed into outer-joins.

Fig. 5. Introducing the DS Parameter. (The figure traces block BRes — σ function = 'Researcher' over rEmployee, followed by Map o: condnew(emp); o.resName = name; o.salary = salary — starting from an empty target database. After BRes, DS = all researchers created and their attributes valued; after BPrj, DS = previous DS plus all projects created and their attributes valued; after Bstaff, DS = previous DS plus the staff attribute of each project valued; after Bprojects, DS = previous DS plus the projects attribute of each researcher valued, and the entire database is populated. The data flow of each block is discarded; only DS is passed to the following block.)

Outer-joins guarantee that even researchers and projects that do not participate in a staff/projects relationship are created. If we know that all projects and researchers participate in this relationship, the outer-joins can be replaced by joins (we do not perform this semantic equivalence). The Group operation used to compute the values of the projects attribute has been replaced by a set insertion (+= assignment). Researchers are created and updated in an embedded Map operation. All Map operations have been rewritten into a single one. We introduced a Sort operation on the project name. Note that we can use the new creation function to create projects because project objects are accessed only once; thus, the correspondence table (Project, rProject) is not necessary. Block Ball has five positive characteristics. Since instances of the class Project are accessed only once (at creation time), the (Project, rProject) correspondence table is not necessary and thus is not created.

Fig. 6. Merging All Four Blocks:

    Block Ball:
      rEmpPrj outer-⋈(name = project) rProject [σ address = 'Rocquencourt']
              outer-⋈(employee = emp) rEmployee [σ function = 'Researcher']
      π name, emp, lname, fname, salary
      Sort(name)
      Group g ; name ; emp, lname, fname, salary
      Map  o: new(rProject, Project, name)
           o.name = name;
           o.staff = Map(g) condnew(rEmployee, Researcher, emp);
                            condnew(rEmployee, Researcher, emp).name = lname + fname;
                            condnew(rEmployee, Researcher, emp).salary = salary;
                            condnew(rEmployee, Researcher, emp).projects += {o}

However, the (Researcher, rEmployee) correspondence table has to be maintained: a researcher may belong to several projects and therefore may be accessed several times by condnew(). Moreover, since all four blocks have been merged into one, the number of scans and joins performed in Ball is smaller than in the previous four blocks. In addition, we do not make any load-balancing assumption. Each operation in the block can be performed at the source system, at the target one or at a "middleware" system (we will show how we use the cost model to enforce that an operation has to be performed at a given site). Assuming an incremental loading inside the object database, we maintain the referential integrity of the staff/projects relationship: i.e., at all times, a project pointing to a researcher is also pointed to by the researcher. Finally, compared to the initial algebraic representation (that contains four blocks), the expression given in Ball provides better support for the following user specification:

    class Project
      order by name
      cluster on staff(Researcher)

The negative aspects of Ball are: the Group operation is performed on a very large set, implying large buffer space and expensive computation. In addition, depending on the join selectivity, if we perform the joins at the relational site, the global communication cost might be much higher than that of the naive representation, i.e., the one containing four blocks, because a larger amount of data might be shipped at a time. Finally, in the case where Ball is used, it is harder to "slice" the loading process into multiple transactions.

3.3.2 Splitting and Merging. The second rewriting is illustrated in Figure 7. Each of BPrj and Bstaff has been split into two blocks using the predicate topic = 'Network'. Block BRes has been merged with the subparts of blocks BPrj and Bstaff related to Network projects. Thus, at the end of block B'P/R/staff, all researchers have been created, as well as all Network projects along with their staff attribute. The remaining parts of the BPrj and Bstaff blocks follow; the expression ends with an unmodified block Bprojects.

Fig. 7. Splitting According to User Specifications:

    Block B'P/R/staff:
      rEmpPrj ⋈(name = project) rProject [σ address = 'Rocquencourt' and topic = 'Network']
              ⋈(employee = emp) π emp (result of BRes)
      Group g ; name ; emp
      Map  o: new(rProject, Project, name);
           o.name = name;
           o.staff = Map(g) getnew(rEmployee, Researcher, emp)

    Block B'Prj   = BPrj with selection topic ≠ 'Network'
    Block B'staff = Bstaff with selection topic ≠ 'Network'
    Block Bprojects (unchanged)

This expression represents one possible way to take the following user clustering directive into account:

    class Project
      if topic = 'Network' cluster on staff

Note that we can rewrite the three last blocks in any possible way and still preserve this property. On the plus side, block B'P/R/staff features a smaller Group operation than the naive algebraic expression (only keys are grouped) and potentially smaller communication costs. Furthermore, there is no need to maintain a correspondence table for Network projects. On the negative side, this new representation enforces a join outside the relational system (the join employee = emp with the output of BRes) since it needs an input from BRes. Because of this, it also involves larger buffering at the object site.
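The splitting step itself is simple to picture; below is a tiny, illustrative Python sketch of partitioning a block's input on the user predicate so that the 'Network' slice can be handled first by the merged block. The data and helper names are hypothetical.

```python
def split_block(rows, predicate):
    """Split one block's input into the slice covered by a user directive
    (e.g. topic == 'Network') and the remainder, loaded by separate blocks."""
    selected = [r for r in rows if predicate(r)]
    rest = [r for r in rows if not predicate(r)]
    return selected, rest

projects = [{"name": "Verso", "topic": "Network"},
            {"name": "Rodin", "topic": "Databases"}]
network, others = split_block(projects, lambda r: r["topic"] == "Network")
# 'network' feeds the merged block B'P/R/staff (projects created clustered
# with their staff); 'others' feeds the B'Prj / B'staff blocks afterwards.
```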

Fig. 8. Using Relational Grouping:

    Block BPrj (unchanged).

    Block B'Res:
      rEmpPrj ⋈(name = project) rProject [σ address = 'Rocquencourt']
              ⋈(employee = emp) rEmployee [σ function = 'Researcher']
      Sort(emp) desc
      Group nb ; emp, lname, fname, salary ; count(name)
      Map  o: new(rEmployee, Researcher, emp);
           o.name = lname + fname;
           o.salary = salary

    Block Bstaff/projects:
      rEmpPrj ⋈(name = project) rProject [σ address = 'Rocquencourt']
              ⋈(employee = emp) rEmployee [σ function = 'Researcher']
      Sort(emp) desc
      Map  getnew(rEmployee, Researcher, emp).projects += {getnew(rProject, Project, name)};
           getnew(rProject, Project, name).staff += {getnew(rEmployee, Researcher, emp)}


3.3.3 Playing With Group Operations. Our last example (see Figure 8) shows, among other interesting features, how one can push a Group operation to the relational site by adding an aggregate function. The expression consists of three blocks: the unmodified BPrj that takes care of Project objects, B'Res that creates researchers and Bstaff/projects that materializes the relationship between researchers and projects. In block B'Res, a Sort operation is added to sort researchers on their key value. This Sort also simplifies the Group operation that follows, which associates to each researcher the number of projects it belongs to. Finally, researchers are created in a Map operation. Although we do not have all the information concerning the projects that a researcher belongs to, we know the size of the corresponding projects attribute. Thus, assuming the following cluster statement, we are in a position to organize the object storage in an appropriate manner:

    class Researcher
      order by desc emp
      cluster on projects

As was stated above, the expression provides support for the cluster directive specified above, pushes the Group operation to the relational site and applies the aggregate function count(). Assuming that the relational system is more efficient than the object system, this latter characteristic is interesting. Even if we want to avoid putting extra load on the relational system, this is still interesting since the grouping can be performed more efficiently (no need to create sets). Furthermore, by rewriting the expression appropriately, we could materialize the result of the join and sort operations to avoid evaluating them twice. Unfortunately, this rewriting entails more processing at the object site and the need to create two correspondence tables and access them many times (block Bstaff/projects). However, note that researchers (which should be the largest table) have been sorted on their key at creation time. Thus, we can rely on a sequential access to the correspondence table and to the object store (if it reflects the order of creation) to avoid unnecessary input/output operations. The following section defines our rewriting rules more formally.
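Pushing the Sort and the Group-with-count to the relational source can be mimicked with a single query sent over ODBC. The sketch below uses Python with an in-memory SQLite stand-in for the source; the query shape follows block B'Res, but the exact SQL the code generator would emit is not shown in the paper.

```python
import sqlite3

# Illustrative source data matching the running example.
conn = sqlite3.connect(":memory:")
conn.executescript("""
  CREATE TABLE rEmployee(emp INTEGER, lname TEXT, fname TEXT,
                         function TEXT, salary REAL);
  CREATE TABLE rProject(name TEXT, topic TEXT, address TEXT, boss INTEGER);
  CREATE TABLE rEmpPrj(project TEXT, employee INTEGER);
  INSERT INTO rEmployee VALUES (1,'Doe','Jane','Researcher',100.0);
  INSERT INTO rEmpPrj VALUES ('Verso',1),('Rodin',1);
  INSERT INTO rProject VALUES ('Verso','Network','Rocquencourt',1),
                              ('Rodin','Databases','Rocquencourt',1);
""")

# Sort and Group pushed to the relational site: each researcher arrives with
# nb, the size of its future projects attribute, in descending key order.
rows = conn.execute("""
  SELECT e.emp, e.lname, e.fname, e.salary, COUNT(p.name) AS nb
  FROM rEmployee e
  JOIN rEmpPrj ep ON ep.employee = e.emp
  JOIN rProject p ON p.name = ep.project
  WHERE e.function = 'Researcher' AND p.address = 'Rocquencourt'
  GROUP BY e.emp, e.lname, e.fname, e.salary
  ORDER BY e.emp DESC
""").fetchall()

for emp, lname, fname, salary, nb in rows:
    # nb lets the loader reserve clustered space for the projects attribute.
    print(emp, lname + fname, salary, nb)
```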

3.4 Algebraic Equivalences

Intuitively, most of our equivalences are based on the idea that two algebraic expressions are equivalent if the application of the same creation and update functions to the algebraic result (data flow) of each of them has the same side-effect on the database. We classify them into two kinds: (i) inter-block equivalences that transform two separate blocks into a single one and (ii) intra-block equivalences that apply rewriting rules to a single block. We describe inter-block equivalences; intra-block ones are given in Appendix A.

3.4.1 Inter-block Equivalences. These equivalences accept two input blocks and transform them into one or two blocks. In the schema below, we want to prove that B1 followed by B2 is equivalent to B. Plain arrows represent the data flow (that is discarded at the end of each block) and dotted ones represent the database state (DS).

Requirement: These equivalences are possible only if applying an expression e2 on the result of expression e1 does not modify the content of e1 that will then be used by e3. The operations contained in e2 should not modify e1 (otherwise, the equivalence would not hold). This requirement is always true when e2 contains only Map operations. Map operations, as they have been defined in Section 3.1, either keep the data flow as it is or extend it with more information:


Map_{a: f}(e1) extends each tuple t in e1 with an attribute a that contains the result of evaluating f on t. Thus, if e1 initially contains tuples of the form [a1: v1, ..., an: vn], the result of Map_{a: f}(e1) is a set of tuples of the form [a: v, a1: v1, ..., an: vn], where v is the result of evaluating f([a1: v1, ..., an: vn]). By projecting this result on the initial attributes of e1, we get back all of e1's tuples. Since e1 is not thrown away at the end of block B1, we can apply e3 to it.

[Diagrams: on the left, block B1 evaluates e1, producing database state DS1 and passing it on; block B2 then evaluates e2 and e3 over e1's result, producing DS. On the right, a single block B evaluates e1, e2 and e3; this is what we want to prove DS-equivalent. The proof inserts a projection π_e1 after e2 inside the merged block: since e2 is additive (it does not remove tuples from e1), π_e1 recovers e1's tuples for e3, so B has the same side-effects as B1 followed by B2.]
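Operationally, the additivity requirement is easy to check: a Map_{a:f} only extends tuples, so projecting its output back on e1's attributes recovers e1 exactly for the following block. A toy Python check (our illustration, reusing the dict encoding of tuples from Section 3.1):

```python
e1 = [{"a1": 1, "a2": "x"}, {"a1": 2, "a2": "y"}]

# Map_{a:f}: extend each tuple with a = f(t); nothing is removed or changed.
extended = [{**t, "a": t["a1"] * 10} for t in e1]

# Projecting on e1's original attributes gives e1 back, so a following
# expression e3 can still consume e1 after the Map has run.
projected = [{k: t[k] for k in ("a1", "a2")} for t in extended]
assert projected == e1
```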

First Example: The following equivalence is a specific example:

[Diagram: block B1 applies Map f1 to e1; block B2 then applies Map a2 : f2 to e1, producing e2 and the final DS. This sequence is DS-equivalent to a single block B that applies Map a1 : f1 followed by Map a2 : f2 to e1, producing the same e2: rewriting the side-effect Map f1 as Map a1 : f1 keeps e1's tuples in the data flow for the second Map.]

Proof: We can see that in the left part, Map_{f1}(e1) = { f1(t) | t ∈ e1 } is first computed, then Map_{a2: f2}(e1) = { t ∘ [a2 : f2(t)] | t ∈ e1 }
