A Framework for Designing Materialized Views in Data Warehousing Environment

J. Yang

K. Karlapalem

Q. Li

Technical Report HKUST-CS96-35 October 1996

Department of Computer Science The Hong Kong University of Science & Technology Clear Water Bay, Kowloon, Hong Kong 

Abstract

Data warehouses may contain multiple views with different query frequencies. When these views are related to each other and defined over overlapping portions of the base data, then it may be more efficient not to materialize all the views, but rather to materialize certain "shared" portions of the base data, from which the warehouse views can be derived. In this paper, we address some issues related to determining this set of shared views to be materialized in order to achieve the best combination of good performance and low maintenance. We further develop an algorithm for achieving this goal; we also discuss some existing research work which can be applied to materialized view design within our framework.

The Hong Kong University of Science & Technology

Technical Report Series Department of Computer Science

A Framework for Designing Materialized Views in Data Warehousing Environment

Jian Yang

Kamalakar Karlapalem

Qing Li

Dept of Computer Science
Hong Kong University of Science & Technology
Clear Water Bay, Kowloon, Hong Kong
{kamal, [email protected]

Dept of Computer Science
University College, University of New South Wales
Canberra ACT 2600, Australia
[email protected]

1 Introduction and Motivation

In many large organizations a number of localized databases are maintained and managed to support a set of applications. Typically, these localized databases are owned by the departments or sections of an organization and are autonomous. With the organizational data spread over various parts of the organization, there is a need to provide a common platform over these localized databases that supports global data processing applications. These applications cater for maintaining the consistency of local databases, generating consolidated global reports, and supporting local and global on-line transaction processing and decision support systems. Providing integrated access to multiple, distributed, heterogeneous databases and other information sources has become one of the leading issues in database research and industry [1]. The traditional approach to this problem is based on a very general two-step process: (1) determine the appropriate set of information sources to answer the query, and generate subqueries for each source; (2) gather the results from the information sources, combine them, and return the final answer to the user. This approach is referred to as a lazy or on-demand approach to data integration [2], and often uses virtual view techniques. The more recent approach, on the other hand, is to extract and integrate the information of interest from each source in advance and store it in a centralized repository. When a query is posed, it is evaluated directly at the repository, without accessing the original information sources. We refer to this approach as data warehousing, since the repository serves as a warehouse storing the data of interest. One of the techniques this approach uses is materialized views. The virtual approach may be better if the information sources change frequently, whereas the materialized approach may be better if the information sources change infrequently and very fast query response time is needed.

[Figure 1: The Architecture of Data Warehousing Systems. The diagram shows global applications and decision support applications, new and existing local applications, a multilevel materialized view layer with a shared materialized view set, a local-member databases coordinator with a knowledge base holding schema correspondences and conflict resolution information, wrappers over the local databases, the coordination site and the local sites, the member databases (materialized views of the local databases), and the local databases themselves.]

The virtual and materialized view approaches represent two ends of a vast spectrum of possibilities. In [3], the authors pointed out that hybrid integrated views, i.e., combinations of fully materialized and virtual views, are beneficial, and they provided a framework for data integration using the materialized and virtual approaches; however, they did not develop guidelines for determining which views should be materialized and which should be virtual. Our approach for the homogenization and integration of disparate databases is supported by the versatile architecture shown in Figure 1, which is a modified version of [4]. Wrappers [5, 6] convert data from each source into a common model and also provide a common query language. The knowledge base stores information such as the correspondences between local database schemas and the conflict resolutions for mismatched structures of local database schemas. These components are common to many integration projects [7, 5, 8, 9, 10]. However, the focus of our project is on the following two components:

- Member databases: these are databases derived through what we call the "mirroring process" [4]. In particular, each member database is a mirror copy of a single localized database. The purpose of the mirroring process is to convert the localized heterogeneous databases into a set of homogeneous databases which can be efficiently managed by a single robust DBMS. A member database can be created as a view (either virtual or materialized) of its corresponding localized database. These views are derived based on the information stored in the knowledge base, e.g., the schema correspondences of the local databases and structural conflict resolution, as well as the requirements of new applications. Whether a member database view should be materialized or not is decided based on the cost of view maintenance and of data communication between different sites.

- Shared materialized view set: for different types of analysis, a data warehouse may contain multiple views. When these views are related to each other, e.g., if they are defined over overlapping portions of the base data, then it may be more efficient not to materialize all of the views, but rather to materialize certain commonly shared views, or portions of the base data, from which the warehouse views can be derived.

In this paper, we concentrate only on the second issue, and assume that the relations of the member databases have already been obtained. Current research on materialized views focuses mainly on techniques for their processing and maintenance [11]; methodologies for materialized view design, such as how to determine the set of materialized views based on the applications, have scarcely been discussed. The framework presented in this paper highlights some issues in materialized view design in a distributed data warehouse environment. The framework is based on the specification of a Multiple View Processing Plan (MVPP), in terms of which the problems are formally presented. Furthermore, the cost model for materialized view design is analyzed in terms of performance as well as view maintenance. In particular, algorithms for generating MVPPs and for determining the set of intermediate nodes to be materialized are presented and analyzed. The discussion here is presented in terms of the relational model with select, project, and join operations. We believe that our approach can be extended to include more complex operations of the relational model, such as queries with aggregation functions and recursive queries.

The outline of the paper is as follows. Section 2 uses a simple example to illustrate different alternatives for materialized view design, while Section 3 presents formal notions of the problem and analyses related work done in the past. In Section 4, we describe an algorithm for materialized view design, and in Section 5 we conclude the presentation of our materialized view design methodology by summarizing our results and suggesting some ideas for future work.

2 A Motivating Example

This section presents an example to give a progressive overview of several key aspects of our materialized view design methodology. Suppose that the member databases contain the following relations:

Product (Pid, name, Did)
Division (Did, name, city)
Order (Pid, Cid, quantity, date)
Customer (Cid, name, city)
Part (Tid, name, Pid, supplier)

We use the shorthands Pd, Div, Ord, Cust, and Pt to stand for the above relations, respectively. For simplicity, we assume here that these relations are all at the same site, so we do not consider data communication cost in the following calculations.

Suppose that we have the following two frequently asked data warehouse queries:

Query 1: Select Pd.name
         From Pd, Div
         Where Div.city = "LA" and Pd.Did = Div.Did

Query 2: Select Pt.name
         From Pd, Pt, Div
         Where Div.city = "LA" and Pd.Did = Div.Did and Pt.Pid = Pd.Pid

Figure 2 (a) gives one access plan for each of the above queries. In order to achieve fast response times, we can materialize some intermediate nodes of the access plan of each individual query; the materialized view maintenance costs should then be taken into account as well when we calculate the total cost. We notice that tmp2 of Query 1 is equivalent to tmp2 of Query 2 in Figure 2 (a); this is called a common subexpression in [12]. Therefore we can merge the two plans into one plan, as shown in Figure 2 (b). If we choose node tmp1 to be materialized, since it can be used for both Query 1 and Query 2, the query cost for these two queries will be less than accessing the base relations Product, Division and Part directly, and the total maintenance cost will be less than that of maintaining two copies of tmp1 in the local plans. Overall we gain in terms of the total cost of global access and view maintenance.

Now suppose we have another two frequently asked data warehouse queries:

Query 3: Select Cust.name, Pd.name, quantity
         From Pd, Div, Ord, Cust
         Where Div.city = "LA" and Pd.Did = Div.Did and Pd.Pid = Ord.Pid
               and Ord.Cid = Cust.Cid and date > 7/1/96

Query 4: Select Cust.city, date
         From Ord, Cust
         Where quantity > 100 and Ord.Cid = Cust.Cid
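To make the shared subexpression concrete, the following small sketch (ours, not from the paper; the tuple encoding of the operations is an illustrative assumption) detects the overlap between the two plans by comparing the expressions that their intermediate nodes compute.

    # Each plan maps an intermediate node to a canonical description of the
    # expression it computes; the nested-tuple encoding below is only a shorthand.
    plan_q1 = {
        "tmp1": ("select", 'city = "LA"', "Division"),
        "tmp2": ("join", ("select", 'city = "LA"', "Division"), "Product"),
    }
    plan_q2 = {
        "tmp1": ("select", 'city = "LA"', "Division"),
        "tmp2": ("join", ("select", 'city = "LA"', "Division"), "Product"),
        "tmp3": ("join", ("join", ("select", 'city = "LA"', "Division"), "Product"), "Part"),
    }
    shared = {name for name, expr in plan_q1.items() if plan_q2.get(name) == expr}
    print(shared)  # tmp1 and tmp2 are common to both plans and can become one shared node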

Figure 3 shows a global query access plan for the above four queries, in which the local access plans of the individual queries are combined based on shared operations over common data sets. We call it a Multiple View Processing Plan (MVPP). Now we have to decide which node(s) to materialize so that the query cost and view maintenance cost are minimal. It is obvious from this graph that we have several alternatives for choosing the set of materialized views: e.g., (1) materialize all the application queries; (2) materialize some of the intermediate nodes (e.g., tmp1, tmp2, tmp4, etc.); (3) leave all the non-leaf nodes virtual. The cost of each alternative shall be calculated in terms of query processing and view maintenance.

[Figure 2: Individual Query Processing Graph. (a) Separate access plans for Query 1 and Query 2 over Product, Division, and Part; (b) the merged plan in which the two queries share tmp1 (the selection city = "LA" on Division) and tmp2 (its join with Product).]

[Figure 3: An MVPP for the Example. The merged processing plan for Queries 1 to 4 over the base relations Product, Division, Part, Order, and Customer; each operation node is annotated with its processing cost in block accesses, and each query node with its access frequency (10, 0.5, 0.8, and 5 for Queries 1 to 4).]

The assumed sizes of the relations and other related statistical data are listed in Table 1. Let s stand for the selectivity of the selection conditions of the above queries, and js stand for the join selectivity of a relation involved in a join operation. Here we assume that the select and join operations are implemented by linear search and nested loops, respectively. The cost of each operation node in Figure 3, computed from the data in Table 1, is labeled at the right side of the node; the costs are measured in block accesses. For example, the cost of obtaining tmp3 from tmp2 and Part is 50.06 million block accesses. For simplicity, we assume that all the member database relations Product, Division, Part, Order, and Customer are updated only once within a certain period of time, while within the same period the number of accesses for each query is 10 for Query 1, 0.5 for Query 2, 0.8 for Query 3, and 5 for Query 4; these frequencies are labeled on top of the query nodes in Figure 3.

Now we shall calculate the costs of different view materialization strategies. Suppose some intermediate nodes are materialized. For each query, the cost of query processing is the query frequency multiplied by the cost of accessing the query result from the materialized node(s). The maintenance cost of a materialized view is the cost of constructing the view (here we assume that recomputation is used whenever an update of an involved base relation occurs). For example, if tmp2 is materialized, the query processing cost for Query 1 is 10 × 0.1k, and the view maintenance cost is 35.25k. The total cost of an MVPP is the sum of all query processing and view maintenance costs. Our goal is to find a set of nodes to be materialized such that the total cost is minimal.
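As a quick check of this bookkeeping, the following sketch (ours; it only reuses the numbers quoted above) adds up the per-period cost attributable to Query 1 when tmp2 is materialized.

    K = 1_000                              # costs are measured in block accesses; k = thousand
    query1_frequency = 10                  # accesses per period (from Figure 3)
    access_cost_from_tmp2 = 0.1 * K        # cost of answering Query 1 from the materialized tmp2
    tmp2_maintenance = 35.25 * K           # cost of recomputing tmp2 once per period
    query1_cost = query1_frequency * access_cost_from_tmp2 + tmp2_maintenance
    print(query1_cost)                     # 36250.0 block accesses (query processing plus upkeep)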

relation                                   size of relation    s or js                no. of blocks
Product                                    30k records         js = 1/30k             3k
Division                                   5k records          s = 0.02, js = 1/5k    0.5k
Order                                      50k records         s = 0.5                6k
Customer                                   20k records         js = 1/20k             2k
Part                                       80k records                                10k
Product ⋈ Division                         30k records                                5k
Product ⋈ Division ⋈ Part                  80k records                                20k
Order ⋈ Customer                           25k records                                5k
Product ⋈ Division ⋈ Order ⋈ Customer      25k records                                5k

Table 1: Sizes of Relations and Statistical Data

Materialized views          Cost of query processing    Cost of maintenance    Total cost
Pd, Div, Pt, Ord, Cust      95.671m                     0                      95.671m
tmp2, tmp4, tmp6            85.237m                     12.583m                97.82m
tmp2, tmp6                  25.506m                     12.382m                37.888m
tmp2, tmp4                  25.512m                     12.065m                37.577m
Q1, Q2, Q3, Q4              7.25k                       62.653m                62.66m

Table 2: Costs for different view materialization strategies

In Table 2, we list some materialized view design strategies for the above example, together with their costs. From this table, we make the following observations:

- materializing all the application views in the data warehouse achieves the best performance at the highest maintenance cost;
- leaving all the application views virtual gives the poorest performance but the lowest maintenance cost;
- if the intermediate results of some operations are materialized and others are left virtual, especially when there are shared operations on common data, then we can achieve a result that balances performance and maintenance (e.g., materializing tmp2 and tmp4 is the best among all the listed strategies).

3 Specifications & Analysis of Materialized View Design

3.1 Formal description of the materialized view design problem

Materialized view design can be achieved with the help of a Multiple View Processing Plan (MVPP). An MVPP specifies the views (either materialized or virtual) that the data warehouse will maintain.

As will be defined formally below, an MVPP is a directed acyclic graph (dag) that represents a query processing strategy for the warehouse views (see Figure 2). The leaf nodes correspond to the base relations in the member databases, and the root nodes correspond to the warehouse queries. Analogous to query execution plans, different MVPPs for the same view specification may be appropriate under different query and update characteristics of the applications. We now present the definition of an MVPP and the annotations attached to it. Formally, an MVPP is a labeled dag M = (V, A, R, C_a, C_m, f_q, f_u), where V is a set of vertices and A is a set of arcs over V, such that:

- for every relational algebra operation of a query, a vertex is created;
- for v ∈ V, R(v) is the result relation generated by the corresponding vertex v;
- for any leaf vertex v (i.e., a vertex with no incoming edge), R(v) corresponds to a base relation in the member databases and is depicted as a box node. Let L be the set of leaf nodes; for any vertex v ∈ L, f_u(v) represents the update frequency of v;
- for any root vertex v (i.e., a vertex with no outgoing edge), R(v) corresponds to a global query and is depicted as a distinguished query node. Let R be the set of root nodes; for every vertex v ∈ R, f_q(v) represents the query frequency of v;
- if R(u) is used in producing R(v), an arc u → v is introduced;
- for every vertex v, let S(v) denote the set of source nodes that have edges pointing to v; for any v ∈ L, S(v) = ∅. Let S*(v) = S(v) ∪ (∪_{v'∈S(v)} S*(v')) be the set of descendants of v;
- for every vertex v, let D(v) denote the set of destination nodes to which v points; for any v ∈ R, D(v) = ∅. Let D*(v) = D(v) ∪ (∪_{v'∈D(v)} D*(v')) be the set of ancestors of v;
- for v ∈ V, C_a(v) is the cost of producing R(v) from the base relations in the member databases, and C_m(v) is the cost of maintaining v if v is materialized; if v is a leaf vertex, then C_a(v) = 0 and C_m(v) = 0.

Now the problem of materialized view design can be described as follows:

1. find pairs u, v ∈ V such that S(u) = S(v) and R(u) = R(v); then R(u) and R(v) are common subexpressions and can be merged (e.g., the tmp2 nodes of Query 1 and Query 2 in Figure 2 (a) are common subexpressions and can be merged);
2. determine a set of vertices of V such that, if R(v) is materialized for every vertex v in that set, the cost of query processing and view maintenance is minimal.

In the next subsection, some existing research which can be applied to the issues addressed here is analyzed.
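For concreteness, here is a minimal sketch (ours, not the authors' code) of an MVPP vertex and of the S*(v) and D*(v) closures defined above; the class and field names are illustrative assumptions.

    from dataclasses import dataclass, field

    @dataclass(eq=False)   # eq=False keeps identity hashing, so vertices can be set members
    class Vertex:
        name: str                                          # e.g. "Product", "tmp2", "Query1"
        sources: list = field(default_factory=list)        # S(v): vertices with an arc into v
        destinations: list = field(default_factory=list)   # D(v): vertices v has an arc into
        ca: float = 0.0                                    # C_a(v): cost of producing R(v)
        cm: float = 0.0                                    # C_m(v): maintenance cost if materialized
        fq: float = 0.0                                    # f_q(v): query frequency (root vertices)
        fu: float = 0.0                                    # f_u(v): update frequency (leaf vertices)

    def add_arc(u, v):
        """Introduce the arc u -> v, i.e. R(u) is used in producing R(v)."""
        u.destinations.append(v)
        v.sources.append(u)

    def descendants(v):
        """S*(v): every vertex used, directly or transitively, to produce v."""
        out = set()
        for u in v.sources:
            out.add(u)
            out |= descendants(u)
        return out

    def ancestors(v):
        """D*(v): every vertex that uses v, directly or transitively."""
        out = set()
        for u in v.destinations:
            out.add(u)
            out |= ancestors(u)
        return out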

3.2 Analysis The rst issue discussed previously has been examined in the past in various contexts. [13, 14] used heuristics to identify common subexpressions, especially within a single query. They use operator trees to represent the queries and a bottom-up traversal procedure to identify common parts. [15] discussed the problem of common subexpression isolation. It presents several di erent formulations of the problem under various query language frameworks such as relational algebra, tuple calculus, and relational calculus. In the same paper, the author also described how common expressions can be detected and used according to their type (e.g., single relation restrictions, joins, etc). A lot of research has been in the area of multiple-query processing (MQP), which is related to the second issue discussed in the previous section. The e ort in this area is to nd an optimal execution plan for multiple queries executed at the same time based on the idea of the temporary result sharing should be less expensive compared to a serial execution of queries. In [16], the authors described the optimization of sets of queries in the context of deductive databases and proposed a two-stage optimization procedure: during the rst stage ("Preprocessor"), the system obtains at compile time information on the access structures that can be used in order to evaluate the queries; at the second stage, the "Optimizer" groups queries and executes them in a group instead of one at a time. In [17], the authors proposed an algorithm based on the construction of integrated query graphs. Using integrated query graphs, the authors suggested a generalization of the query decomposition algorithm. [12, 18] suggested a heuristic algorithm to solve the MQO problem. The algorithm performs a search over some state space de ned over access plans. What distinguishes our problem from common subexpression and MQP is as following:

 MQP is to nd an optimal execution plan for multiple queries executed at the same

time by sharing some temporary results which are common subexpressions, while our problem is to nd a set of relations (which can be any intermediate result from query processing), to be materialized so that that the total cost (query accessing plus view maintenance) is optimal;  In MQP, a global access plan derived from the idea of temporary result sharing should be less expensive compared to a serial execution of queries. However, this cannot be true for any database state. For example, sharing temporary result may prove to be a bad decision when indexes on base relations are de ned. The cost of processing a selection through an index or through an existing temporary result clearly depends on the size of these two structures. While in our MVPP, if an intermediate result is materialized, we can establish a proper index on it afterwards if necessary. Therefore it is guaranteed that there is a performance gain if an intermediate result is materialized. If the intermediate result happens to be a common subexpression which can be shared by more than one query, then there is a view maintenance gain as well;  In MQP, the ultimate goal is to achieve the best performance, while our problem has to take consideration of both query and view maintenance cost. 9

 In MQP, the input is a set of queries and the output is a global optimal plan; while

in our problem, the inputs are: a set of global queries and their access frequencies, and a set of base relations and their update frequencies.

In summary, some of the techniques used in common subexpression isolation and MQP are applicable to MVPP design; however, our problem is more general and thus more complicated than MQP.

4 Cost Analysis and Algorithms for Materialized View Design

4.1 Cost analysis

Let M be a set of intermediate nodes of an MVPP to be materialized. Let f_q(i) and f_u(j) stand for the frequency of executing query i and the frequency of updating member database relation j, respectively, and let C(mv → r) and C(l → mv), for mv ∈ M, r ∈ R and l ∈ L, stand for the query access cost and the materialized view update cost, respectively. Then the query processing cost is

    C_queryprocessing = Σ_{i=1}^{n} f_q(i) × C(mv → r_i)

The materialized view maintenance cost is

    C_maintenance = Σ_{j=1}^{m} f_u(j) × C(l → mv_j)

Therefore the total cost is

    C_total = Σ_{i=1}^{n} f_q(i) × C(mv → r_i) + Σ_{j=1}^{m} f_u(j) × C(l → mv_j)

Our ultimate goal is to find the set M such that, if the members of M are materialized, the value of C_total is minimal among all the possibilities. It is obvious from the last formula that the determination of M depends on four factors: (1) the frequencies of global query access, (2) the frequencies of member database relation updates, (3) the costs of query processing from the materialized view(s), and (4) the costs of materialized view maintenance. In the following subsections, we present algorithms for generating multiple MVPPs and for determining the set of intermediate nodes of an MVPP to be materialized so that C_total is minimal. Note that in a distributed data warehouse environment, the cost C should also incorporate the cost of data transfer among the different sites.
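Read as code, the cost model amounts to the small sketch below (ours; the function and parameter names are illustrative). The second function spells out the exhaustive search over candidate sets M implied here, which Section 4.3 replaces with a heuristic.

    from itertools import chain, combinations

    def c_total(query_costs, update_costs):
        """C_total = sum_i f_q(i) * C(mv -> r_i) + sum_j f_u(j) * C(l -> mv_j).
        query_costs:  list of (f_q(i), C(mv -> r_i)) pairs, one per global query;
        update_costs: list of (f_u(j), C(l -> mv_j)) pairs, one per materialized view."""
        return (sum(fq * c for fq, c in query_costs)
                + sum(fu * c for fu, c in update_costs))

    def cheapest_materialization(candidate_nodes, costs_for):
        """Evaluate C_total for every subset M of the candidate nodes and return the
        cheapest one; costs_for(M) must return (query_costs, update_costs) for that choice."""
        subsets = chain.from_iterable(
            combinations(candidate_nodes, r) for r in range(len(candidate_nodes) + 1))
        return min(subsets, key=lambda m: c_total(*costs_for(frozenset(m))))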


4.2 An algorithm for multiple MVPP design

Normally, for one query there are several processing plans, among which one is optimal. Therefore we will have multiple MVPPs based on different combinations of the individual plans. In order to reduce the search space, we start from the individual optimal plans and order them based on the query access frequencies and costs. Once the order of the query plans is fixed, we pick up the first optimal plan and incorporate the second one into it, using common subexpressions wherever possible. After the first two are merged, the next plan is picked up and incorporated into the merged plan, and we keep doing this until all the plans are merged. We then repeat this procedure, incorporating all the other plans into the second most expensive plan, and so on, until every plan has served as the starting point. If there are k global plans, we end up with k MVPPs. For every MVPP generated, we run another algorithm (described in the next subsection), compare the total cost of each MVPP, and select the one with the lowest cost.

The basic ideas of our algorithm for generating an MVPP are as follows: (1) for every individual optimal plan, if a join operation is involved, push the select and project operations up along the tree; (2) for two such modified optimal query plans, find the common subexpressions of the join operations if they share the same source relations, and merge them; (3) push all the select and project operations back down as deep as possible; if more than one query shares a join operation and these queries have different select conditions on the attributes of the two base relations of the join, then the select condition pushed down to a base relation attribute is the disjunction of all the select conditions on that attribute, and the attributes projected for a base relation are the union of the projection attributes of the queries sharing the common join operation, plus the join attribute(s).

The algorithm for generating multiple MVPPs is presented in Figure 4. In this algorithm, step 4.3 merges the remaining elements of the list l into the current MVPP based on the join pattern of the current MVPP. The idea is to preserve the join pattern of the current MVPP and to try to find join operation nodes in the MVPP which can be reused by the individual optimal query plan op; if such nodes exist, op is evolved to use them, otherwise the join pattern of op is used.

Figure 5 presents the four optimal processing plans for the four queries, denoted op_1, op_2, op_3 and op_4, respectively. We first transform these plans into a form in which all the select and project operations are pushed up and the order of the leaf nodes is the same as the order in which they are joined. Based on the data in Figure 3, the values of f_q × C_a for these queries are 10 × 35.37k, 0.5 × 50.082m, 0.8 × 12.595m, and 5 × 12.044m, respectively. Therefore, initially, the list l = <op_4, op_2, op_3, op_1>. Based on the order in l, we keep op_4 as it is and merge the rest of the plans into it in the order given by l. The resulting MVPP(1) is presented in Figure 6 (a); for simplicity, we ignore the select and project operations. We start with MVPP(1) = op_4. When op_2 is merged with MVPP(1), we simply add the two plans together, since there is no overlap between their leaf nodes. When op_3 is merged, we can divide the leaf nodes of op_3 into two sets, {Customer, Order} and {Product, Division}, whose elements are already joined in MVPP(1). So a new node, which will be the source node of Query 3, is introduced as a join operation over the two existing results Customer ⋈ Order and Product ⋈ Division in MVPP(1); this new node is then linked to the Query 3 node, and all other nodes and associated edges below the Query 3 node are removed. When op_1 is merged, the leaf nodes of op_1 are already joined in MVPP(1), so we link the Product ⋈ Division node of MVPP(1) to the Query 1 node and remove all the join operation nodes of op_1.

begin
1. for each query q_i, generate an optimal query processing plan op_i;
2. for any query involving join operations, push up all the select and project operations;
3. create a list l = <op_1, op_2, ..., op_k>, in which the elements are in descending order of the values f_q(op_i) × C_a(op_i);
4. for n = 1 to k do
   4.1. pick up the first element l(1) of l, maintaining the order of the joins in l(1);
   4.2. MVPP(n) := l(1);
   4.3. for m = 2 to k do
        4.3.1. divide the leaf node set of op_m into several disjoint subsets, in the following order: (1) sets of leaf nodes that are already joined conjunctively in MVPP(n), such that one of the leaf nodes in the set is the first node of the join; (2) sets of leaf nodes that are not joined in MVPP(n) but are joined in op_m;
        4.3.2. find the common ancestor node of the elements of each subset, either in MVPP(n) or in op_m; create new node(s) to join these ancestor nodes; replace the final join operation node in op_m with the root of these new node(s); delete all the unused nodes and associated edges in op_m;
        4.3.3. MVPP(n) := MVPP(n) ∪ l(m);
        4.3.4. m := m + 1;
   4.4. n := n + 1;
   4.5. move l(1) to the end of the list;
5. for every leaf node v ∈ L of every MVPP, find all the relevant select conditions of the queries in R ∩ D*(v), take their disjunction, and push it down to v;
6. for every leaf node v ∈ L of every MVPP, find all the relevant project attributes of the queries in R ∩ D*(v), take their union, add the join attribute(s), and push the projection down to v;
end

Figure 4: Algorithm for Generating Multiple MVPPs
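The following sketch (our own simplification, not the authors' implementation) illustrates two pieces of the algorithm: the ordering of the plans in step 3, using the f_q × C_a values from Figure 3, and the partitioning of an incoming plan's leaf relations in step 4.3.1.

    # Step 3: order the individual optimal plans by descending f_q * C_a.
    plans = {                 # op name: (query frequency, plan cost C_a), from Figure 3
        "op1": (10, 35.37e3),
        "op2": (0.5, 50.082e6),
        "op3": (0.8, 12.595e6),
        "op4": (5, 12.044e6),
    }
    l = sorted(plans, key=lambda p: plans[p][0] * plans[p][1], reverse=True)
    print(l)                  # ['op4', 'op2', 'op3', 'op1'], as in the walkthrough above

    # Step 4.3.1, simplified: split the incoming plan's leaf relations into groups
    # that are already joined together in the current MVPP and a remainder that is not.
    def partition_leaves(mvpp_join_groups, plan_leaves):
        already_joined = [g for g in mvpp_join_groups if g <= plan_leaves]
        covered = set().union(*already_joined) if already_joined else set()
        return already_joined, plan_leaves - covered

    # Merging op3 into an MVPP that already joins {Order, Customer} (from op4)
    # and {Product, Division} (from op2): both groups are reused and nothing is
    # left unjoined, so op3 only needs one new join node over the two results.
    mvpp = [frozenset({"Order", "Customer"}), frozenset({"Product", "Division"})]
    print(partition_leaves(mvpp, {"Product", "Division", "Order", "Customer"}))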


[Figure 5: Individual Optimal Query Processing Plans. The optimal processing plans op_1, op_2, op_3, and op_4 for Queries 1 to 4.]

[Figure 6: Multiple MVPPs. The four merged plans (a), (b), (c), and (d) over Product, Division, Part, Order, and Customer, obtained by starting the merging procedure from different individual plans.]

After we generate the first MVPP, the first element of l is moved to the end of the list, so the list becomes l = <op_2, op_3, op_1, op_4>; the MVPP for this list is presented in Figure 6 (b). We repeat this procedure until every op has been the first element of l once.

After all the MVPPs are derived, we have to optimize each MVPP by pushing down the select and project operations as far as possible. What differentiates MVPP optimization from traditional heuristic query optimization is that in an MVPP several queries can share intermediate nodes, so there can be several unrelated select conditions on a base relation. Our approach to this problem is to take the union of the select conditions on a base relation that is shared by multiple queries. The way the project operations are pushed down is similar to the traditional approach, i.e., we take the union of the project attributes of the queries, including the join attributes. Figure 7 shows one of the MVPPs constructed after merging the individual plans. To optimize it, we again push down all the select and project operations as far as possible. For the Division relation, for example, the selections involved are city = "LA", name = "Re", and city = "SF", which come from Query 1, Query 2 and Query 3, respectively; therefore we can push the select condition city = "LA" ∨ city = "SF" ∨ name = "Re" down to the Division node.

[Figure 7: An MVPP Before Optimization.]

For the Product relation, the projected attributes are {name} ∪ {Did} ∪ {Pid}. The final, optimized MVPP is presented in Figure 8.

A follow-up step is to choose the best MVPP among all the derived ones, through a straightforward analysis and comparison of the (optimized) MVPPs. For example, in Figure 6 we can see that (a) and (b) are equivalent, and each keeps the three join patterns of op_1, op_2, and op_4; therefore the optimal plans of Queries 1, 2 and 4 are still maintained in MVPP (a) or (b). On the other hand, (c) is not a desirable MVPP, since it preserves the join pattern of op_3, which has the longest join path and cannot be shared with the others. Note that the algorithm in Figure 4 still does not guarantee that the optimal MVPP is always obtained (and never missed), since only a subset of the possible MVPPs (for a given set of queries) is considered. Nevertheless, we believe it captures a reasonable subset of MVPPs, out of which a satisfactory (and balanced) solution can be found with acceptable efficiency.
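The push-down rule just described is easy to state as code; the sketch below (ours, using the attribute and predicate names from the example) builds the disjunctive select condition for Division and the projection list for Product.

    # Select conditions contributed by Queries 1-3 on the shared Division relation.
    division_conditions = ['city = "LA"', 'city = "SF"', 'name = "Re"']
    pushed_select = " OR ".join(division_conditions)
    print(pushed_select)            # city = "LA" OR city = "SF" OR name = "Re"

    # Projection pushed down to Product: the union of the attributes the queries
    # project, plus the join attributes Did and Pid.
    projection_sets = [{"name"}, {"Did"}, {"Pid"}]
    join_attributes = {"Did", "Pid"}
    pushed_project = set().union(*projection_sets) | join_attributes
    print(sorted(pushed_project))   # ['Did', 'Pid', 'name']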

4.3 Algorithm for selecting intermediate nodes to be materialized

Given an MVPP, we could find a set of materialized views such that the total cost of query processing and view maintenance is minimal by comparing the cost of every possible combination of nodes. If there are n nodes in the MVPP excluding leaf nodes, however, we would have to try 2^n combinations, so we use a heuristic to reduce the search space. Before presenting our heuristic algorithm, we introduce the notation it uses:

- O_v denotes the set of global queries that use v: O_v = R ∩ D*(v), where D*(v) is the set of ancestors of v;
- I_v denotes the set of base relations that are used to produce v: I_v = L ∩ S*(v), where S*(v) is the set of descendants of v;
- w(v) denotes the weight of a node, calculated as w(v) = Σ_{q_i∈O_v} f_q(q_i) × C_a(v) - Σ_{b_i∈I_v} f_u(b_i) × C_m(v); the first part of this formula is the saving obtained if node v is materialized, and the second part is the cost of maintaining the materialized view;
- LV is the list of nodes in descending order of w(v);
- S_v = S*(v) is the set of nodes (leaf nodes and intermediate nodes) used to produce v;
- we suppose there are k queries, and let M be the set of materialized views.

[Figure 8: An MVPP After Optimization.]

16

begin
1. M := ∅;
2. create the list LV of all the nodes with positive weight, in descending order of their weights;
3. pick up the first node v from LV;
4. create O_v, I_v, and S_v;
5. calculate C_s = Σ_{q_i∈O_v} { f_q(q_i) × (C_a(v) - Σ_{u∈S_v∩M} C_a(u)) } - Σ_{b_j∈I_v} { f_u(b_j) × C_m(v) };
6. if C_s > 0, then
   6.1. insert v into M;
   6.2. remove v from LV;
7. otherwise remove from LV both v and all the nodes listed after v that are in the same branch as v;
8. repeat from step 3 until LV = ∅;
9. for every v ∈ M, if D(v) ⊆ M, then remove v from M;
end

Figure 9: Algorithm for Materialized View Design

After expanding the formula of step 5, C_s becomes:

    C_s = Σ_{q_i∈O_v} f_q(q_i) × C_a(v) - Σ_{b_i∈I_v} f_u(b_i) × C_m(v) - Σ_{q_i∈O_v} f_q(q_i) × (Σ_{u∈S_v∩M} C_a(u))
        = w(v) - Σ_{q_i∈O_v} f_q(q_i) × Σ_{u∈S_v∩M} C_a(u)

For example, if v_2 is a descendant of v_1 and w(v_1) > w(v_2), then the second part of the above formula is the same for v_1 and v_2; therefore, if materializing v_1 does not bring any gain, there will certainly be no gain in materializing v_2 either. In this way we can save some of the search space.

Now we run this algorithm on the example of Figure 3. Initially LV = <tmp4, result4, tmp7, tmp2, result1, tmp1> (we ignore the nodes whose weights are negative), and M = ∅. Let us start with tmp4: O_tmp4 = {Query3, Query4}, I_tmp4 = {Order, Customer}, and C_s = (5 + 0.8) × 12.03m - 12.03m = 57.744m > 0, so M = {tmp4}. The next node in LV is result4: O_result4 = {Query4}, I_result4 = {Order, Customer}, and C_s = 5 × (12.043m - 13.03m) - 12.043m < 0 (note that result4 has a descendant, tmp4, which is in M), so result4 should not be materialized. Since tmp7 is in the same branch as result4, tmp7 is removed from LV. For tmp2, C_s = 363.075k > 0, so tmp2 is inserted into M. For result1, C_s < 0; for tmp1, C_s > 0, but since its parent tmp2 is already in M, tmp1 is ignored. As a result, tmp2 and tmp4 will be materialized.
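To summarize the heuristic in executable form, here is a compact sketch of our own reading of Figure 9 (it reuses the Vertex class sketched after the problem statement in Section 3 and assumes O_v, I_v, S_v and the branch test are supplied by the caller; it is not the authors' implementation).

    def weight(v, O, I):
        """w(v) = sum of f_q(q) * C_a(v) over the queries using v, minus
        sum of f_u(b) * C_m(v) over the base relations used to produce v."""
        return sum(q.fq for q in O[v]) * v.ca - sum(b.fu for b in I[v]) * v.cm

    def select_views(nodes, O, I, S, same_branch):
        """Greedy selection of the set M of nodes to materialize (Figure 9, simplified).
        O[v]: queries using v; I[v]: base relations used to produce v; S[v]: set of
        descendants of v; same_branch(u, v): True if u lies on the same branch as v."""
        M = set()
        LV = sorted((v for v in nodes if weight(v, O, I) > 0),
                    key=lambda v: weight(v, O, I), reverse=True)
        while LV:
            v = LV.pop(0)
            already_saved = sum(u.ca for u in S[v] & M)          # materialized descendants of v
            cs = (sum(q.fq * (v.ca - already_saved) for q in O[v])
                  - sum(b.fu * v.cm for b in I[v]))
            if cs > 0:
                M.add(v)                                         # step 6
            else:
                LV = [u for u in LV if not same_branch(u, v)]    # step 7: prune the branch
        # Step 9 of Figure 9 (dropping a node whose immediate parents D(v) are all
        # materialized) would be applied to M afterwards; it is omitted here for brevity.
        return M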

5 Conclusions

We have outlined in this paper the issues in a multiple materialized view design methodology, e.g., how to select a set of intermediate results to be materialized so that the overall cost is minimal. We have also provided a systematic way of addressing the problem and designed an algorithm for it. The main motivation for performing inter-global-view analysis is the fact that common intermediate results may be shared among multiple global queries. The algorithm proposed for determining the set of materialized views is based on the idea of reusing temporary results from the execution of global queries, with the help of the Multiple View Processing Plan (MVPP). The cost model takes into consideration both query access frequencies and base relation update frequencies, and both query access costs and view maintenance costs, which makes our problem considerably more complicated than related work such as multiple-query processing. The algorithm for generating multiple MVPPs uses techniques from single query optimization, coupled with query tree merging techniques that aim to incorporate the individual optimal query plans into the MVPP as much as possible.

The work presented here is the outcome of the first stage of research in the Materialized View Design project at HKUST. We are working on materialized view design for more complicated queries, such as queries with aggregation functions and recursive queries, which occur in data warehousing environments. We are also working on the algorithm for choosing the best MVPP for materialized view design. Finally, we will focus on developing an analytical model for a multiple view processing environment; a good analytical model will allow us to simulate various environments with different view mixes.

References

[1] IEEE Computer. Special Issue on Heterogeneous Distributed Database Systems, 24(12), December 1991.
[2] Jennifer Widom. Research problems in data warehousing. In Proc. of the 4th Int'l Conference on Information and Knowledge Management (CIKM), November 1995.
[3] R. Hull and G. Zhou. A framework for supporting data integration using the materialized and virtual approaches. SIGMOD Record, 25(2):481-92, June 1996.
[4] K. Karlapalem, Q. Li, and C. Shum. HODFA: An architectural framework for homogenizing heterogeneous legacy databases. SIGMOD Record, 24(1), March 1995.
[5] M.J. Carey et al. Towards heterogeneous multimedia information systems: The Garlic approach. Technical Report RJ 9911, IBM Almaden Research Center, 1994.
[6] J.C. Franchitti and R. King. Amalgame: a tool for creating interoperating, persistent, heterogeneous components. Advanced Database Systems, pages 313-36, 1993.
[7] Y. Papakonstantinou, H. Garcia-Molina, and J. Widom. Object exchange across heterogeneous information sources. In Proc. ICDE Conf., pages 251-60, 1995.
[8] R. Ahmed et al. The Pegasus heterogeneous multidatabase system. IEEE Computer, 24:19-27, 1991.
[9] W. Litwin, L. Mark, and N. Roussopoulos. Interoperability of multiple autonomous databases. ACM Computing Surveys, 22:267-293, 1990.
[10] W. Kim et al. On resolving schematic heterogeneity in multidatabase systems. Distributed and Parallel Databases, 1:251-279, 1993.
[11] A. Gupta and I.S. Mumick. Maintenance of materialized views: problems, techniques, and applications. IEEE Data Engineering Bulletin, Special Issue on Materialized Views and Data Warehousing, 18(2):3-18, June 1995.
[12] T.K. Sellis. Multiple-query optimization. ACM Transactions on Database Systems, 13(1):23-53, March 1988.
[13] P.V. Hall. Common subexpression identification in general algebraic systems. Technical Report UKSC 0060, IBM United Kingdom Scientific Centre, November 1974.
[14] P.V. Hall. Optimization of a single relation expression in a relational data base system. IBM Journal of Research and Development, 20(3):244-257, May 1976.
[15] M. Jarke. Common subexpression isolation in multiple query optimization. In Query Processing in Database Systems, pages 191-205, 1984.
[16] J. Grant and J. Minker. Optimization in deductive and conventional relational database systems. In Advances in Data Base Theory, volume 1, 1981.
[17] U.S. Chakravarthy and J. Minker. Processing multiple queries in database systems. Database Engineering, 5(3):38-44, September 1982.
[18] K. Shim, T. Sellis, and D. Nau. Improvements on a heuristic algorithm for multiple-query optimization. Data & Knowledge Engineering, 12:197-222, 1994.

