Semantic Integration of Semi-Structured Data

Seung-Jin Lim, Yiu-Kai Ng
Computer Science Department, Brigham Young University, Provo, Utah 84602, U.S.A.
Email: fng, [email protected]

Abstract
As data access beyond traditional intranet boundaries becomes common on the Internet, there is an increasing demand for an integrated and uniform method for accessing, through the Web, existing data sources that differ in structure and semantics. This demand is driven partly by users who want access to more diverse information, such as up-to-date information on the stock market, entertainment, news, and science, and partly by information providers who offer information services to customers on the Web. In this paper, we present an approach to integrating semi-structured data sources and structured data sources through automated structure resolution. The structure resolution approach can easily be adopted to i) integrate existing relations in the relational database model into semi-structured data sources, and ii) merge sets of semi-structured data that have different structures with no human intervention. Integrating multiple data sources with our approach yields the unified view (UV) of the data sources, which is presented in an XML DTD format. A UV can be used for query optimization on heterogeneous data sources.
Keywords: semi-structured data, integration, unified view, heterogeneous/hierarchical data, XML, relational database constraints
1 Introduction

The Web can be perceived as a global network of heterogeneous data sources. As Web technologies drive data access beyond intranet boundaries, we witness an increasing demand for integrated and uniform methods for accessing data available on the Internet, regardless of the differences in the underlying data models and structures in which the data are stored (see Footnote 1). There is a wide spectrum of approaches to meeting this demand.

Footnote 1: One piece of evidence is the advent of widely used, powerful technologies, such as servlets with JDBC, which are now used more often than CGI for accessing databases in a platform-independent manner. Prior to the advent of this kind of technology, database access from the Web was significantly limited.

At one end of the spectrum is the integration of different data sources by providing a precompiled global view of the data [LC94]. A data warehouse repository [Wid95] is such an approach, where data of interest from different sources are maintained at a central location as materialized views. The `World View' in the Information Manifold [LRO96] is another example, where a set of information sources are `tagged' for efficient information searching before the information is served to the user. In [BCV99, CA99], semi-structured and structured data sources are analyzed beforehand, and the data fields that are declared using different terminology but carry the same or closely related information are manually integrated to provide an integrated view to the user. [LN99] propose an empirical algorithm to convert HTML tables (whose data are presented in a two-dimensional tabular form) into semi-structured data without human intervention.

At the other end of the spectrum is the approach that processes queries against heterogeneous data sources on the fly. In this approach, queries are decomposed, the resulting subqueries are sent to the relevant data sources, and the answers to the subqueries are merged afterwards. The approach in [Wie92] falls into this category. Comparatively few research efforts adopt this approach, since processing queries on demand against heterogeneous data sources is in general more time consuming than processing queries against pre-configured repositories, and integrated systems based on this approach may suffer from lag. Between these two approaches are hybrid approaches, where data that do not change over time are pre-configured in a repository, whereas data that change frequently are fetched on demand [MW97].
In either approach to integrating heterogeneous data sources, the architecture of the integrated system tends to be a typical 3-tier architecture: i) the user interface at the top, which composes and initiates queries, ii) heterogeneous data sources at the bottom, and iii) the middle tier, which processes queries and controls data flows. The middle tier is typically called the integrator or mediator, and it requires human intervention for the analysis of data sources and the refinement of integrated views of data until the system becomes operational. In addition, integrated views of the underlying data are often given in a proprietary format, such as OEM [PGMW95] and ODL_I3 [BCV99].

In this paper, we propose an automated approach to generating global views, called unified views, of semi-structured and structured data sources. During the process of creating unified views:

i) We integrate semi-structured and structured data sources by using our semi-structured data model (as defined in Section 2), which is suitable for capturing both types of data. Two potential problems due to structural differences make the integration process challenging: a) Data sources inherently have different structures if they come from different data models, e.g., data in the relational data model conform to a tabular structure, whereas data in the semi-structured data model are restricted to a graph structure. We call the process of structure resolution of data sources from different data models inter-structure resolution. b) Even data sources of the same (semi-)structured data model are unlikely to have the same graph structure in terms of the presentation of data if they have been created by different designers. We call the process of structure resolution of data sources from the same data model intra-structure resolution.
ii) Besides integrating heterogeneous data sources, we provide the unified view of the integration using the XML grammar [BPSM98], in particular the DTD format, rather than a proprietary format. Since XML is a W3C standard and is widely used in electronic data exchange, the XML grammar is well understood and well accepted by the database/software community. XML is also platform independent and widely supported, which makes unified views written in XML easier to adopt by other data-managing applications on various platforms. Unified views can be integrated into a system to resolve the structural differences among semi-structured data and relations in the relational data model, or among sets of semi-structured data of different structures.

iii) We eliminate human intervention in generating unified views so that the process of computing unified views can be fully automated and hence can easily be adopted into a system that requires the integration of semi-structured and/or structured data, such as MOMIS [BCV99].

We proceed to present our results as follows. In Section 2, we introduce the hierarchical data graph, which is our data model for encapsulating semi-structured and structured data. In Section 3, we discuss our automated approach for generating the unified view of different data sources. In Section 4, we provide a few query examples to illustrate the usage of unified views. In Section 5, we give concluding remarks.
2 Preliminaries

Data are regarded as semi-structured if they do not have a rigid structure/schema or are unstructured. Examples of semi-structured data are data in LaTeX files, BibTeX files, UNIX environment files, and HTML and XML documents. We represent semi-structured data as a directed graph, called a hierarchical data graph (HDG). A core construct of an HDG is the "consists of" relationship. In this paper, we say that A consists of B if B is a member of A, A is "more generic" than B, or A is "conceptually" higher than B. When A consists of B, we call A the container and B the content.
Definition 1 Given the semi-structured data $D$, the hierarchical data graph $HDG = (V, E, g)$ of $D$, denoted $HDG_D$, is a rooted graph, where $V_R \in V$ is the root node, labeled by $D$, and every other node in $V$ represents a data item in $D$ and is labeled by the name of the respective data item. $E$ is a finite set of directed edges. $g: E \rightarrow V \times V$ is a function such that $g(e) = (v_1, v_2)$, denoted $v_1 \rightarrow v_2$, if the data item represented by $v_2$ consists of the data item represented by $v_1$. □

Intuitively, an HDG represents a set of data items that are hierarchically related. For instance, a tricycle consists of a body frame and three wheels, and hence in the HDG that represents a tricycle, three edges exist from `wheel' to `tricycle' and another edge exists from `frame' to `tricycle.' Thus, `tricycle' is the container of `frame' and `wheel.'

We impose a uniqueness constraint on the node names (i.e., labels) in an HDG, so the name of a node is unique among the nodes in an HDG. If more than one sibling node has the same name o, we distinguish them by suffixing the corresponding data item name with a consecutive number, such as o[1], o[2], and so forth, from left to right, according to the order of their appearance in the respective HDG. (The i in o[i] is called the ordering specifier.) We surround a node name with a pair of single quotes whenever the name includes spaces, to avoid confusion about where the name ends.

Figure 1: Three semi-structured data sources. (a) Consumer finance; (b) Consumer finance; (c) Credit card.
Example 1 Suppose there are three consumer finance consulting companies, which consult on checking accounts, VISA cards, and Master cards, respectively, and which are to be merged. The three semi-structured data sources, given as the HDGs shown in Figure 1, are the data sets in the respective semi-structured model. The HDG in Figure 1(a) describes `consumer finance' information consisting of `checking account' information, which in turn consists of information on different customers. Since there are three customers, they are tagged as customer[1], customer[2], and customer[3], respectively, and each consists of name, `social security', and optional city and salute. □

As illustrated in Figure 1, if a data item consists of more than one other data item, the corresponding node in an HDG has more than one incoming edge. We employ the notion of a full name to denote the data hierarchy of a node in an HDG.
Definition 2 Given a hierarchical data graph $G$, the full name of a node $v_n$ in $G$ is the path from $v_n$ to the root node $v_1$ of $G$, denoted $.v_1.v_2.\cdots.v_n$ ($n \geq 1$), where a dot `.' is used as the delimiter between adjacent nodes in the full name of $v_n$, $v_1$ is the root node, and $v_i$ is the parent of $v_{i+1}$ ($1 \leq i \leq n-1$). Any subpath of a full name from $v_n$ to $v_i$ ($1 < i \leq n$), inclusively, i.e., $v_i.v_{i+1}.\cdots.v_{n-1}.v_n$, is called a partial name of $v_n$ and is represented without the leading dot. □

Note that the full name of node $v$ uniquely distinguishes $v$ from other nodes in an HDG. Since the length of a partial name of $v$ can be one by Definition 2, the node name of $v$ is itself a partial name of $v$ in our data model. A full name (partial name, respectively) is also called a full branch (partial branch, respectively). We adopt the notation $.A.(A_1, \ldots, A_n)$, called the compound full name, for a list of full names $(.A.A_1, \ldots, .A.A_n)$. The same notation can be adopted for a list of partial names as well. Using this notation, we can denote a (sub)tree in an HDG using a string. If a list of full (partial, respectively) names is a singleton list, we may omit the outer parentheses. From now on, we simply use the term "name" for a full name or a partial name, unless stated otherwise.

Figure 2: Unified view generation process
Example 2 Consider Figure 1(a) again. By Definition 2, the full name of frank is .`consumer finance'.`checking account'.customer[1].name.frank, and that of s1, a social security number, is .`consumer finance'.`checking account'.customer[1].`social security'.s1. The full names of frank, s1, and delta, a city name, can be denoted as .`consumer finance'.`checking account'.customer[1].(name.frank, `social security'.s1, city.delta) in a compound form, where the outer parentheses are omitted since there is only one compound full name in the list. As illustrated in the figure, each full name forms a branch in its HDG. □
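The naming conventions of this section (ordering specifiers, quoted names, full names, and partial names) can be illustrated with a minimal Python sketch. The helper names `label_siblings`, `quote`, `full_name`, and `partial_names` are hypothetical and are not part of the paper; the sketch assumes a node is identified by the list of node names on its path from the root.

```python
from collections import Counter


def label_siblings(names):
    """Append ordering specifiers [1], [2], ... to names repeated among siblings."""
    totals = Counter(names)
    seen = Counter()
    labeled = []
    for name in names:
        if totals[name] > 1:
            seen[name] += 1
            labeled.append(f"{name}[{seen[name]}]")
        else:
            labeled.append(name)
    return labeled


def quote(name):
    """Surround a node name with single quotes if it includes spaces."""
    return f"`{name}'" if " " in name else name


def full_name(path):
    """Full name of the last node on a root-to-node path (leading dot included)."""
    return "." + ".".join(quote(n) for n in path)


def partial_names(path):
    """Partial names of the last node: suffixes of the path that exclude the root."""
    return [".".join(quote(n) for n in path[i:]) for i in range(1, len(path))]


# Path taken from Example 2 (Figure 1(a)).
path = ["consumer finance", "checking account", "customer[1]", "name", "frank"]
print(label_siblings(["customer", "customer", "customer"]))
print(full_name(path))
print(partial_names(path))
```

The last partial name in the printed list is the node name itself, which matches the remark that a node name is a partial name of length one.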
3 Unified View

In this section, we present our approach for generating the unified view of a set of semi-structured and structured data sources. Our discussion of unified views covers i) the automated transformation of relations in the relational data model into HDGs, which is the first step of the structure resolution process and takes advantage of the semi-structured data model, ii) the automated second step of the structure resolution, which resolves structural differences among the semi-structured data items in an HDG, and iii) the automatic generation of the DTD of the HDG computed in step ii), which yields the unified view of the given data sources. The entire process of unified view generation is depicted in Figure 2.

We use the Consumer finance and Credit card data sets in Figure 1 as semi-structured data sources in different examples throughout this paper. As structured data, we use the four relations customer, has account, account, and address in the database checking account, as shown in Figures 3 and 4. Note that the social security (i.e., SSNO) attribute in relation customer is the primary key of customer and a foreign key in has account and address. Accnt number (i.e., Acct#) in account is the primary key of account and a foreign key in has account.
Figure 3: Relationships among relations customer, has account, and account

customer
  name    SSNO
  Frank   s1
  John    s7

has account
  SSNO    Acct#   since
  s1      a1      1999
  s1      a3      1998
  s7      a2      1997

account
  Acct#   balance
  a1      80
  a2      30
  a3      50

address
  SSNO    street  city
  s1      12 so   roy
  s7      34 no   elm

Figure 4: Relations customer, has account, account, and address in the database checking account

3.1 Resolution of structures at the data source level
The unified view of a set of semi-structured and structured data sources is a semi-structured view of the source data, since data integrated from multiple sources tend to be irregular and incomplete [PGMW95]. Transforming semi-structured data, such as HTML documents and HTML tables, into a unified view and representing it graphically as an HDG has been discussed in [LN99, LN01]. In this paper, we first consider transforming structured data, in particular relations in the relational model, into an HDG. In transforming a relation into an HDG, we consider the following differences between the two data structures:

Scheme. Relationships among data items in a relation are specified by the corresponding relation scheme, whereas relationships among data items in an HDG are self-describing and determined by the container-content relationships among data items. Hence, we consider how to convert a relation scheme $S$ into hierarchical data that are related by container-content relationships in order to transform $S$ into the resulting HDG.

Attributes and tuples. In a relation, there are notions of attributes and tuples, whereas there are no such distinct notions in an HDG. Hence, the integration of attributes and tuples in a relation into an HDG is non-trivial.

Name distinction. Let $a_{i,j}$ be the data item of the $j$th attribute $A_j$ in the $i$th tuple of a relation $r$ in database $D$. If $A_j$ is a superkey, then $a_{i,j}$ is unique among all the data items under $A_j$ in $r$. Even if $A_j$ is not a superkey, $a_{i,j}$ can still be distinguished from any data item that resides outside $r$, since each relation has a unique name in $D$. In an HDG, a data item is uniquely identified by its full name. Hence, we consider integrating the database names, relation names, and attribute names into the full names of data items in an HDG.

Integrity constraints. During the process of transforming relations into HDGs, we preserve the domain, entity, and referential integrity constraints of the source relations.
3.1.1 Unit transformation

The differences between relations and HDGs discussed earlier are taken into account in the following definition, which is applied to each participating relation to transform it into an HDG.
Definition 3 (Unit Transformation). Given an $n$-ary relation $r(R)$ on the relation scheme $R = (A_1, \ldots, A_n)$ with $m$ tuples $\{(a_{1,1}, \ldots, a_{1,n}), (a_{2,1}, \ldots, a_{2,n}), \ldots, (a_{m,1}, \ldots, a_{m,n})\}$ in database $D$, the hierarchical data graph $HDG = (V, E, g)$ of $r$, denoted $HDG_r$, is a directed graph, where

$V_R \in V$ is the root node of $HDG_r$, named after $D$, and each node $v \in V$, other than $V_R$, denotes

$$
\begin{cases}
\text{a tuple } r[i] & \text{if } v \text{ is the } i\text{th child node of } V_R, \\
A_j,\ 1 \leq j \leq n & \text{if } v \text{ is an internal node which is neither } V_R \text{ nor a child node of } V_R, \\
\text{a non-null } a_{i,j} & \text{if } v \text{ is a leaf node,}
\end{cases}
\qquad 1 \leq i \leq m,\ 1 \leq j \leq n.
$$

$E$ is a finite set of directed edges. $g: E \rightarrow V \times V$ is a function such that $v_2 \succeq v_1$ if $g(e) = (v_1, v_2)$, and $D \succeq r[i] \succeq A_j \succeq a_{i,j}$ ($1 \leq i \leq m$, $1 \leq j \leq n$) hold, where $x \succeq y$ reads "$x$ is the container of $y$." □
Since a database $D$ is often considered the `container' of relations, views, and so forth in commercial relational database management systems such as Oracle, and since $D$ tends to be named to indicate what its content is about, we choose $D$ as the root node of the corresponding $HDG_r$, i.e., $D$ is chosen to be the top-most `container' of the data items in $HDG_r$. Similarly, we choose the tuples in $r$ as the immediate contents of the root node $D$, and hence $HDG_r$ contains subtrees rooted at each of $r[1], \ldots, r[m]$ with the constraints $D \succeq r[i] \succeq A_j$ ($1 \leq i \leq m$, $1 \leq j \leq n$), where $r[i]$ denotes the $i$th tuple in $r$ and $\succeq$ relates a container to its content. $D \succeq r[i]$ because relation $r$ is one of the relations in $D$ and $r$ is a set of tuples. Furthermore, $r[i] \succeq A_j \succeq a_{i,j}$ because each tuple of $r$ satisfies the relation scheme of $r$ and serves as the container of a set of attributes $A_1, \ldots, A_n$, and attribute $A_j$ has data item $a_{i,j}$ as its content.

Figure 5 shows a relation $r$ and its corresponding HDG constructed according to Definition 3. As shown in the figure, the relation scheme of $r$ is integrated in each subtree rooted at $r[i]$ ($1 \leq i \leq m$) in the HDG. Since an HDG is not in a tabular form, it has no notions of tuples and attributes as such. However, by Definition 3, each tuple and attribute of a relation is mapped to an HDG in a consistent manner. As shown in Figure 5, the $i$th tuple in $r$ is mapped to a set of sibling leaf nodes $a_{i,1}, \ldots, a_{i,n}$ of the subtree rooted at $r[i]$, whereas every attribute of the relation is replicated over the entire tree such that the value of attribute $j$ in tuple $i$ appears as the $j$th leaf node of the $i$th subtree rooted at $r[i]$. The following example demonstrates how Definition 3 is applied to the relations in Figure 4.
Figure 5: A relation r in database D and its corresponding hierarchical data graph HDGr
Example 3 Consider relation customer in Figure 4. By Definition 3, the database name `checking account' forms the root of the resulting HDG, and the attribute names name and social security are replicated in the subtree rooted at customer[i] ($1 \leq i \leq 2$). Also, each attribute value is attached to its corresponding attribute in the respective subtree. For example, frank is attached as a child node of name in the subtree rooted at customer[1]. The resulting hierarchical data graph of customer is shown in Figure 6(a). Relations has account, account, and address are converted into HDGs accordingly, as shown in Figures 6(b), 6(c), and 6(d), respectively. □
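As a hedged illustration of Definition 3, the following Python sketch turns a relation into the list of root-to-leaf branches of its HDG. The function names `unit_transform` and `as_full_name` and the list-of-names representation of a branch are assumptions made for this sketch, not the paper's implementation; the attribute values are written in lowercase as in Example 3.

```python
def unit_transform(database, relation, attributes, tuples):
    """Return the branches D -> r[i] -> A_j -> a_ij of HDG_r as lists of node names."""
    branches = []
    for i, row in enumerate(tuples, start=1):
        for attribute, value in zip(attributes, row):
            if value is None:
                # Definition 3 admits only non-null values as leaf nodes.
                continue
            branches.append([database, f"{relation}[{i}]", attribute, str(value)])
    return branches


def as_full_name(branch):
    """Render a branch as a full name, quoting node names that include spaces."""
    return "." + ".".join(f"`{n}'" if " " in n else n for n in branch)


# Relation customer from Figure 4.
customer = unit_transform(
    "checking account",
    "customer",
    ["name", "social security"],
    [("frank", "s1"), ("john", "s7")],
)
for branch in customer:
    print(as_full_name(branch))
```

Each tuple contributes one branch per non-null attribute value, so the attribute names are replicated under every customer[i] subtree, as described in Example 3.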
3.1.2 Referential integrity among source relations

Applying Definition 3 to source relations, such as the relations in Figure 4, yields HDGs that are homogeneously structured data sets with possibly different container-content relationships. This structural resolution process includes transforming schemes, tuples, and attributes into their respective HDGs in a consistent manner. In this section, we discuss how referential integrity constraints among the participating relations are preserved during our data source-level structural resolution process.

The referential integrity constraint is a significant design issue among relations in the relational data model. It comes into play when changes (i.e., addition, deletion, and modification) are made to a relation that references, or is referenced by, another relation. In the semi-structured data model, the notion of decomposition and of referencing among the decomposed components is weak or rare (see Footnote 2). Since semi-structured data are self-descriptive in nature, all the related data and their constraints tend to be contained in the same set of semi-structured data. We preserve the referential integrity constraint in relations by integrating the data involved in the references into the same HDG, in order to avoid any potential violation of referential integrity.

Footnote 2: XML documents, which are semi-structured in nature, support referencing from one element to another (internal or external) element by using the ID attribute and namespace.

As an illustration, consider relations customer and address, where social security in customer is a foreign key in address. Suppose node customer[2] is deleted from the HDG in Figure 6(a), and hence the subtree rooted at customer[2] is removed from $HDG_{customer}$. Then the subtree rooted at address[2] in $HDG_{address}$ becomes `dangling.' Hence, we must propagate the deletion in the referenced relation (i.e., customer) to the referencing relation (i.e., address) in order to preserve the referential integrity constraint.
Figure 6: HDGs created by the unit transformation of customer, has account, account, and address. (a) $HDG_{customer}$; (b) $HDG_{has\ account}$; (c) $HDG_{account}$; (d) $HDG_{address}$.

We differentiate three distinct cases of preserving the referential integrity constraint during the transformation from relations to HDGs:

1. No reference. If relation $r$ does not reference any other relation, simply apply the unit transformation (Definition 3) to $r$.

2. Single reference. Let $H_{r_a}$ be the HDG of an $m$-ary relation $r_a$ with relation scheme $\{A_1, \ldots, A_m\}$ and primary key $K \subseteq \{A_1, \ldots, A_m\}$. Also, let $H_{r_b}$ be the HDG of an $n$-ary relation $r_b$ with relation scheme $\{B_1, \ldots, B_n\}$. Suppose a subset $F \subseteq \{B_1, \ldots, B_n\}$ references the primary key $K$ of $r_a$ and references no other relation. Then, for any $i, j$ such that the value of $K$ under $r_a[i]$ of $H_{r_a}$ is the same as the value of $F$ under $r_b[j]$ of $H_{r_b}$, move the subtree rooted at $r_b[j]$ in $H_{r_b}$ to $H_{r_a}$ so that $r_b[j]$ becomes a new child node of $r_a[i]$, and delete the subtrees rooted at $F$ under $r_b[j]$. Furthermore, delete $K$ and its value from the relocated subtree rooted at $r_b[j]$ to avoid replication of $K$ and its value under $r_a[i]$. Repeat this process for any relation with a single reference from $r_b$, and remove the root node of $H_{r_b}$ at the end of the process.

Example 4 Consider relations customer and address in Figure 4 and their corresponding $HDG_{customer}$ and $HDG_{address}$ in Figure 6. Note that social security in customer is a foreign key in address, and address does not reference any relation besides customer. Since the values of `checking account'.customer[1].social security in $HDG_{customer}$ and `checking account'.address[1].social security in $HDG_{address}$ are the same (i.e., `s1'), we move the subtree rooted at address[1] to $HDG_{customer}$ under customer[1]. As a result, there is a new subtree rooted at `checking account'.customer[1].address in $HDG_{customer}$. Note that we no longer need the ordering specifier for node address in this subtree because there is no other sibling node of address with the same name. In addition, we delete the subtree rooted at `checking account'.customer[1].address.social security. Similarly, we move the subtree rooted at address[2] in $HDG_{address}$ to customer[2] in $HDG_{customer}$. Part of the resulting $HDG_{customer}$ is shown in Figure 7. We now see that if the value of social security of customer[i] is changed, the change automatically propagates to the address of customer[i]. Note that after the migration the root node `checking account' is the only remaining node in $HDG_{address}$ and thus is removed.
3. Multiple references. Occasionally, a relation references more than one other relation. In this case, whenever there is a change in a referencing subtree $S$, we retain $HDG_c$ of the referencing relation as it is, rather than moving $S$ from $HDG_c$ to $HDG_p$ of the referenced relation. Hence, we create new edges between $HDG_p$ and $HDG_c$ as follows: Let $H_{r_a}$ be the HDG of an $m$-ary relation $r_a$ with relation scheme $\{A_1, \ldots, A_m\}$ and primary key $K \subseteq \{A_1, \ldots, A_m\}$, and let $H_{r_b}$ be the HDG of an $n$-ary relation $r_b$ with relation scheme $\{B_1, \ldots, B_n\}$. Suppose a subset $F \subseteq \{B_1, \ldots, B_n\}$ references the primary key $K$ of $r_a$, and $r_b$ also references the primary key of another relation besides $r_a$. Then, for any $i, j$ such that the value of $K$ under $r_a[i]$ of $H_{r_a}$ is the same as the value of $F$ under $r_b[j]$ of $H_{r_b}$, create an edge from $r_b[j]$ in $H_{r_b}$ to $r_a[i]$ in $H_{r_a}$, and remove the subtrees rooted at $F$ under $r_b[j]$ (to eliminate the replicated primary key value under $r_a[i]$). Repeat this process for any other relation referenced by $r_b$, and remove the root node of $H_{r_b}$.

One might consider duplicating the referencing subtrees in $H_{r_b}$ in each HDG of the relations that $r_b$ references, in a similar manner as we do for a single reference. We are reluctant to do so for various reasons, such as the need to synchronize the duplicated subtrees in an HDG.

Note that we do not expect regular changes at the scheme level, since a relational database scheme is relatively static compared with the content of the corresponding database. For this reason, we do not consider changes applied at the scheme level, even though they should be relatively easy to handle.
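The single-reference and multiple-reference cases can be contrasted with a minimal Python sketch over the relations of Figure 4. The sketch is an assumption-laden illustration rather than the paper's algorithm: each HDG is simplified to a dictionary that maps a tuple node (e.g., customer[1]) to its attribute-value pairs, keys are assumed to consist of a single attribute, and the function names `merge_single_reference` and `link_multiple_references` are hypothetical.

```python
def merge_single_reference(parents, children, child_name, key):
    """Case 2: nest each referencing tuple under the referenced tuple with the same key."""
    merged = {name: dict(attrs) for name, attrs in parents.items()}
    for attrs in children.values():
        for parent_attrs in merged.values():
            if parent_attrs.get(key) == attrs[key]:
                # Drop the key from the relocated subtree to avoid replicating it.
                relocated = {a: v for a, v in attrs.items() if a != key}
                parent_attrs.setdefault(child_name, []).append(relocated)
    for parent_attrs in merged.values():
        subtrees = parent_attrs.pop(child_name, [])
        if len(subtrees) == 1:
            parent_attrs[child_name] = subtrees[0]  # lone child: no ordering specifier
        else:
            for i, subtree in enumerate(subtrees, start=1):
                parent_attrs[f"{child_name}[{i}]"] = subtree
    return merged


def link_multiple_references(children, referenced):
    """Case 3: keep referencing tuples in place and add edges to the referenced tuples."""
    edges, remaining = [], {}
    for child, attrs in children.items():
        kept = dict(attrs)
        for parents, key in referenced:
            for parent, parent_attrs in parents.items():
                if parent_attrs.get(key) == attrs.get(key):
                    edges.append((child, parent))  # edge from referencing to referenced
                    kept.pop(key, None)            # remove the replicated key subtree
        remaining[child] = kept
    return edges, remaining


customer = {"customer[1]": {"name": "frank", "social security": "s1"},
            "customer[2]": {"name": "john", "social security": "s7"}}
address = {"address[1]": {"social security": "s1", "street": "12 so", "city": "roy"},
           "address[2]": {"social security": "s7", "street": "34 no", "city": "elm"}}
account = {"account[1]": {"accnt number": "a1", "balance": "80"},
           "account[2]": {"accnt number": "a2", "balance": "30"},
           "account[3]": {"accnt number": "a3", "balance": "50"}}
has_account = {
    "has account[1]": {"social security": "s1", "accnt number": "a1", "since": "1999"},
    "has account[2]": {"social security": "s1", "accnt number": "a3", "since": "1998"},
    "has account[3]": {"social security": "s7", "accnt number": "a2", "since": "1997"},
}

nested = merge_single_reference(customer, address, "address", "social security")
edges, remaining = link_multiple_references(
    has_account, [(customer, "social security"), (account, "accnt number")]
)
print(nested["customer[1]"])
print(edges)
```

In the nested result a change to a customer's key no longer has a separate address copy to fall out of sync with, while the edge list keeps the has account tuples in one place instead of duplicating them under both customer and account.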
Example 5 Consider relations customer, account, and has account in Figure 4 and their corresponding HDGs $HDG_{customer}$, $HDG_{account}$, and $HDG_{has\ account}$ in Figure 6 again. Note that social security in customer and accnt number in account are foreign keys in has account. Since the value of `checking account'.customer[1].social security in $HDG_{customer}$ and the value of `checking account'.has account[1].social security in $HDG_{has\ account}$ are the same (i.e., `s1'), we create a new edge from has account[1] in $HDG_{has\ account}$ to `checking account'.customer[1] in $HDG_{customer}$, and the subtree rooted at social security, which is a child node of has account[1], is removed. Also, since the value of `checking account'.account[1].accnt number in $HDG_{account}$ and the value of `checking account'.has account[1].accnt number in $HDG_{has\ account}$ are the same (i.e., `a1'), we create a new edge from has account[1] in $HDG_{has\ account}$ to `checking account'.account[1], and the subtree rooted at accnt number, which is a child node of has account[1], is removed from $HDG_{has\ account}$. The resulting HDG is shown in Figure 7. □

Figure 7: Integrated HDGs by references among the source relations

3.2 Resolution of structures at the branch level
By the structure resolution process performed at the data source level, as discussed in Section 3.1, any relations in the relational data model and semi-structured data sources can be converted into their corresponding HDGs, and hence the data sources are homogeneous in structure after the transformation. In this section, we present the next processing step, which involves structure resolution among branches in different HDGs by merging branches if they satisfy certain criteria. Whether two branches can be merged depends on their structures and node names. If branches can be merged, the final unified view of these branches is structurally more concise and semantically more tightly coupled, in terms of the number of branches and nodes in the final view. Ultimately, we are interested in merging any two branches of arbitrary height ($\geq 0$) (see Footnote 3). We first consider a simpler problem: can two branches with height 0 (i.e., simple nodes) be merged? Our goal is to merge as many branches as possible. The merging condition (Axiom 1, defined below), which is based on the partial ordering relation contains, states our strategy for branch-level structure resolution. Prior to presenting the axiom, we give the definition of contains.

Footnote 3: The height of a branch $b$ is the number of edges from the leaf node to the root node of $b$, and $b$ with height 0 is called a simple node since there is no node other than the root node of $b$.

Definition 4 Given two names $n_1$ and $n_2$, $n_1$ contains $n_2$, denoted $n_1 \succeq n_2$, i) if $n_1$ is a hypernym of $n_2$, i.e., $n_2$ is a hyponym (see Footnote 4) of $n_1$, or ii) if $n_1$ is an ancestor of $n_2$ in an HDG. □

Footnote 4: A large set of hypernyms and hyponyms can be found in WordNet [Mil95].
Example 6 Consider a set of two branches $S = \{.a.b.c.d, .b.e\}$ in an HDG. By Definition 4, we obtain a partially ordered set $(S, \succeq) = \{(a, b), (a, c), (a, d), (b, c), (b, d), (c, d), (b, e)\}$, which includes transitive pairs of elements in the two branches. In another set $T$, which consists of the elements `bronco,' `mustang,' and `pony,' the ordered pairs $\{(mustang, bronco), (pony, bronco), (pony, mustang)\}$ on $T$ satisfy the "containment" constraint, i.e., $\succeq$, according to WordNet. This ordering is derived from the fact that a bronco is an unbroken or imperfectly broken mustang, a mustang is a small hardy range horse of the western plains, and a pony is a range horse in the western United States. □
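The two sources of the contains relation in Definition 4 can be sketched in Python as follows. This is a minimal sketch under stated assumptions: a branch is given as a dot-delimited string whose node names contain no dots, hypernym pairs are supplied by hand rather than looked up in WordNet, and the function names `ancestor_pairs` and `transitive_closure` are hypothetical.

```python
from itertools import combinations


def ancestor_pairs(branch):
    """All (ancestor, descendant) pairs along one branch, transitive pairs included."""
    nodes = [name for name in branch.split(".") if name]
    # combinations() keeps the input order, so the first element is always the ancestor.
    return set(combinations(nodes, 2))


def transitive_closure(pairs):
    """Close a set of (container, content) pairs under transitivity."""
    closure = set(pairs)
    while True:
        extra = {(a, d) for a, b in closure for c, d in closure if b == c} - closure
        if not extra:
            return closure
        closure |= extra


# The set S of Example 6.
S = ancestor_pairs(".a.b.c.d") | ancestor_pairs(".b.e")
print(sorted(S))

# The set T of Example 6, starting from two hand-supplied hypernym pairs.
T = transitive_closure({("pony", "mustang"), ("mustang", "bronco")})
print(sorted(T))
```

The closure adds (pony, bronco) to the two supplied pairs, giving the three ordered pairs listed for $T$.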
Axiom 1 (Merging condition) Any two branches $.a_1.\cdots.a_m$ and $.b_1.\cdots.b_n$ ($m, n \geq 1$) of different HDGs can be merged to yield a single branch i) $.a_1.\cdots.a_m$ if $m = n$ and their full names are identical, i.e., $a_i = b_i$ ($1 \leq i \leq m$), or ii) $.a_1.\cdots.a_m.b_1.\cdots.b_n$ ($.b_1.\cdots.b_n.a_1.\cdots.a_m$, respectively) if $a_m \succeq b_1$ ($b_n \succeq a_1$, respectively). □

Axiom 1 implies that in order to merge any two branches in different HDGs, the branches should either be identical in terms of their structures and node names, or the partial ordering contains ($\succeq$) should hold between the two branches. A cyclic partial ordering is considered unacceptable in our integration process. Note that Axiom 1 considers branches of different HDGs. We assume that each data source was properly modeled, and hence if multiple data items with the same name appear in an HDG, we assume that they are meant to be different and should not be merged. For instance, customer[1], customer[2], and customer[3] in Figure 1(a) are meant to be different, and they are distinguished from one another by the numerical labels attached to their (same) item name. They cannot be merged because it does not make sense to combine the identities of different customers.
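Read literally, Axiom 1 amounts to a small decision procedure, sketched below in Python. The sketch assumes that a branch is a list of node names and that the contains relation is available as a set of (container, content) pairs; the function name `merge_branches` is hypothetical, and returning `None` for a cyclic ordering reflects the remark above that cyclic orderings are unacceptable.

```python
def merge_branches(a, b, contains):
    """Merge two branches per Axiom 1, or return None if they cannot be merged."""
    if a == b:
        return list(a)                     # Axiom 1(i): identical full names
    a_over_b = (a[-1], b[0]) in contains   # last node of a contains first node of b
    b_over_a = (b[-1], a[0]) in contains   # last node of b contains first node of a
    if a_over_b and b_over_a:
        return None                        # cyclic ordering: not acceptable
    if a_over_b:
        return a + b                       # Axiom 1(ii): b is appended below a
    if b_over_a:
        return b + a                       # Axiom 1(ii), respectively
    return None                            # neither condition of Axiom 1 holds


# Hypothetical contains pair in the spirit of the running consumer finance example.
contains = {("consumer finance", "credit card")}
print(merge_branches(["consumer finance"], ["credit card", "master"], contains))
print(merge_branches(["a", "b"], ["a", "b"], set()))
print(merge_branches(["a"], ["b"], {("a", "b"), ("b", "a")}))
```

The sketch covers only whole-branch merging as stated in the axiom; merging branches that share a common prefix or an interior node requires the node-by-node treatment that follows.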
Axiom 2 (Contradiction) Two branches $.a_1.\cdots.a_m$ and $.b_1.\cdots.b_n$ ($m, n \geq 1$) of different HDGs cannot be merged if $a_m \succeq b_1$ and $b_n \succeq a_1$. □

We now present a few lemmas that can be inferred from Axioms 1 and 2. These lemmas yield the constraints on branch-level structure resolution of different HDGs.
Lemma 1 (Identity) A simple node $a_1$ and another simple node $a_2$ can be merged to yield a node $a$ if $name(a_1) = name(a_2) = a$.

Proof Suppose $a_1$ and $a_2$ are simple nodes. Since $a_1$ and $a_2$ are simple nodes, they are branches with height 0 and hence are structurally identical. If their names are identical, then the two nodes have the identical full name $.a$. Hence, the two nodes can be merged to yield $.a$ by Axiom 1. □

Lemma 2 (Inclusion) A simple node $a_1$ and another simple node $a_2$, where $name(a_1) \neq name(a_2)$, can be merged to yield $.a_1.a_2$ if $a_1 \succeq a_2$ but $a_2 \not\succeq a_1$.

Proof Suppose $a_1$ and $a_2$ are two simple nodes with different names. Since $a_1$ and $a_2$ are simple nodes, they are branches with height 0 and hence are structurally identical. Since $a_1 \succeq a_2$ but $a_2 \not\succeq a_1$, $a_1$ and $a_2$ can be merged to yield a branch $.a_1.a_2$ by Axioms 1 and 2. □
Lemma 3 (Exclusion) Given a simple node $a_1$ and another simple node $a_2$, where $name(a_1) \neq name(a_2)$, $a_1$ and $a_2$ cannot be merged if neither $a_1 \preceq a_2$ nor $a_2 \preceq a_1$.

Proof Suppose $a_1$ and $a_2$ are simple nodes. Then they are branches with height 0 and hence are structurally identical. However, they have different names and do not satisfy `$\preceq$', since $\neg(a_1 \preceq a_2 \lor a_2 \preceq a_1) = (a_2 \not\preceq a_1 \land a_1 \not\preceq a_2)$. Hence, by Axiom 2, $a_1$ and $a_2$ cannot be merged. □

For merging branches with height greater than 0, we apply Lemma 1, 2, or 3 to each node in the branches iteratively, from the top level to the bottom level of the branches.

Axiom 3 (Termination condition) Given two branches $b_1 = .a_1.\cdots.a_m$ and $b_2 = .a'_1.\cdots.a'_n$ (either $m > 1$ and $n \geq 1$, or $m \geq 1$ and $n > 1$) of different HDGs, suppose each pair $(a_i, a'_i)$ in $.a_1.\cdots.a_{k_1}$ and $.a'_1.\cdots.a'_{k_2}$ ($1 \leq k_1 \leq m$, $1 \leq k_2 \leq n$, $1 \leq i \leq \min(k_1, k_2)$) satisfies Lemma 1, and neither $a_{k_1+1} \preceq a'_{k_2+1}$ nor $a'_{k_2+1} \preceq a_{k_1+1}$ holds. Then merge only $.a_1.\cdots.a_{k_1}$ and $.a'_1.\cdots.a'_{k_2}$ in $b_1$ and $b_2$, respectively. □

Axiom 3 can be thought of as an upside-down zipper that is broken somewhere. Once the broken point is reached, we cannot zip any further down, although the portion before the broken point matches perfectly.

In merging any arbitrary branches, the following lemma has to be satisfied.

Lemma 4 (Integrity) When two branches $b_1$ and $b_2$ are merged to yield a single branch $b_3$ with two leaf nodes, no pair of nodes in $b_3$ should violate the partial ordering `$\preceq$' as specified in $b_1$ and $b_2$.

Proof By Axioms 1, 2 and 3. □

We now give a few examples of merging branches in HDGs that satisfy the lemmas presented earlier, starting with branches that have very succinct node names, such as a and b. Note that whenever we say "a node p is moved from one location to another," it means that "the entire subtree rooted at p is moved from one location to another."

Consider the branches .a.b, .a.c, .b.c and .b.a of height 1. The following cases illustrate the process of merging these branches, where the symbol `$\oplus$' denotes the merging operation:

1. $.a.b \oplus .a.b \longrightarrow .a.b$ (by Lemma 1)
2. $.a.b \oplus .a.c \longrightarrow .a.(b, c)$ if $\neg(b \preceq c \lor c \preceq b)$ (by Lemmas 1 & 3)
3. $.a.b \oplus .a.c \longrightarrow .a.b.c$ if $b \preceq c$ (by Lemmas 1 & 2)
4. $.a.b \oplus .b.c \longrightarrow .a.b.c$ (by Lemma 1)
5. $.a.b \oplus .b.a \longrightarrow .a.b,\ .b.a$ (by Axiom 2)

In the following example, we demonstrate how to examine algorithmically whether the `$\preceq$' relation on two branches is free of contradiction.

Example 7 Consider branches $b_1$ = .a.b.c.d and $b_2$ = .c.d.b.a. From $b_1$, we obtain the poset $(b_1, \preceq) = \{(a, b), (a, c), (a, d), (b, c), (b, d), (c, d)\}$, and from $b_2$ the poset $(b_2, \preceq) = \{(c, d), (c, b), (c, a), (d, b), (d, a), (b, a)\}$. Several pairs of nodes in $(b_1, \preceq)$ and $(b_2, \preceq)$ contradict each other, such as $(a, b)$ in $(b_1, \preceq)$ and $(b, a)$ in $(b_2, \preceq)$. Hence, $b_1$ and $b_2$ cannot be merged, since the merging would violate Lemma 4. □
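The contradiction check of Example 7 can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: branches are written as dotted strings with simple node names, and the function names `branch_poset`, `contradictions`, and `can_merge` are hypothetical, not part of the approach as published.

```python
from itertools import combinations


def branch_poset(branch):
    """Return the ordered pairs (x, y) such that x lies above y in the branch."""
    nodes = [name for name in branch.split(".") if name]
    # combinations() keeps positional order, so x always precedes y.
    return set(combinations(nodes, 2))


def contradictions(b1, b2):
    """Pairs ordered one way in b1 and the opposite way in b2."""
    p1, p2 = branch_poset(b1), branch_poset(b2)
    return sorted((x, y) for (x, y) in p1 if (y, x) in p2)


def can_merge(b1, b2):
    """Two branches pass the check only if no pair of nodes is contradictory."""
    return not contradictions(b1, b2)
```

Under these assumptions, `can_merge(".a.b.c.d", ".c.d.b.a")` is false because pairs such as (a, b) are reversed in the second branch, whereas `can_merge(".a.b.c.d", ".b.e")` is true.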
Figure 8: Five different ways of merging .a.b.c.d and .b.e

The following example shows merging of branches of different heights.
Example 8 Consider branches $b_1$ = .a.b.c.d and $b_2$ = .b.e. Hence, $(b_1, \preceq) = \{(a, b), (a, c), (a, d), (b, c), (b, d), (c, d)\}$ as shown in Example 7, and $(b_2, \preceq) = \{(b, e)\}$. There is no contradiction in the `$\preceq$' relation among any pairs of nodes in $(b_1, \preceq)$ and $(b_2, \preceq)$. Thus, we can apply Lemmas 1, 2, and 3 to merge $b_1$ and $b_2$ in five different ways:

1. $.a.b.c.d \oplus .b.e \longrightarrow .a.b.(c.d, e)$ if $\neg(c \preceq e \lor e \preceq c)$ (by Lemmas 1 & 3)
2. $.a.b.c.d \oplus .b.e \longrightarrow .a.b.c.(d, e)$ if $c \preceq e \land \neg(d \preceq e \lor e \preceq d)$ (by Lemmas 1 & 3)
3. $.a.b.c.d \oplus .b.e \longrightarrow .a.b.e.c.d$ if $e \preceq c$ (by Lemmas 1 & 2)
4. $.a.b.c.d \oplus .b.e \longrightarrow .a.b.c.e.d$ if $c \preceq e \land e \preceq d$ (by Lemmas 1 & 2)
5. $.a.b.c.d \oplus .b.e \longrightarrow .a.b.c.d.e$ if $d \preceq e$ (by Lemmas 1 & 2)

Figure 8 illustrates the five different ways of merging .a.b.c.d $\oplus$ .b.e. □

As a result of the branch-level structure resolution, an HDG can be converted into a simpler HDG.
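The case analysis of Example 8 can be reproduced with a small Python sketch. It covers only the situation of that example: a chain-shaped branch, a shorter branch with a single node below the shared node, and a consistent set of extra ordering pairs. The names `transitive_closure` and `attach` are hypothetical, and the sketch is not the general merging procedure.

```python
def transitive_closure(pairs):
    """Smallest transitive relation containing the given (x, y) pairs."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for (x, y) in list(closure):
            for (u, v) in list(closure):
                if y == u and (x, v) not in closure:
                    closure.add((x, v))
                    changed = True
    return closure


def attach(chain, anchor, new, known):
    """Merge the branch .anchor.new into the branch given by `chain`.

    chain  -- node names of the longer branch, top to bottom
    anchor -- the node shared by both branches
    new    -- the single node below `anchor` in the shorter branch
    known  -- extra (x, y) pairs, each meaning that x precedes y
    """
    leq = transitive_closure(set(known) | set(zip(chain, chain[1:])))
    i = chain.index(anchor) + 1
    # Descend while the next node of the chain precedes the new node.
    while i < len(chain) and (chain[i], new) in leq:
        i += 1
    head, tail = chain[:i], chain[i:]
    if not tail:
        # Every node below the anchor precedes the new node: append it last.
        return "." + ".".join(head + [new])
    if (new, tail[0]) in leq:
        # The new node precedes the rest of the chain: insert it in between.
        return "." + ".".join(head + [new] + tail)
    # The two nodes are unrelated, so they become siblings.
    return "." + ".".join(head) + ".(" + ".".join(tail) + ", " + new + ")"
```

For instance, `attach(list("abcd"), "b", "e", [("c", "e"), ("e", "d")])` yields `.a.b.c.e.d`, which corresponds to case 4 of the example.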
Example 9 Consider the set S of three HDGs in Figure 1. Suppose the poset $(S, \preceq)$ = {(`consumer finance', `credit card'), (`credit card', visa)} is predefined for S. Let us first consider the HDGs in Figures 1(b) and 1(c). By the assumption (`consumer finance', `credit card') and Lemma 2, the root `Credit card' in Figure 1(c) can be attached to the HDG in Figure 1(b) to yield a merged HDG' = .`consumer finance'.(VISA. ..., `Credit card'.Master. ...). Next, the subtrees rooted at VISA and `Credit card' in HDG' should be examined further for any violation of `$\preceq$' in the merge. By Lemma 2 and the assumption (`credit card', visa), they can be modified to yield .`consumer finance'.`Credit card'.(VISA. ..., Master. ...), which is HDG''. We now consider the HDG in Figure 1(a) and HDG''. Since the HDG in Figure 1(a) and HDG'' have the same root node `consumer finance', and neither `checking account' $\preceq$ `Credit card' nor `Credit card' $\preceq$ `checking account' holds, they can be merged at the root node level to yield .`consumer finance'.(`checking account'.(...), `Credit card'.(VISA. ..., Master. ...)) by Lemma 1. Since the `$\preceq$' relationship does not hold between `checking account' and `Credit card', by Axiom 3, no further merging can be performed. The final HDG is shown in Figure 9. □

Figure 9: Integrated consumer finance

3.3 Generation of the DTD
As the final step of the unified view generation of heterogeneous data sources, we create an XML document type definition (DTD for short) for any unified view computed earlier by structure and name resolution. The benefits of using a DTD as a wrapper of the unified view include: i) a DTD consists of markup declarations that provide a grammar for a class of documents and hence can be used for describing the data structure of a unified view, ii) a DTD is well suited to describing semi-structured data because it comes from XML, which is semi-structured in nature, and iii) XML is widely used and the syntax of DTDs is well understood.

A DTD, which can be considered as an XML document, consists of any number of element declarations, attribute list declarations, entity declarations, notation declarations, processing instructions (which let the document carry instructions for applications), and comments. Among them, processing instructions and element declarations are the components required to describe a unified view as specified in a DTD. The other components can be considered optional. Processing instructions are used for specifying instructions for applications in an XML document; they begin with `<?' and end with `?>'. For example, <?xml version="1.0" encoding="UTF-8"?> is a typical processing instruction, where "UTF-8" can be replaced by other encoding names such as "UTF-16," "ISO-10646-UCS-2," or "ISO-10646-UCS-4."
A processing instruction is placed at the beginning of a DTD. An element declaration, on the other hand, is of the form <!ELEMENT Name contentspec>, where Name is the name of an element e that appears in the corresponding document class, and contentspec is the placeholder for the content specification of e. Contentspec is replaced by either i) (EMPTY), if e is an empty element, ii) (#PCDATA), if e contains data content, or iii) the list of the names of e's children, if e has child elements. Hence, contentspec is updated and refined during the process of generating a DTD for any given unified view.

We further discuss what should be considered in updating contentspec when e has child elements. Assume that a node with item name node exists in a unified view and that node has more than one child node with the same name, such as cnode[1], ..., cnode[n], 1 < n (as discussed in Section 2). In that case, cnode in the contentspec of <!ELEMENT node contentspec> can be immediately followed by an occurrence symbol `?', `*', or `+', which denotes that the preceding content particle, i.e., cnode, may occur at most once (?), zero or more times (*), or one or more times (+). Note that the implications of these three symbols are not disjoint. For instance, if cnode occurs just once, either <!ELEMENT node (cnode)>, <!ELEMENT node (cnode)?>, <!ELEMENT node (cnode)+>, or even <!ELEMENT node (cnode)*> is a valid declaration. In creating the DTD for a unified view, (cnode), tagged with the appropriate occurrence symbol, is appended to the resulting DTD such that

\[
\begin{cases}
\texttt{(cnode)} & \text{if cnode occurs just once,} \\
\texttt{(cnode)?} & \text{if cnode occurs at most once, and} \\
\texttt{(cnode)*} & \text{if cnode occurs more than once.}
\end{cases}
\]

Besides updating contentspec in an element declaration, we consider attribute list declarations of an element e as well, if any attribute of e exists.
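The occurrence-symbol rule for contentspec given above can be written as a short Python sketch. The function names and the representation of occurrences as a list of per-parent counts are assumptions made for illustration; the use of `*` rather than `+` for "more than once" follows the `student*` tag in Example 10.

```python
def occurrence_symbol(counts):
    """Pick the occurrence symbol for a child name.

    counts -- non-empty list: how many times the child occurs under each
              instance of the parent node (a hypothetical representation)
    """
    if max(counts) > 1:
        return "*"   # the child occurs more than once under some parent
    if min(counts) == 0:
        return "?"   # the child occurs at most once and is sometimes absent
    return ""        # the child occurs exactly once under every parent


def content_particle(name, counts):
    """Return the tagged particle, e.g. '(cnode)?', for use in a contentspec."""
    return "(" + name + ")" + occurrence_symbol(counts)
```

For example, `content_particle("cnode", [0, 1])` gives `(cnode)?`, while `content_particle("cnode", [1, 3])` gives `(cnode)*`.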
When there exist attributes of e in the form $attr_i = attrVal_i$ ($i \geq 1$), this information is added to the DTD (after the element declaration of e) as <!ATTLIST Name attr_i AttType DefaultDecl> for each $attr_i$, where AttType is CDATA, which denotes that the corresponding attribute is of string type, and DefaultDecl is #IMPLIED, which denotes that no default value is provided for the attribute. Note that $attr_i$ may appear more than once with different values $val_1, val_2, \ldots, val_n$, $1 \leq n$. In such a case, $(attrVal_i)$ is first bound to $(val_1)$, then to $(val_1 \mid val_2)$ when $val_2$ is identified, and so forth while the unified view is being processed. Eventually, $(attrVal_i)$ is bound to $(val_1 \mid val_2 \mid \ldots \mid val_n)$.

When we transform a node in an HDG to an element in a DTD, we replace all the spaces in the name of the node with underscores (`_'), since the XML specification does not allow any spaces in a name.

Example 10 Consider the HDG H shown in Figure 10, where data are presented in a top-to-bottom, left-to-right order with the root node `student information' at the top. (A dotted line denotes that only the root node of a subtree is shown in the HDG.) We first add the processing instruction to the DTD D of H. Then, each node v in H is traversed in preorder, and we generate an element declaration for v if v is not a leaf node. The contentspec of an element declaration is updated when its child nodes are visited. In this example, the first visited node is `student information' and hence <!ELEMENT student_information contentspec> is added to D as the root element. The next visited node in H is `student[1],' i.e., student, which results in adding <!ELEMENT student contentspec> to D. At the same time, the contentspec of the element student_information is replaced by (student). Since the node student will be visited four times, student in the contentspec of student_information will eventually be tagged as "student*." Note that some `student' nodes have `major' and `advisor' as their child nodes while others do not, and `major' and `advisor' are each specified at most once for a `student'. (These constraints are not shown in the HDG in Figure 10 due to space limits.) Hence, major and advisor are each followed by `?' in the contentspec of student. A completed DTD is shown below. □

Figure 10: Portion of the Integrated Student Information HDG
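The preorder generation of element declarations described in Example 10 can be sketched as follows. This is not the completed DTD of Example 10: the `Node` class, the helper names, the sample data, and the child name `name` are hypothetical, and the sketch additionally assumes that leaf nodes hold data values, so a node with only leaf children is declared as (#PCDATA).

```python
from collections import Counter


class Node:
    """A node of a hypothetical in-memory HDG."""
    def __init__(self, name, *children):
        self.name = name
        self.children = children


def xml_name(name):
    # XML names cannot contain spaces, so spaces become underscores.
    return name.replace(" ", "_")


def occurrence(counts):
    if max(counts) > 1:
        return "*"
    return "?" if min(counts) == 0 else ""


def generate_dtd(root):
    instances = {}  # element name -> one Counter of child names per instance

    def visit(node):  # preorder traversal
        if not node.children:
            return  # assumption: leaf nodes are data values, not elements
        inner = [c for c in node.children if c.children]
        instances.setdefault(node.name, []).append(Counter(c.name for c in inner))
        for child in inner:
            visit(child)

    visit(root)
    lines = ['<?xml version="1.0" encoding="UTF-8"?>']
    for name, counters in instances.items():
        child_names = []
        for counter in counters:
            for child in counter:
                if child not in child_names:
                    child_names.append(child)
        if child_names:
            spec = ", ".join(
                xml_name(c) + occurrence([counter[c] for counter in counters])
                for c in child_names)
        else:
            spec = "#PCDATA"
        lines.append("<!ELEMENT %s (%s)>" % (xml_name(name), spec))
    return "\n".join(lines)


def student(sid, name, major=None, advisor=None):
    kids = [Node("sid", Node(sid)), Node("name", Node(name))]
    if major:
        kids.append(Node("major", Node(major)))
    if advisor:
        kids.append(Node("advisor", Node(advisor)))
    return Node("student", *kids)


root = Node("student information",
            student("1", "ann", major="cs", advisor="lee"),
            student("2", "bob"))
dtd = generate_dtd(root)
```

With this made-up input, the root is declared as `<!ELEMENT student_information (student*)>` and the student element as `<!ELEMENT student (sid, name, major?, advisor?)>`, mirroring the tagging rules of the example.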
4 Querying from unified views

One of the motivations for integrating multiple data sources is to enhance query processing by avoiding unnecessary query decomposition and the subsequent merging of values retrieved by subqueries. Furthermore, the integration reduces the overhead imposed on the mediator, if one exists, of finding optimal data sources for a particular subquery. For example, suppose the original data sources `student info' (a semi-structured data source), which records who is the advisor of a particular student, and `student information' (a relation), which records which student is taking which course taught by which instructor, are not integrated, and the following query is posed: ($Q_1$) retrieve the names of faculty who advise students taking classes taught by instructor `rob'. In order to answer $Q_1$, one option for the query processor is to first decompose the given query into two subqueries: ($Q_{11}$) group the set of students S who are taking classes taught by instructor `rob', and ($Q_{12}$) retrieve the names of advisors who advise one of the students in S. $Q_{11}$ is then sent to the database `student information.' Upon retrieving the answers from the database `student information,' the query processor submits $Q_{12}$ to the semi-structured data source `student info' for further processing and for generating the answers to $Q_1$. ($Q_{12}$ can be submitted simultaneously with $Q_{11}$ if a concurrent query processing strategy is adopted.) Using the unified view (UV) (see the DTD in Example 10) of the two data sources `student info' and `student information,' $Q_1$ can be written as an SQL query and submitted directly to UV as

    select S.advisor
    from student_information.student S, student_information.is_taking I, student_information.taught_by T
    where T.instructor = 'rob' and T.course = I.course and I.sid = S.sid
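To make the contrast between the decomposed and the unified evaluation of the query concrete, the following Python sketch runs both against an in-memory SQLite database. It is an illustration only: the unified view is modeled here as three plain tables, both strategies run against the same database, and all rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student   (sid TEXT, advisor TEXT);
    CREATE TABLE is_taking (sid TEXT, course TEXT);
    CREATE TABLE taught_by (course TEXT, instructor TEXT);
    INSERT INTO student   VALUES ('s1', 'lee'), ('s2', 'kim'), ('s3', 'lee');
    INSERT INTO is_taking VALUES ('s1', 'cs101'), ('s2', 'cs202'), ('s3', 'cs101');
    INSERT INTO taught_by VALUES ('cs101', 'rob'), ('cs202', 'amy');
""")

# Decomposed strategy: first find the students taking classes taught by 'rob' ...
q11 = conn.execute(
    "SELECT I.sid FROM is_taking I, taught_by T "
    "WHERE T.instructor = 'rob' AND T.course = I.course").fetchall()
students = [row[0] for row in q11]

# ... then look up the advisors of those students in a second query.
placeholders = ", ".join("?" * len(students))
q12 = conn.execute(
    "SELECT DISTINCT advisor FROM student WHERE sid IN (%s)" % placeholders,
    students)
decomposed = sorted(row[0] for row in q12)

# Unified strategy: one query over the integrated view.
rows = conn.execute(
    "SELECT S.advisor FROM student S, is_taking I, taught_by T "
    "WHERE T.instructor = 'rob' AND T.course = I.course AND I.sid = S.sid")
unified = sorted({row[0] for row in rows})
```

Both strategies return the same advisors; the unified form simply avoids the intermediate result and the second round trip.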
5 Concluding remarks

In this paper, we present an automated approach for integrating semi-structured data sources with either relations in the relational data model or other semi-structured data into a semi-structured data graph, called a hierarchical data graph (HDG). In this integration process, we resolve two problems by a process called structure resolution: i) structural differences at the data source level, and ii) structural differences among the branches in HDGs obtained from different data sources. The structure resolution is fully automated. The two resolution methods together yield a more homogeneous and concise view of the given heterogeneous data. In addition, we enhance the portability of the end result of our approach, called the unified view, by presenting an HDG in an XML DTD format instead of a proprietary format.

The data source-level structure resolution process can be adopted by existing relational database systems and semi-structured data so that they can be integrated into a semi-structured data source with no human intervention. The branch-level structure resolution can be used for automating the integration of sets of hierarchical data that have different structures. A unified view is given in the well-known XML grammar and can be used for optimizing queries that access multiple data sources.