K-Graph: Selecting Top-k Data Sources for XML Keyword Queries
Khanh Nguyen and Jinli Cao
Department of Computer Science and Computer Engineering, La Trobe University, Melbourne, Australia
{tuan.nguyen, j.cao}@latrobe.edu.au
Abstract. Existing approaches to XML keyword search mostly focus on querying a single data source. However, searching hundreds or even thousands of (distributed) data sources by sequentially querying every one of them is extremely costly and can therefore be impractical. In this paper, we propose an approach for selecting the top-k data sources for a given query in order to avoid the high cost of searching numerous, potentially irrelevant data sources. The proposed approach efficiently selects the top-k most relevant data sources without querying them. We propose a ranking function for measuring the strength of correlation between keywords in a data source and summarize the data sources as keyword correlation graphs (K-Graphs). The top-k relevant data sources are selected by estimating the relevance of the corresponding K-Graphs to the query. Experimental results show that the approach achieves good performance under a variety of experimental parameters.
1 Introduction
Extensible Markup Language (XML) has become a de facto standard for representing and exchanging data, resulting in the proliferation of XML documents distributed over the Internet. Traditionally, XML data are retrieved using structured query languages such as XPath and XQuery, for which users have to learn both the data schema and the query language in order to issue queries effectively. Since the data schema and the query languages may be complex, retrieving XML data using XPath/XQuery is usually limited to advanced users. In that context, keyword-based search over XML data has been proposed as a means to free users from the learning curve of structured query languages, and it has attracted significant attention from researchers in both information retrieval and databases. Querying XML data using keyword-based search has been widely studied in the literature [1–7]; however, most existing approaches focus on query processing over a single data source. Searching through hundreds or even thousands of data sources by sequentially querying each one is extremely expensive and may not be practical, given that efficient query processing even over a single data source is a challenging problem [8–12]. Efficient query processing over a system that integrates numerous data sources is much more challenging still. Thus, query processing over multiple XML data sources remains a challenging problem in practice.
[Figure 1: tree diagrams of two XML data trees, (a) T1, rooted at db1, and (b) T2, rooted at db2. Each tree has nodes numbered 0–14 with the labels inproceeding, article, author and title. Leaf values shown: “Liu” and “XML keyword search” in T1; “Keyword search in relational DBs”, “Liu”, “XML updates”, “Probabilistic XML data” and “XML query optimization” in T2.]
Fig. 1: An example of bibliographic data sources.

In the context of information retrieval (IR), selecting the most useful data sources from a large number of sources has been studied [23–25]. The common approach is to summarize each data source by term statistics (e.g., term frequency and inverse document frequency). Given a query, the system selects the most appropriate data sources by measuring the degree of relevance between the summarized statistics and the query. The statistics thus act as summaries of the data sources for quickly filtering out non-promising sources, which in turn accelerates overall query processing. However, applying IR techniques to XML data may be inadequate for the following reasons. First, although term statistics are effective in IR, term statistics alone do not effectively measure the relevance of an XML source to a given query. This is because the structure (or schema) of XML data conveys rich semantics and should therefore be considered, in addition to the term statistics, when measuring the relevance of a data source. Second, term occurrences alone in an XML data source do not guarantee that relevant results appear in that data source. In other words, the relevance of an XML data source also depends on how closely the query keywords are related in that data source. For demonstration, consider the two data sources T1 and T2 in Figure 1 and a user's search intention “search for article(s) of Liu about XML keyword”, expressed by the query Q = {Liu, XML, keyword}. We can observe that T1 has one “relevant” result for query Q. In contrast, T2 does not contain any “relevant” result for Q.

Table 1: KF-Summary for T1 and T2

keyword    frequency in T1    frequency in T2
Liu        1                  1
XML        1                  3
keyword    1                  1
...        ...                ...
However, if we approximate the relevance of the two data sources to query Q based on the keyword frequency summaries (denoted KF-summary in this paper), T2 will be selected over T1. This is because, according to Table 1, the total frequency of the query keywords in T2 is higher than in T1. We study the low precision of KF-summary for data source selection further in the experimental section. For the above reasons, we conclude that the relevance of an XML data source is not decided by keyword frequency alone; more importantly, it depends on how closely the query keywords are related in each data source.

In this paper, we propose an approach for processing a keyword query over multiple data sources. Our approach selects the top-k relevant data sources to which the query will be forwarded, where k is a user-selected parameter. The contributions of this paper are summarized as follows:
– We propose an approach to select the top-k XML data sources for keyword queries without querying the data sources. To this end, we first propose a method for evaluating the relationships between keywords and summarize each data source as a K-Graph that maintains those keyword relationships.
– We define criteria for ranking the relevance of the data sources to a given keyword query by estimating the correlation of the query keywords in the corresponding K-Graphs, so that the top-k ranked data sources can be selected quickly.
– We conduct experiments on a real-life data set to evaluate the performance of our approach with two evaluation metrics, recall and precision. The experimental analysis shows that our approach performs well under a variety of experimental parameters.

The rest of this paper is organized as follows. Section 2 gives an overview of our approach. Section 3 presents our techniques for summarizing data sources. In Section 4, we propose our ranking functions for estimating the relevance of a data source to a given query and for top-k data source selection. We study the experimental results in Section 5. Section 6 discusses related work. Finally, conclusions are given in Section 7.
2 Overview of Approach
2.1 Data Models and Queries
We consider a data source as an XML document and model each data source as a tree. An XML data tree is defined as $T = (V_T, E_T)$, where $V_T$ is a finite set of nodes representing the elements and attributes of the data tree $T$, and $E_T$ is a set of directed edges in which each edge $e(v_1, v_2)$ represents the parent-child relationship between two nodes $v_1, v_2 \in V_T$. We assume that all values appear in the leaf nodes. For instance, Figure 1 shows two XML data trees $T_1$ and $T_2$ containing information about publications. A keyword query is a set of distinct terms, denoted by $Q = \{k_1, k_2, \ldots, k_q\}$. We adopt AND-semantics for the query: a query result must contain at least one occurrence of each term $k_i \in Q$.

2.2 Overview of Approach
We consider a set of XML data sources $\mathcal{T} = \{T_1, \ldots, T_N\}$, each modeled as an undirected labeled tree. Given a query $Q = \{k_1, \ldots, k_q\}$, we would like to rank the data sources in $\mathcal{T}$ by their usefulness to $Q$. Basically, the usefulness of a data source can be computed as the total score of all its results for query $Q$. However, this approach can favor data sources containing numerous results over those containing high-quality results. For balance, the relevance of a data source $T_i$ to query $Q$ is frequently evaluated as the total score of its top $k$ results, where $k$ is a parameter selected by the user:

\[ score(T_i, Q) = \sum_{j=1}^{k} score(R_j, Q), \tag{1} \]

where $R_j$ is the $j$-th top result of $Q$ in $T_i$ and $score(R_j, Q)$ is the relevance score of $R_j$ to $Q$. Ideally, the data sources would be ranked in descending order of their scores calculated according to Equation 1, and the most useful data sources would then be selected as the ones to which the query is forwarded. However, to calculate these ideal scores, the system needs to execute the query over all the data sources. Because the number of data sources can be very high and the data sources can be very large, searching through all of them can be very time consuming and may thus be impractical. The aim of our work is to present an approach that efficiently and effectively selects the top-k data sources among potentially numerous data sources without querying them. To this end, we construct summaries of the data sources off-line and select the useful data sources on-line by calculating the relevance scores of the summaries to the query. Each summary (called a K-Graph in this paper) of a data source stores the relationships between keywords appearing in that data source, where the keyword relationships are evaluated by our ranking function, which considers both content relevance and structural relevance. Finally, we present two methods for estimating the relevance of data sources to a given query based on the constructed summaries.
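The ideal ranking of Equation 1 can be sketched as follows. This is only an illustration: the function names, the source names "A" and "B", and the per-result scores are hypothetical, and obtaining real result scores would require running the query on every source, which is exactly the cost the approach tries to avoid.

```python
import heapq

def source_score(result_scores, k):
    """Equation 1: sum of the k highest result scores of one data source."""
    return sum(heapq.nlargest(k, result_scores))

def rank_sources(results_by_source, k):
    """Order source names by descending Equation 1 score."""
    return sorted(results_by_source,
                  key=lambda name: source_score(results_by_source[name], k),
                  reverse=True)

# Hypothetical per-result scores (not taken from the paper).
results_by_source = {
    "A": [0.9, 0.8],                        # few, strong results
    "B": [0.3, 0.3, 0.3, 0.3, 0.3, 0.3],    # many, weak results
}
ranking = rank_sources(results_by_source, k=2)
```

With these made-up numbers, summing all results would favor "B", whereas the top-2 sum places "A" first, which is the balancing effect the text describes.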
3 Keyword Correlation Graph (K-Graph)
We summarize a data source as a keyword correlation graph (K-Graph for short). The K-Graph measures the correlation between keywords in the data source. The nodes of the graph are the keywords appearing in the data source. The edge between two keywords $k_i$ and $k_j$ is marked with distinct integers that indicate the lengths of the paths connecting $k_i$ and $k_j$ in the data source. As two keywords can be connected through paths of different lengths, we present, in the following subsections, our method for evaluating the correlation between keywords $k_i$ and $k_j$ at a specific distance $d$ as well as the overall correlation between $k_i$ and $k_j$ in the K-Graph.
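As a minimal sketch of how the distance labels on a K-Graph edge could be collected, the code below walks parent pointers in a tree. The parent map is an assumption: it is a reconstruction of the shape of T2 inferred from the node numbering in Fig. 1(b), and the helper names are hypothetical. The keyword positions ("XML" in nodes 7, 11 and 14, "keyword" in node 4) are the ones used in the worked examples of this section.

```python
from collections import Counter
from itertools import product

# Assumed parent pointers for T2, inferred from the node numbering in Fig. 1(b).
PARENT = {1: 0, 2: 1, 3: 2, 4: 2, 5: 1, 6: 5, 7: 5,
          8: 0, 9: 8, 10: 9, 11: 9, 12: 8, 13: 12, 14: 12}

def ancestors(node):
    """Return the path from node up to the root, node first."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def dist(a, b):
    """Number of edges on the tree path between a and b."""
    steps_from_a = {n: i for i, n in enumerate(ancestors(a))}
    for j, n in enumerate(ancestors(b)):
        if n in steps_from_a:          # first common ancestor
            return steps_from_a[n] + j
    raise ValueError("nodes are not in the same tree")

def edge_label(nodes_i, nodes_j):
    """Map each path length d to the number of node pairs at that distance."""
    return Counter(dist(a, b) for a, b in product(nodes_i, nodes_j))

# "XML" occurs in nodes 7, 11, 14 and "keyword" in node 4 of T2.
label = edge_label([7, 11, 14], [4])
```

The keys of `label` are the integers that mark the edge; the counts are kept as well because the correlation formulas later in this section need the number of paths at each distance.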
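[Figure 2: two K-Graphs over the keyword nodes Liu, XML and keyword, with each edge labeled by the path lengths connecting the two keywords; panel (a) for T1 and panel (b) for T2.]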
Fig. 2: K-Graphs of the keywords {Liu, XML, keyword} in $T_1$ (a) and $T_2$ (b).

Given two nodes $n_i$ and $n_j$ that contain the keywords $k_i$ and $k_j$, respectively, we define the score of $k_i$ and $k_j$ with respect to $n_i$ and $n_j$ as

\[ score(k_i, k_j, n_i, n_j) = \frac{weight(n_i, k_i) + weight(n_j, k_j)}{dist(n_i, n_j) + 1} \tag{2} \]
where $dist(n_i, n_j)$ is the distance between nodes $n_i$ and $n_j$, which measures how strongly the two nodes are related: the shorter the distance, the stronger the relationship. $weight(n_i, k_i)$ measures the content relevance of node $n_i$ with respect to keyword $k_i$. To calculate $weight(n_i, k_i)$, we employ the standard tf*idf measure from information retrieval. The tf*idf measure evaluates the content relevance of a document to a keyword query using both term frequency (i.e., how many times a term appears in a document) and inverse document frequency (i.e., the inverse of how many documents contain the term). To apply it to XML data, we make the following adaptations. First, the term frequency (tf) of a term $k_i$ in node $n_i$ is the number of occurrences of $k_i$ in $n_i$; following [13], we assume that tf is always equal to 1. Second, we define the inverse element frequency (ief) of a term $t$ as the total number $N$ of elements in the XML data tree over the number $N_t$ of elements that contain the term $t$, i.e., $ief_t = N / N_t$. Based on these adaptations, the weight of keyword $k_i$ in node $n_i$ is calculated as

\[ weight(n_i, k_i) = \log_2(1 + tf_{k_i}) \cdot \log_2 ief_{k_i} = \log_2 ief_{k_i} \tag{3} \]
where $tf_{k_i} = 1$. Considering data tree $T_1$, for instance, we have

\[ weight(7, \text{“XML”}) = weight(7, \text{“keyword”}) = \log_2\left(\frac{15}{1}\right) = 3.91 \]

Similarly, from data tree $T_2$ we have

\[ weight(7, \text{“XML”}) = weight(11, \text{“XML”}) = weight(14, \text{“XML”}) = \log_2\left(\frac{15}{3}\right) = 2.32 \]

\[ weight(4, \text{“keyword”}) = \log_2\left(\frac{15}{1}\right) = 3.91 \]
Thus, from the data tree $T_1$ we have:

\[ score(\text{“XML”}, \text{“keyword”}, 7, 7) = \frac{weight(7, \text{“XML”}) + weight(7, \text{“keyword”})}{0 + 1} = 3.91 + 3.91 = 7.82 \]

Similarly, from the data tree $T_2$ we have:

\[ score(\text{“XML”}, \text{“keyword”}, 7, 4) = \frac{weight(7, \text{“XML”}) + weight(4, \text{“keyword”})}{4 + 1} = \frac{2.32 + 3.91}{5} = 1.25 \]

\[ score(\text{“XML”}, \text{“keyword”}, 11, 4) = \frac{weight(11, \text{“XML”}) + weight(4, \text{“keyword”})}{6 + 1} = \frac{2.32 + 3.91}{7} = 0.89 \]

\[ score(\text{“XML”}, \text{“keyword”}, 14, 4) = \frac{weight(14, \text{“XML”}) + weight(4, \text{“keyword”})}{6 + 1} = \frac{2.32 + 3.91}{7} = 0.89 \]
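The arithmetic of Equations 2 and 3 for the running example can be checked with a short sketch. The function and variable names are illustrative assumptions; the element count of 15 and the element frequencies (1 and 3) are the values used in the worked examples above.

```python
from math import log2

N = 15  # number of elements in each example tree (nodes 0..14)

def weight(n_t, tf=1):
    """Equation 3 with ief = N / n_t; tf is fixed to 1 as in the text."""
    return log2(1 + tf) * log2(N / n_t)

def score(weight_i, weight_j, distance):
    """Equation 2: summed node weights, damped by distance + 1."""
    return (weight_i + weight_j) / (distance + 1)

w_rare = weight(1)     # a term contained in one element
w_xml_t2 = weight(3)   # "XML" in T2, contained in three elements

s_t1 = score(w_rare, w_rare, 0)        # T1, nodes 7 and 7
s_t2_d4 = score(w_xml_t2, w_rare, 4)   # T2, nodes 7 and 4
s_t2_d6 = score(w_xml_t2, w_rare, 6)   # T2, nodes 11 (or 14) and 4
```

Computed without intermediate rounding, `s_t1` is about 7.81; the value 7.82 in the text results from adding the two weights after each has been rounded to 3.91.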
3.1 Keyword Correlation at Specific Distance d
Let $S_i$ and $S_j$ be the sets of nodes containing keywords $k_i$ and $k_j$, respectively. We define the correlation between the two keywords $k_i$ and $k_j$ at a specific distance $d$ in a data tree $T$ as

\[ corr(k_i \leftrightarrow_d k_j) = \frac{\sum_{n_i \in S_i,\, n_j \in S_j :\, dist(n_i, n_j) = d} score(k_i, k_j, n_i, n_j)}{f_d(k_i, k_j)} \tag{4} \]

where $score(k_i, k_j, n_i, n_j)$ is calculated by Equation 2 and $f_d(k_i, k_j)$ is the number of $d$-distance paths connecting the two keywords $k_i$ and $k_j$.
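Equation 4 amounts to grouping the pair scores by distance and averaging within each group. The sketch below assumes the pair scores are already available as (distance, score) tuples; the function name and the input layout are hypothetical, and the sample scores are the rounded values computed above for T2.

```python
from collections import defaultdict

def corr_by_distance(pair_scores):
    """Equation 4: average the pair scores separately for each distance d.

    pair_scores is an iterable of (d, score) tuples, one per node pair.
    Returns {d: (corr_d, f_d)}, where f_d is the number of pairs at distance d.
    """
    buckets = defaultdict(list)
    for d, s in pair_scores:
        buckets[d].append(s)
    return {d: (sum(v) / len(v), len(v)) for d, v in buckets.items()}

# Pair scores for ("XML", "keyword") in T2, rounded as in the text:
# node pair (7, 4) at distance 4; pairs (11, 4) and (14, 4) at distance 6.
t2_pairs = [(4, 1.25), (6, 0.89), (6, 0.89)]
t2_corr = corr_by_distance(t2_pairs)
```

Returning the path count alongside each average is a design choice made here for convenience, since both quantities are needed when the per-distance correlations are combined.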
In this way, we measure the correlation between the two keywords $k_i$ and $k_j$ at distance $d$ as the average score of all paths connecting $k_i$ and $k_j$ at distance $d$. For illustration, the correlation between the two keywords “XML” and “keyword” in the data tree $T_1$ at distance 0 is

\[ corr(\text{“XML”} \leftrightarrow_0 \text{“keyword”}) = score(\text{“XML”}, \text{“keyword”}, 7, 7) = 7.82 \]

Similarly, from the data tree $T_2$ we have

\[ corr(\text{“XML”} \leftrightarrow_4 \text{“keyword”}) = score(\text{“XML”}, \text{“keyword”}, 7, 4) = 1.25 \]

\[ corr(\text{“XML”} \leftrightarrow_6 \text{“keyword”}) = \frac{score(\text{“XML”}, \text{“keyword”}, 11, 4) + score(\text{“XML”}, \text{“keyword”}, 14, 4)}{2} = \frac{0.89 + 0.89}{2} = 0.89 \]

Given any two keywords $k_i$ and $k_j$ in the XML data tree $T$, the maximum distance $d_{max}$ between $k_i$ and $k_j$ is at most twice the height of $T$, i.e., $d_{max} \le 2 \cdot h(T)$, where $h(T)$ is the height of the data tree $T$. For instance, both data trees $T_1$ and $T_2$ have height 4, so the distance between any two keywords in those data trees is always less than or equal to 8.

3.2 Keyword Correlation
From the K-Graphs in Figure 2 we can see that two keywords in a data source can be connected at different distances with different correlation strengths, as measured by Equation 4. We now define the total correlation between two keywords $k_i$ and $k_j$ in a K-Graph as follows.

Definition 1 (Keyword Correlation). Let $\omega$ be the maximum allowed length of a path between any two keywords and let $k$ be the maximum number of results expected from an XML tree $T$. For each distance $d$, $f_d(k_i, k_j)$ is the number of $d$-distance paths connecting the pair of keywords $k_i$ and $k_j$. The keyword correlation between $k_i$ and $k_j$ represents the strength of the relationship between $k_i$ and $k_j$ in the data tree $T$ with respect to $k$. If $\sum_{d=0}^{\omega} f_d(k_i, k_j) \le k$,

\[ corr(k_i \leftrightarrow k_j) = \sum_{d=0}^{\omega} corr(k_i \leftrightarrow_d k_j) \cdot f_d(k_i, k_j) \tag{5} \]

Otherwise, if $\sum_{d=0}^{\omega} f_d(k_i, k_j) \ge k$, there exists $\omega' \le \omega$ such that $\sum_{d=0}^{\omega'} f_d(k_i, k_j) \ge k$ and $\sum_{d=0}^{\omega'-1} f_d(k_i, k_j) \le k$, and

\[ corr(k_i \leftrightarrow k_j) = \sum_{d=0}^{\omega'-1} corr(k_i \leftrightarrow_d k_j) \cdot f_d(k_i, k_j) + corr(k_i \leftrightarrow_{\omega'} k_j) \cdot \left(k - \sum_{d=0}^{\omega'-1} f_d(k_i, k_j)\right) \tag{6} \]

Here $corr(k_i \leftrightarrow_d k_j)$ is the correlation at distance $d$ given by Equation 4.
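Equations 5 and 6 can be read as a single truncated sum: paths are consumed in order of increasing distance until $k$ paths have been counted. The sketch below follows that reading. The function name and the dictionary layout are assumptions, and $\omega = 8$ is chosen here only because it is the bound on $d_{max}$ for the example trees.

```python
def keyword_correlation(corr_f_by_distance, k, omega):
    """Equations 5 and 6: weighted sum of corr_d over at most k paths.

    corr_f_by_distance maps d -> (corr_d, f_d); distances above omega
    are ignored, and distances with no path are simply absent.
    """
    total, remaining = 0.0, k
    for d in sorted(corr_f_by_distance):
        if d > omega or remaining <= 0:
            break
        corr_d, f_d = corr_f_by_distance[d]
        used = min(f_d, remaining)   # Equation 6 truncates the last distance
        total += corr_d * used
        remaining -= used
    return total

# Per-distance correlations of ("XML", "keyword"), rounded as in the text.
t1 = {0: (7.82, 1)}
t2 = {4: (1.25, 1), 6: (0.89, 2)}
corr_t1 = keyword_correlation(t1, k=2, omega=8)
corr_t2 = keyword_correlation(t2, k=2, omega=8)
```

When the total number of paths does not exceed $k$, every path is used and the loop reduces to Equation 5; otherwise the final distance contributes only the remaining $k - \sum f_d$ paths, as in Equation 6.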
In other words, the keyword correlation is the total score of up to the top $k$ correlations for each pair of keywords when querying an XML data source. A data source with a higher correlation score for a given pair of keywords will generate better results. We bound the number of results by $k$ to enable a user to control the quality of a data source. Let $k = 2$ be the maximum number of expected results. The correlation of the keyword pair (“XML”, “keyword”) in the data tree $T_1$ is then

\[ corr(\text{“XML”} \leftrightarrow \text{“keyword”}) = corr(\text{“XML”} \leftrightarrow_0 \text{“keyword”}) = 7.82 \]

In the data tree $T_2$ there is one path at distance 4 and there are two paths at distance 6, three in total, so Equation 6 applies with $\omega' = 6$ and only one of the two distance-6 paths is counted:

\[ corr(\text{“XML”} \leftrightarrow \text{“keyword”}) = corr(\text{“XML”} \leftrightarrow_4 \text{“keyword”}) + corr(\text{“XML”} \leftrightarrow_6 \text{“keyword”}) = 1.25 + 0.89 = 2.14 \]

Thus the correlation between the keywords “XML” and “keyword” in the data tree $T_1$ is much stronger than in the data tree $T_2$, despite the higher frequency of those keywords in $T_2$.

3.3 Reducing the Size of K-Graphs
We are aware that indexing all pairs of keywords at all possible distances can result in extremely large K-Graphs. In addition, previous work [1] has pointed out that not all pairs of keywords in an XML tree are meaningfully related, especially keywords that appear far away from each other. Thus, to reduce the size of the K-Graphs, we allow users (i.e., system administrators) to limit the maximum allowed distance between keywords in the K-Graphs, or to define which keyword relationships are meaningful enough to be indexed, e.g., two keywords are meaningfully related if the path connecting them does not contain two nodes with the same label [1]. We leave the study of this issue as future work.
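The two pruning options mentioned above could be sketched as a filter applied to each node pair before it is indexed. This is a hypothetical illustration, not the paper's method: the labels and parent pointers are an assumed reconstruction of T2 from Fig. 1(b), the function names are invented, and the label rule follows a literal reading of the sentence above in which the two endpoint nodes are also checked for repeated labels.

```python
# Assumed labels and parent pointers for T2, inferred from Fig. 1(b).
LABEL = {0: "db2", 1: "inproceeding", 2: "article", 3: "author", 4: "title",
         5: "article", 6: "author", 7: "title", 8: "inproceeding",
         9: "article", 10: "author", 11: "title", 12: "article",
         13: "author", 14: "title"}
PARENT = {1: 0, 2: 1, 3: 2, 4: 2, 5: 1, 6: 5, 7: 5,
          8: 0, 9: 8, 10: 9, 11: 9, 12: 8, 13: 12, 14: 12}

def tree_path(a, b):
    """Nodes on the path from a to b, both endpoints included."""
    up_a = [a]
    while up_a[-1] in PARENT:
        up_a.append(PARENT[up_a[-1]])
    up_b = [b]
    while up_b[-1] not in up_a:          # climb until the paths meet
        up_b.append(PARENT[up_b[-1]])
    meet = up_b[-1]
    return up_a[:up_a.index(meet) + 1] + up_b[-2::-1]

def keep_pair(a, b, max_dist, unique_labels=False):
    """Decide whether the node pair (a, b) should be indexed in the K-Graph."""
    path = tree_path(a, b)
    if len(path) - 1 > max_dist:         # path length in edges
        return False
    if unique_labels:                    # literal reading: no label repeats
        labels = [LABEL[n] for n in path]
        return len(labels) == len(set(labels))
    return True
```

For example, with a maximum distance of 4 the pair (7, 4) is kept and the pair (11, 4) is dropped, while the label rule additionally rejects (7, 4) because its path passes through two article nodes.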
4 Top-k Data Source Selection
In this section, we present our strategy for measuring and selecting appropriate data sources for a given query.

4.1 Estimating Relevant Scores of Data Sources
We estimate the relevance of a data source to a given keyword query under AND-semantics, which has been widely used in existing work [1–7]. This semantics requires that each result contain all query keywords.
Given a multi-keyword query Q = {k1 , k2 , . . . , kq } and a set of XML data sources T = {T1 , T2 , . . . , Tn }, we estimate the relevance of each data source T in T based on one of the following equations: ∑ CORR-S(T, Q) = corr(ki ! kj ) (7) {ki ,kj }⊆Q,i