Structured Data Retrieval using Cover Density Ranking

Joel Coffman
University of Virginia, Department of Computer Science, Charlottesville, VA
[email protected]

Alfred C. Weaver
University of Virginia, Department of Computer Science, Charlottesville, VA
[email protected]
ABSTRACT

Keyword search in structured data is an important problem spanning both the database and IR communities. As social networking, microblogging, and other data-driven websites store increasing amounts of information in relational databases, users require an effective means for accessing that information. We present structured cover density ranking (a generalization of an unstructured ranking function) as an effective means to order search results from keyword search systems that target structured data. Structured cover density ranking was designed to adhere to users' expectations regarding the ranking of search results, namely that results containing all query keywords appear before results containing a subset of the search terms. Our evaluation shows that our ranking function provides better results than other state-of-the-art ranking functions across 3 different datasets and 150 distinct information needs.
Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]: Retrieval models

General Terms
Design

Keywords
keyword search, structured cover density scoring
1. INTRODUCTION
Despite the wide-ranging success of Internet search engines in making information accessible, searching structured data remains a challenge. Structured data is typically either semi-structured documents (e.g., XML) or information stored in a relational database. Both present significant challenges that defy solutions developed for web search. For instance, the correct granularity of search results must be
reconsidered. A complete XML document might contain a single element that is pertinent to a given query along with many unrelated elements. The DBLP (http://dblp.uni-trier.de/) bibliographic database's XML dump contains more than 1.3 million publications; searching for a particular paper should return only the information about that paper and not the complete bibliography.

Identifying relevant results is further complicated because a physical view of the data often does not match a logical view of the information. For example, relational data is normalized to eliminate redundancy. The schema separates logically connected information, and foreign key relationships identify related rows. Whenever search queries cross these relationships, the data must be mapped back to a logical view to provide meaningful search results. As shown in Figure 1, the information need "who played Professor Henry Jones in Indiana Jones and the Last Crusade" requires data from all four physical relations and is answered by the bottommost logical view.

Recombining the disparate pieces of data into a unified whole makes searching structured data significantly more complex than searching unstructured text. For unstructured text, the granularity of search results is defined to be a single document, and each document may be indexed prior to searches. In contrast, attempting to index all the possible logical views of structured data creates an explosion in the size of the index because the number of possible combinations is limited only by the data itself. For example, Su and Widom [19] indexed a small portion of the possible logical views of a database and found that the index was two to eight times the size of the original data.

Searching structured data continues to grow in importance as modern websites serve increasing amounts of data on demand in response to user actions and to personalize webpages. This data is normally stored in a relational database, and Bergman [1] estimates it to be orders of magnitude larger than the static web. The explosive growth of social networking and microblogging websites contributes to the ever-increasing amount of data hidden from traditional search engines.

In this paper, we investigate the use of cover density ranking to score structured data search results. Clarke et al. [3] proposed cover density ranking as an alternative to traditional information retrieval (IR) scoring functions. Cover density ranking targets short (1–3 term) queries and enforces users' preferences for coordination matching; both are important for Internet searches. Our evaluation specifically
addresses the effectiveness of ranking schemes. We consider efficiency to be largely an orthogonal issue but show in Section 4.3 that under AND semantics existing query processing algorithms are not asymptotically better than exhaustive search.

Our contributions are summarized as follows:

• In contrast to existing ranking techniques described in the literature, we propose ranking results first by coordination level (i.e., the number of terms in common with the query) and then by term co-occurrences. Our ranking scheme directly follows user expectations regarding the order of search results [17] (see also Section 2).

• We generalize cover density ranking for structured documents. Unlike previous work in structured document retrieval, this task is complicated by the uniqueness of cover density ranking, which precludes the use of previous techniques for adapting an unstructured ranking function to handle structured documents.

• We follow accepted practice in evaluating IR systems. Our evaluation includes 50 information needs for each of 3 distinct datasets. We are unaware of any other evaluation reported in the literature that meets this standard. Our results contradict the purported effectiveness of previous work, which suggests the need for standardized frameworks to evaluate similar search systems.

• We show that our ranking scheme is more than twice as effective as previous work in this field for the most normalized dataset in our evaluation (the Internet Movie Database (IMDb)) and also provides modest improvement for other datasets.

In the next section, we review related work. Section 3 presents cover density ranking and our generalization that handles structured documents. We evaluate the effectiveness of our ranking scheme in Section 4 and compare it to three other state-of-the-art ranking schemes. We conclude in Section 5.
2. RELATED WORK
Keyword search systems for structured data generally target either relational or semi-structured data even though the differences between the two are only superficial. In addition, there is further division between systems using IR techniques to rank results and systems using proximity search algorithms. IR techniques are most similar to our work, and we compare against these systems in our evaluation.

DISCOVER [8] proposed the general system architecture that most IR approaches follow. Search results are networks of tuples that contain all search terms. The candidate network generator enumerates all networks of tuples that are potential results, although efficient enumeration requires that a maximum candidate network size be specified. Hristidis et al. [7] later improved DISCOVER's naive ranking function by adapting pivoted normalization scoring [18] to score collections of tuples. Liu et al. [12] proposed four additional normalizations to pivoted normalization scoring; these normalizations explicitly adapt pivoted normalization to a relational context. SPARK [13] improved upon both of these systems by returning to a non-monotonic score aggregation function [16]. Hristidis et al. [7] and Luo et al. [13] both propose efficient query processing algorithms.
Wilkinson et al. [21] found that traditional IR ranking functions perform poorly with short queries. Most user queries on the web contain fewer than three terms, and the average number of terms per query is between two and three [9]. Clarke et al. [3] explicitly designed cover density ranking for short (1–3 term) queries, ensuring that it adheres to users' expectations by initially ranking by coordination level [17]. No previous system, including SPARK [13] (which explicitly considers this issue), is guaranteed to enumerate results in decreasing order of coordination level. The systems can evaluate queries using AND semantics, which comes closer to enumerating results by coordination level than OR semantics, but AND semantics nullifies the advantages of the proposed query processing algorithms and gives these systems the same asymptotic complexity as exhaustive search (see Section 4.3).

In contrast to these IR-inspired techniques, proximity search systems find results that minimize the distance between search terms in the data graph. BANKS [2] introduced the backward expanding search heuristic, which was later improved by bidirectional search [10]. BLINKS [6] uses a bidirectional index to improve performance. Ding et al. [4] proposed a dynamic programming algorithm to find the minimum group Steiner tree and to generate additional results in approximate order. Golenberg et al. [5] guarantee a polynomial delay when enumerating results but initially rank results by height rather than weight. The greatest downside to these systems is the need to maintain an external index of the data graph.
3. COVER DENSITY RANKING
This section reviews cover density ranking and generalizes it for structured documents. We also present a minor modification that normalizes documents by their length.
3.1 Unstructured documents
Clarke et al. [3] proposed cover density ranking in response to the findings of Wilkinson et al. [21] and Rose and Cutting [17]. Both previous efforts acknowledged users' preferences for coordination matching and introduced a ranking scheme that blended coordination level with a more traditional similarity measure. In contrast, cover density ranking eschews existing similarity measures altogether and focuses on ranking documents within coordination levels. Documents that contain all search terms are ranked first, followed by documents that contain all but one search term, etc. Initial ranking of documents by coordination level produces |Q| document sets, C|Q|, C|Q|−1, . . . , C1, where d ∈ Ci ⇒ |{t | t ∈ d and t ∈ Q}| = i and all documents in Ci appear before those in Cj in the results if i > j. Cover density ranking orders the documents within each coordination level by a measure of term co-occurrence.

Following Clarke et al., we define a document as a sequence of terms, i.e. d = (t1, t2, . . . , t|d|) where |d| is the number of terms in d. A document extent is a sequence of terms (tp, . . . , tq) from d and is represented by the ordered pair (p, q) such that 1 ≤ p ≤ q ≤ |d|. An extent satisfies a set of terms T if all the terms in T appear in the extent. For example, if a document d contains all the terms in T, the extent (1, |d|), which represents the complete document, satisfies T. An extent is a cover for T if and only if it satisfies T and does not contain a smaller extent that also satisfies T. The set C is the set of all covers for T in a document d.
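To make the coordination-level partition concrete, the following sketch groups candidate documents into the sets C|Q|, . . . , C1 described above. It is illustrative only (not the authors' implementation), and it reduces each document to the set of terms it contains.

import java.util.*;

// Illustrative sketch: partition documents (modeled as term sets) into coordination levels.
public class CoordinationLevels {

    // Level i holds the ids of documents containing exactly i distinct query terms;
    // higher levels are ranked before lower ones, so the map iterates in descending order.
    static SortedMap<Integer, List<String>> partition(Map<String, Set<String>> docs, Set<String> query) {
        SortedMap<Integer, List<String>> levels = new TreeMap<>(Comparator.reverseOrder());
        docs.forEach((id, terms) -> {
            long matches = query.stream().filter(terms::contains).count();
            if (matches > 0) levels.computeIfAbsent((int) matches, k -> new ArrayList<>()).add(id);
        });
        return levels;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> docs = Map.of(
            "d7",  Set.of("indiana", "jones"),
            "d9",  Set.of("professor", "henry", "jones"),
            "d19", Set.of("indiana", "jones", "and", "the", "last", "crusade"));
        // Query "indiana jones crusade": d19 matches 3 terms, d7 matches 2, d9 matches 1.
        System.out.println(partition(docs, Set.of("indiana", "jones", "crusade")));
        // Prints {3=[d19], 2=[d7], 1=[d9]}.
    }
}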
Physical views:

    Person (id, name):
      d10   Ford, Harrison
      d11   Connery, Sean

    Character (id, name):
      d7    Indiana Jones
      d9    Professor Henry Jones

    Movie (id, title):
      d18   Raiders of the Lost Ark
      d19   Indiana Jones and the Last Crusade

    Cast (personId, characterId, movieId):
      d10   d7   d18
      d10   d7   d19
      d11   d9   d19

Logical views:

    Person (id, Person.name):
      D10       Ford, Harrison
      D11       Connery, Sean

    Person ← Cast → Character (id, Person.name, Character.name):
      D7,10     Ford, Harrison    Indiana Jones
      D9,11     Connery, Sean     Professor Henry Jones

    Person ← Cast → Movie (id, Person.name, Movie.title):
      D10,18    Ford, Harrison    Raiders of the Lost Ark
      D10,19    Ford, Harrison    Indiana Jones and the Last Crusade
      D11,19    Connery, Sean     Indiana Jones and the Last Crusade

    Person ← Cast → Character → Movie (id, Person.name, Character.name, Movie.title):
      D7,10,18  Ford, Harrison    Indiana Jones            Raiders of the Lost Ark
      D7,10,19  Ford, Harrison    Indiana Jones            Indiana Jones and the Last Crusade
      D9,11,19  Connery, Sean     Professor Henry Jones    Indiana Jones and the Last Crusade
Figure 1: Physical vs. logical views of relational data. Physical views are normalized database relations while logical views of the information are not. Logical views form (potentially) relevant search results that present related information as a unified whole. The id fields uniquely identify rows and are referenced by later examples.

As an example, consider the query "indiana jones crusade" and the IMDb physical views shown in Figure 1. In this and later examples, we only consider textual attributes (our implementation allows efficient searches over both text and non-text attributes, e.g., the year of a movie's release, which is not shown but is present in our database), i.e. the title field of the Movie relation, and we use the id field of each relation to uniquely identify documents, e.g. d10 is the person tuple "Harrison Ford." Ranking by coordination level alone gives C3 = {d19}, C2 = {d7}, and C1 = {d9} where Ci is the set of all documents that match i query terms. The cover set of document d7 (the character "Indiana Jones") is C7 = {(1, 2)}. Similarly, C9 = {(3, 3)} and C19 = {(1, 6)}.

Scoring cover sets follows two intuitions: 1) "the more covers contained within a document, the more likely the document is relevant" and 2) "the shorter the cover, the more likely the corresponding text is relevant" [3]. Clarke et al. suggest the following formula for scoring a cover set C = {(p1, q1), (p2, q2), . . . , (pn, qn)}:

    score(C) = Σ_{j=1}^{n} score(pj, qj)                       (1)

where

    score(p, q) = { H / (q − p + 1)   if q − p + 1 > H
                  { 1                 otherwise                (2)
In this formula, H ∈ [1, ∞) is a tuning parameter. Covers smaller than H receive a score of 1, and longer covers receive scores proportional to the inverse of their length. For our previous example, let H = 4. Then score(C7) = 1, score(C9) = 1, and score(C19) = 4/6 = 2/3. The final ordering of results is d19, d7, d9 because documents are ordered first by coordination level and then by cover density within each coordination level.

Cover density ranking does not measure the frequency of individual search terms but rather the frequency and proximity of the search terms' co-occurrences in a document. A document may contain many instances of each search term and yet have a single cover (e.g., all the search terms appear in discrete groups). Hence, cover density ranking not only considers term matches but also forms a crude approximation for phrase matching. When searching for information, a phrase approximation can be particularly useful when the order of the information is unknown (e.g., the name "Harrison Ford" appearing as "Ford, Harrison"). Clarke et al. provide an efficient way to generate cover sets based on the fact that no two covers of a document may share the same start or end position.
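The following sketch illustrates cover generation and the scoring of Equations 1 and 2. It is not Clarke et al.'s implementation; it enumerates covers with a sliding window rather than their start/end-position technique, but it yields the same cover sets for the examples above.

import java.util.*;

// Sketch of cover generation and scoring (Eqs. 1 and 2); not the authors' implementation.
public class CoverDensity {

    // A cover is a minimal extent (p, q), 1-based, whose terms include every query term.
    record Cover(int p, int q) {}

    // Enumerate all covers of the term set in a document (a sequence of terms) with a
    // sliding window: grow the right end until the window satisfies T, then shrink the
    // left end as far as possible; each resulting minimal window is a cover.
    static List<Cover> covers(List<String> doc, Set<String> terms) {
        List<Cover> result = new ArrayList<>();
        Map<String, Integer> needed = new HashMap<>();   // term -> copies still missing (negative = surplus)
        for (String t : terms) needed.put(t, 1);
        int missing = terms.size();
        int left = 0;
        for (int right = 0; right < doc.size(); right++) {
            String t = doc.get(right);
            if (needed.containsKey(t) && needed.merge(t, -1, Integer::sum) == 0) missing--;
            while (missing == 0) {                        // window satisfies T: shrink from the left
                String lt = doc.get(left);
                if (needed.containsKey(lt)) {
                    if (needed.get(lt) == 0) {            // removing lt would break coverage
                        result.add(new Cover(left + 1, right + 1));
                        needed.merge(lt, 1, Integer::sum);
                        missing++;
                    } else {
                        needed.merge(lt, 1, Integer::sum); // surplus copy; keep shrinking
                    }
                }
                left++;
            }
        }
        return result;
    }

    // Eq. 2: covers of length at most H score 1; longer covers score H / length.
    static double score(Cover c, int h) {
        int length = c.q() - c.p() + 1;
        return length > h ? (double) h / length : 1.0;
    }

    // Eq. 1: the score of a cover set is the sum of its covers' scores.
    static double score(List<Cover> coverSet, int h) {
        return coverSet.stream().mapToDouble(c -> score(c, h)).sum();
    }

    public static void main(String[] args) {
        List<String> d19 = List.of("indiana", "jones", "and", "the", "last", "crusade");
        Set<String> query = Set.of("indiana", "jones", "crusade");
        List<Cover> c19 = covers(d19, query);     // [(1, 6)]
        System.out.println(score(c19, 4));        // 4/6 = 0.666..., matching the example with H = 4
    }
}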
3.2 Structured documents
Scoring structured documents differs significantly from scoring unstructured documents. Wilkinson et al. [20] showed that retrieval systems benefit from indexing the constituent parts of a document individually, particularly when a small portion of the document should be returned as a result. Typically, the challenge is properly defining a function that combines the scores of the individual fields that compose the complete document. Ad-hoc definitions (e.g., summing the individual scores) are simple to apply but may damage the original properties of the ranking function (see Robertson et al. [16]). Thus, our primary goal when generalizing cover density ranking to structured documents is retaining its original scoring intuitions. Unlike many other similarity measures that merely consider the frequency of query terms within a document, cover density ranking considers the position of terms within the
document. Previous work assumes traditional similarity measures: Robertson et al. [16] preserved the properties of the original scoring function by concatenating the various fields that compose the structured document and scoring the concatenation. However, this approach is not applicable because we do not want to create new covers when we consider the structured document. As an example, consider a structured document D comprising two fields f1 = (. . . , t1) and f2 = (t2, . . . ) where t1 and t2 are query terms. Concatenating these fields produces D = f1 f2 = (. . . , t1, t2, . . . ), but note that whereas each field previously contained a cover with a coordination level of 1, the concatenation contains a cover with a coordination level of 2 (i.e., CD = {(|f1|, |f1| + 1)}). Hence, concatenating the individual fields can create new covers not present in the original text. Furthermore, it is trivial to prove that the order in which the fields are concatenated can affect the covers of the structured document.

Following the original definitions used for cover density ranking, we define analogues for structured cover density ranking. We model a structured document D as a set of fields where each field is an unstructured document, i.e. D = {d1, d2, . . . , d|D|}. A structured extent ES is a subset of these fields, i.e. ES ⊆ D. A structured extent satisfies a term set T if all the terms of T appear in ES. (Note that the individual terms may be present in different fields.) A structured cover is a structured extent that satisfies T and does not contain a smaller subset of fields that also satisfies T. We denote the set of all structured covers of a structured document D as CS. Like the unstructured version, we initially rank structured documents by coordination level; structured cover density orders the documents within each coordination level.

Consider the query "Ford Jones" and the logical views in Figure 1. A few of the structured documents matching at least one term include D10 = {d10}, D7,10 = {d7, d10}, D10,19 = {d10, d19}, and D7,10,19 = {d7, d10, d19}. D10 has a single cover C10 = {d10} that satisfies the term set {ford}. In contrast, the remaining structured documents all contain both query terms. The structured documents D7,10 and D10,19 have the covers C7,10 = {{d7, d10}} and C10,19 = {{d10, d19}}. The document D7,10,19 has two different covers, C7,10,19 = {{d7, d10}, {d10, d19}}.

Scoring structured covers follows the unstructured scoring definition, i.e.

    score(CS) = Σ_{ES ∈ CS} score(ES)                          (3)

where

    score(ES) = { HS / |ES|   if |ES| > HS
                { 1           otherwise                        (4)
Like H in the original ranking function, HS ∈ [1, ∞) is a tuning parameter. Decreasing HS rewards documents that contain smaller subsets of fields that completely satisfy a query's term set. Using the previous example, assume HS = 1. Then score(C7,10,19) = 1/2 + 1/2 = 1, score(C7,10) = 1/2, score(C10,19) = 1/2, and score(C10) = 1. In the final ordering of results, D7,10,19 appears first, D10 is last, and D7,10 and D10,19 tie for second. When multiple documents score identically, ties are broken using the average score of the fields' (unstructured) extents to capture term proximity within each field.

Scoring structured covers may appear intractable because Karp [11] proved Set-Cover to be NP-complete. While the general problem is intractable, the limited number of fields in a relational database allows exhaustive enumeration of all possible covers. In fact, most database fields do not contain any query terms and may be ignored.
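A sketch of this exhaustive enumeration and of Equations 3 and 4 follows. It is illustrative only (not the authors' implementation); fields are keyed by the tuple identifiers of Figure 1 and reduced to the sets of terms they contain.

import java.util.*;

// Sketch of structured cover enumeration and scoring (Eqs. 3 and 4); exhaustive, as in Sec. 3.2.
public class StructuredCoverDensity {

    // Query terms that appear in a field (each field is modeled as a set of terms).
    static Set<String> matched(Set<String> fieldTerms, Set<String> query) {
        Set<String> m = new HashSet<>(fieldTerms);
        m.retainAll(query);
        return m;
    }

    // Structured covers: subsets of fields that jointly contain every term of T and that
    // contain no smaller satisfying subset. Fields without query terms are dropped first,
    // so the exhaustive enumeration over subsets stays small in practice.
    static List<Set<String>> structuredCovers(Map<String, Set<String>> fields, Set<String> terms) {
        List<String> useful = new ArrayList<>();
        for (Map.Entry<String, Set<String>> e : fields.entrySet())
            if (!matched(e.getValue(), terms).isEmpty()) useful.add(e.getKey());

        List<Set<String>> covers = new ArrayList<>();
        for (int mask = 1; mask < (1 << useful.size()); mask++) {
            Set<String> subset = new HashSet<>();
            Set<String> covered = new HashSet<>();
            for (int j = 0; j < useful.size(); j++)
                if ((mask & (1 << j)) != 0) {
                    subset.add(useful.get(j));
                    covered.addAll(matched(fields.get(useful.get(j)), terms));
                }
            if (!covered.containsAll(terms)) continue;
            // Proper subsets have smaller masks, so any smaller satisfying subset is already a cover.
            boolean minimal = covers.stream().noneMatch(subset::containsAll);
            if (minimal) covers.add(subset);
        }
        return covers;
    }

    // Eq. 4: covers of at most HS fields score 1; larger covers score HS / |ES|.
    static double score(Set<String> cover, int hS) {
        return cover.size() > hS ? (double) hS / cover.size() : 1.0;
    }

    // Eq. 3: sum the scores of all structured covers.
    static double score(List<Set<String>> covers, int hS) {
        return covers.stream().mapToDouble(c -> score(c, hS)).sum();
    }

    public static void main(String[] args) {
        // The logical view D7,10,19 from Figure 1, with each field reduced to its terms.
        Map<String, Set<String>> d7_10_19 = Map.of(
            "d7",  Set.of("indiana", "jones"),
            "d10", Set.of("ford", "harrison"),
            "d19", Set.of("indiana", "jones", "and", "the", "last", "crusade"));
        Set<String> query = Set.of("ford", "jones");
        List<Set<String>> c = structuredCovers(d7_10_19, query);  // {d7, d10} and {d10, d19}
        System.out.println(score(c, 1));   // 1/2 + 1/2 = 1.0, matching the example with HS = 1
    }
}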
3.3 Density normalization
To produce results that better adhere to traditional IR heuristics, we normalize the density component of the scoring function to account for document length. Length is not considered in the original incarnation of cover density, but it becomes important in our setting: the data contained within different fields varies widely, so one field may be extremely short (e.g., a name) whereas a different field might be considerably longer. Normalizing by length does not alter the original properties of cover density ranking, but it does distinguish these two cases. We believe users will prefer rankings that use this normalization because it prefers results that are more specific to the query.

For example, consider the query "Indiana Jones" and documents d7 and d19. Without density normalization, both documents receive an identical score for the query. Density normalization makes d7 (the character "Indiana Jones") the preferred result.
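The text does not spell out the exact normalization formula. The sketch below assumes one simple form, dividing a field's cover-set score by the field's length in terms; this assumption reproduces the d7 versus d19 behavior described above but is not necessarily the authors' definition.

// ASSUMPTION: the exact density normalization is not given in the text; this sketch
// divides a field's cover-set score by the field's length so that shorter, more
// specific fields rank higher.
public class DensityNormalization {
    static double normalized(double coverSetScore, int fieldLength) {
        return coverSetScore / fieldLength;
    }

    public static void main(String[] args) {
        // Query "Indiana Jones", H = 4: d7 and d19 each have one cover of length 2, so both
        // have cover-set score 1. Normalizing by field length breaks the tie in favor of d7.
        System.out.println(normalized(1.0, 2));   // d7  ("Indiana Jones")                      -> 0.5
        System.out.println(normalized(1.0, 6));   // d19 ("Indiana Jones and the Last Crusade") -> 0.1666...
    }
}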
4. EVALUATION
We compare our ranking function with the other IR-inspired ranking functions described in the literature for keyword search in relational databases. (We refer the reader to Hristidis et al. [7] for additional details related to system architecture and query processing.) Our implementations target relational data, but the ranking functions also apply to semi-structured data. "Efficient" [7] is the successor of DISCOVER. "Effective" [12] and SPARK [13] both use slightly different ranking functions. We set the tuning parameters of these systems as suggested by their authors.

Our implementations use Java 1.6, and the database backend is PostgreSQL 8.3. Our Java implementations communicate with the database via JDBC. We set the maximum candidate network size to 5 for each system. For structured cover density ranking, we set H = |Q| (that is, the number of terms in the query) and HS = 1. We do not tune these parameters for our datasets and queries because such tuning would overstate the effectiveness of our ranking scheme and bias the results in our favor. Proper tuning would undoubtedly improve our ranking scheme's effectiveness.
4.1 Datasets
We evaluate the effectiveness of the systems using the three datasets shown in Table 1. The IMDb database is a subset of one created by IMDbPY 4.1. Wikipedia is the 5500 articles from the 2008–2009 Wikipedia schools DVD.

    Dataset       Relations      Tuples      Size (MB)
    Mondial           28           17,115         9
    IMDb               6        1,673,074       516
    Wikipedia          6          206,318       550

Table 1: Characteristics of the evaluation datasets.

For each dataset, we create a variety of information needs (e.g. "the author of Pride and Prejudice") and then derive queries from these information needs. We provide binary relevance judgments for the results where, in keeping with standard IR practice, we assess relevance with regard to the original information need and not its derived query. We do not use queries from search engine logs because the original information need (which is critical to assessing relevance) is unclear for all but the simplest queries. Each dataset has 50 distinct information needs, which is generally considered the minimum for evaluating retrieval systems [14]; we are unaware of any existing evaluation of effectiveness that includes this number of different information needs. (Liu et al. [12] include fifty queries in their evaluation, but many of their queries are reformulations of a single information need.) The average number of terms per query is 2.86, which is comparable to the average for Internet search queries [9].
4.2 Results
We compare the various systems using three metrics: the total number of top-1 relevant results, reciprocal rank, and mean average precision (MAP). The number of top-1 relevant results is the number of queries for which the first result is relevant. Reciprocal rank is similar: it is the reciprocal of the position of the first relevant result. Thus, if the first result is relevant, the reciprocal rank is 1.0; if the second result is the first one that is relevant, the reciprocal rank is 0.5. MAP is a single-value measure of precision across different recall levels. Both reciprocal rank and MAP are averaged across the information needs. Top-1 relevance and reciprocal rank are particularly important for web search engines as users expect high precision. In contrast, MAP considers the complete set of results and is important for understanding the overall effectiveness of a system. We retrieve the top 1000 results from each system to calculate each metric.

Table 2 presents the number of top-1 relevant results and reciprocal rank for each system and dataset. "CD" refers to our ranking function, structured cover density with density normalization.

                           Efficient   Effective   SPARK     CD
    Mondial
      top-1 relevant          21          22         27      36
      reciprocal rank        0.514       0.495      0.607   0.804
    IMDb
      top-1 relevant           4           3          3      13
      reciprocal rank        0.139       0.072      0.115   0.313
    Wikipedia
      top-1 relevant          16          23         19      32
      reciprocal rank        0.485       0.594      0.565   0.674

Table 2: Number of top-1 relevant results and reciprocal rank for each system. Higher scores are better. The maximum number of top-1 relevant results for each system and dataset is 50; the range of reciprocal rank is between 0.0 and 1.0.

Figure 2 graphs the MAP of each system and dataset.

Figure 2: MAP for each system. Higher scores are better; MAP is always in the range 0.0–1.0.

Both the table and figure show structured
cover density ranking outperforming the other three ranking schemes. The advantage of structured cover density is particularly pronounced for reciprocal rank on the IMDb dataset, where structured cover density tends to rank a relevant result third and the other systems rank their first relevant result tenth.

All three of the previous systems tend to have similar scores for each metric. The effectiveness of the systems is comparable for the Mondial and Wikipedia datasets, but effectiveness drops precipitously for IMDb. The previous systems use a modification of pivoted normalization scoring to rank results; this commonality accounts for their similar performance. The small differences may be attributed to their different normalizations, and our datasets reveal the strengths of each of these systems. Due to its harsh size penalty (it prefers small results), Efficient outperforms Effective and SPARK on IMDb because many of the information needs are addressed by a single tuple. Effective does well on the Wikipedia dataset: the mixture of short and long fields (e.g., page titles and article text) plays to its unique normalizations. SPARK provides better results than either of these two systems for the Mondial dataset; we posit that its approximations of various parameters prevent it from providing the best results of these systems on all three datasets.

The drop in retrieval effectiveness for IMDb is particularly disturbing because this dataset is the most structured: few fields contain lengthy, unstructured text. In part, the drop in effectiveness undoubtedly mirrors the increased size of the dataset and the original design of pivoted normalization scoring (i.e., very verbose queries for unstructured documents). In contrast, the drop in effectiveness for structured cover density is less pronounced, and structured cover density significantly outperforms the other systems: it triples the number of top-1 relevant results and more than doubles reciprocal rank and MAP.
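The metrics above can be computed directly from ranked binary relevance judgments. The following sketch is not the authors' evaluation harness; it simply shows the standard computations.

import java.util.List;

// Sketch of the evaluation metrics in Sec. 4.2 (not the authors' evaluation harness).
public class RankMetrics {

    // Reciprocal rank: 1 / position of the first relevant result (0 if none is relevant).
    static double reciprocalRank(List<Boolean> relevant) {
        for (int i = 0; i < relevant.size(); i++)
            if (relevant.get(i)) return 1.0 / (i + 1);
        return 0.0;
    }

    // Average precision for one query: mean of precision@k over the ranks k of relevant results.
    static double averagePrecision(List<Boolean> relevant, int totalRelevant) {
        double sum = 0.0;
        int hits = 0;
        for (int i = 0; i < relevant.size(); i++)
            if (relevant.get(i)) {
                hits++;
                sum += (double) hits / (i + 1);   // precision at this rank
            }
        return totalRelevant == 0 ? 0.0 : sum / totalRelevant;
    }

    public static void main(String[] args) {
        // Ranked list where only the second result is relevant: RR = 0.5, AP = 0.5 (one relevant result).
        List<Boolean> judgments = List.of(false, true, false);
        System.out.println(reciprocalRank(judgments));        // 0.5
        System.out.println(averagePrecision(judgments, 1));   // 0.5
        // MAP is the mean of averagePrecision over all 50 information needs per dataset.
    }
}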
4.3 Efficiency
One of the major concerns for any keyword search system is its efficiency. Our reimplementations of other systems have not been optimized, which makes a performance comparison unfair. Nevertheless, it is not difficult to show that existing query processing algorithms cannot guarantee
asymptotically better performance than exhaustive search when evaluating a query under AND semantics. Both Hristidis et al. [7] and Luo et al. [13] present query processing algorithms that improve efficiency under OR semantics. Evaluating a query under AND semantics, which partially follows users' expectations (users expect results that match all search terms to be ranked before results that match only a subset of the search terms; AND semantics upholds this expectation but fails to return the latter results, i.e., those omitting one or more query terms), is handled by postprocessing: the query is first evaluated under OR semantics, and then results that do not contain all search terms are discarded.

Exhaustive enumeration is not asymptotically worse than evaluation under AND semantics. Existing query processing algorithms rank tuples by their score [7] or by an estimate of the result's maximum score [13]. Neither of these techniques is guaranteed to process all tuples that satisfy all search terms before processing tuples that satisfy only a subset of the search terms. Consider a candidate network that joins tables T and T′. Furthermore, given the keyword query {k1, k2}, let the first half of T's tuples contain k1 and the second half of T's tuples contain k2, and vice versa for T′. In this case, neither the tuples' scores nor any result's maximum score is different (assuming length normalizations, etc. are not a factor). Thus, up to half of the possible results may be processed before identifying a result that is guaranteed to contain both query keywords and to satisfy the query. While this situation is unlikely to occur in practice, it shows that exhaustively processing the set of possible results is not guaranteed to be asymptotically worse than the best known query processing algorithms.
5. CONCLUSIONS AND FUTURE WORK
In this paper we present a generalization of cover density ranking that enables it to handle structured documents. Structured cover density ranking has two advantages over traditional IR similarity measures: 1) it was designed for the short, ambiguous queries commonly submitted to web search engines and 2) it explicitly adheres to users' preferences regarding ranking by coordination level. Our evaluation shows that structured cover density ranking outperforms ranking functions based on traditional IR similarity measures across a variety of metrics. We note that our evaluation stands in marked contrast to the purported effectiveness of these systems. Doubtless the selection of datasets and queries plays a large role in this discrepancy, but it suggests the need for standardized evaluation techniques for these systems.

Section 4.3 shows that no existing query processing algorithm is asymptotically better than exhaustive enumeration when evaluating queries under AND semantics. Clearly much work needs to be done to improve the state of the art in this area. As an initial step, cover density ranking enables a wide variety of heuristics that retain ranking by coordination level yet relax ordering within each coordination level. We believe these heuristics will perform comparably to the existing query processing algorithms while still providing higher quality search results.
6. ACKNOWLEDGMENTS
We thank Andrew Jurik and Michelle McDaniel for their feedback regarding drafts of this paper.
7. REFERENCES
[1] M. K. Bergman. The Deep Web: Surfacing Hidden Value. The Journal of Electronic Publishing, 7, August 2001.
[2] G. Bhalotia, A. Hulgeri, C. Nakhe, S. Chakrabarti, and S. Sudarshan. Keyword Searching and Browsing in Databases using BANKS. In ICDE '02, pages 431–440, February 2002.
[3] C. L. A. Clarke, G. V. Cormack, and E. A. Tudhope. Relevance ranking for one to three term queries. Information Processing and Management, 36(2):291–311, 2000.
[4] B. Ding, J. X. Yu, S. Wang, L. Qin, X. Zhang, and X. Lin. Finding Top-k Min-Cost Connected Trees in Databases. In ICDE '07, pages 836–845, April 2007.
[5] K. Golenberg, B. Kimelfeld, and Y. Sagiv. Keyword Proximity Search in Complex Data Graphs. In SIGMOD '08, pages 927–940, June 2008.
[6] H. He, H. Wang, J. Yang, and P. S. Yu. BLINKS: Ranked Keyword Searches on Graphs. In SIGMOD '07, pages 305–316, June 2007.
[7] V. Hristidis, L. Gravano, and Y. Papakonstantinou. Efficient IR-style Keyword Search over Relational Databases. In VLDB '03, pages 850–861, September 2003.
[8] V. Hristidis and Y. Papakonstantinou. DISCOVER: Keyword Search in Relational Databases. In VLDB '02, pages 670–681. VLDB Endowment, August 2002.
[9] B. J. Jansen and A. Spink. How are we searching the World Wide Web? A comparison of nine search engine transaction logs. Information Processing and Management, 42(1):248–263, 2006.
[10] V. Kacholia, S. Pandit, S. Chakrabarti, S. Sudarshan, R. Desai, and H. Karambelkar. Bidirectional Expansion For Keyword Search on Graph Databases. In VLDB '05, pages 505–516, August 2005.
[11] R. M. Karp. Reducibility Among Combinatorial Problems. Complexity of Computer Computations, 43:85–103, 1972.
[12] F. Liu, C. Yu, W. Meng, and A. Chowdhury. Effective Keyword Search in Relational Databases. In SIGMOD '06, pages 563–574, June 2006.
[13] Y. Luo, X. Lin, W. Wang, and X. Zhou. SPARK: Top-k Keyword Query in Relational Databases. In SIGMOD '07, pages 115–126, June 2007.
[14] C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, New York, NY, 2008.
[15] W. May. Information Extraction and Integration with Florid: The Mondial Case Study. Technical Report 131, Universität Freiburg, Institut für Informatik, 1999. Available from http://dbis.informatik.uni-goettingen.de/Mondial.
[16] S. Robertson, H. Zaragoza, and M. Taylor. Simple BM25 Extension to Multiple Weighted Fields. In CIKM '04, pages 42–49, November 2004.
[17] D. E. Rose and D. R. Cutting. Ranking for Usability: Enhanced Retrieval for Short Queries. Technical Report 163, Apple, 1996.
[18] A. Singhal. Modern Information Retrieval: A Brief Overview. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, pages 35–42, December 2001.
[19] Q. Su and J. Widom. Indexing Relational Database Content Offline for Efficient Keyword-Based Search. In IDEAS '05, pages 297–306, July 2005.
[20] R. Wilkinson. Effective Retrieval of Structured Documents. In SIGIR '94, pages 311–317, August 1994.
[21] R. Wilkinson, J. Zobel, and R. Sacks-Davis. Similarity Measures for Short Queries. In TREC-4, pages 277–285, November 1995.