A Hidden Markov Model Approach to Keyword-based Search over Relational Databases
Sonia Bergamaschi1, Francesco Guerra1, Silvia Rota1, and Yannis Velegrakis2
1 Università di Modena e Reggio Emilia, via Università 4, 41121 Modena
[email protected]
2 DISI - Università di Trento, Via Sommarive 14, 38123 Povo (TN)
[email protected]
Abstract. In this paper we present KEYRY (from KEYword to queRY), a system for translating keyword queries over relational databases into SQL queries with the same intended meaning. In contrast with the majority of current keyword-based search tools, our approach does not assume any knowledge about the data instance. It rather adopts a probabilistic approach, based on a Hidden Markov Model, for computing the top-k mappings of user keywords into database terms (names of tables and attributes, domains of attributes). Heuristic rules have been developed to allow keyword searching even without any training. Executing the SQL queries generated from the computed mappings allows the system to return answers to the users. Experimental results conducted over a set of databases show good performance in terms of precision, recall, and execution time.
1 Introduction
Keyword searching is becoming the de-facto standard for searching over information sources. For textual documents, keyword query answering has been well studied, especially in the area of information retrieval [14]. However, for structured data, it has only recently received considerable attention. The existing techniques for keyword searching over structured sources (partially introduced in Section 6) rely heavily on a-priori instance analysis that scans the whole data instance and constructs an index, a symbol table, or some similar structure, which is later used at run time to identify the parts of the database in which each keyword appears. This limits the application of these approaches to cases where direct a-priori access to the data is possible. There is a large number of structured data sources that do not allow any direct access to their contents. Mediator-based systems, for example, typically build a virtual integrated view of a set of data sources. The mediator only exposes the integrated schema and accesses the contents of the data sources only at query execution time. Deep web databases are another example of sources that do not generally expose their full content, but offer only a predefined set of queries that can be answered, e.g., through web forms, or expose only their schema information. Even when access to the data instance is allowed, in a web-scale environment it is not practically feasible to retrieve all the
contents of the sources in order to build an index. The scenario gets even worse when the data source contents are subject to frequent updates, as happens, for instance, in e-commerce sites, forums/blogs, social networks, etc. In this paper we present an approach, implemented in the KEYRY (from KEYword to queRY) prototype, for keyword searching that does not assume any knowledge about the data source contents. The only requirement is that the data source provides a schema, even a loosely described one, for its data. Semantics, extracted directly from the data source schema and provided by freely available knowledge bases (e.g., public ontologies and thesauri), is used to discover the intended meaning of the keywords in a query and to express it in terms of the underlying source structures. The result of the process is an interpretation of the user query in the form of a structured query, formulated in the native source language, that will be executed by the DBMS managing the source. Of course, there are many interpretations of a keyword query in terms of the underlying data source schemas, some more likely and others less likely to capture the semantics that the user had in mind when formulating the keyword query. As usual in keyword-based search systems, assigning a ranking is a critical task, since it spares users from dealing with uninteresting results. One of the innovative aspects of our approach is that we adopt a probabilistic approach based on a Hidden Markov Model (HMM) for mapping user keywords into database terms (names of tables and attributes, domains of attributes). Modeling the problem as a HMM allows us to combine two independent aspects of the searching process: the importance of the order of the keywords (represented by means of the HMM transition probability) and the fact that not all the database terms have the same likelihood of matching a user keyword (by means of the HMM emission probability).
This leads to more meaningful interpretations and, at the same time, significantly reduces the search space. The HMM typically has to be trained in order to achieve good results. In this paper we propose a method providing keyword searching capabilities even without any training data. In particular, we developed parametric heuristic rules (the user may modify the default parameters on the basis of the database schema) that, applied to semantic information directly extracted from the data source schema and from auxiliary knowledge bases, provide initial transition and emission probability scores. Moreover, we developed a variation of the HITS algorithm [9], typically exploited for ranking documents on the basis of the links among them, for computing an authority score for each database term, i.e., the initial state probability required by the HMM. The evaluation of these parameters (see Section 7) demonstrates that good results may be obtained even without any training. More specifically, our key contributions are the following: (i) we propose a new probabilistic approach based on a HMM for keyword-based searching over databases that does not require building any index over the data instance; (ii) we develop a method for providing the HMM initial parameters, thus allowing keyword searching without any training data; (iii) we extend the Viterbi algorithm for decoding the HMM in order to obtain the top-k results; and (iv) we experimentally evaluate our approach on real and synthetic scenarios.
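To make the HITS-based initialization concrete, the following is a minimal sketch, not KEYRY's actual implementation: it runs the classic HITS hub/authority iteration over a hand-built schema graph (edges stand for foreign-key links among a few tables of the DBLP fragment) and normalizes the authority scores so they can serve as an initial state probability distribution. All names and numbers here are illustrative assumptions.

```python
# Sketch: HITS-style authority scores over a schema graph, normalized
# into an initial state probability distribution for the HMM.
# Edges model foreign-key links (referencing table -> referenced table).
def hits_authority(edges, nodes, iterations=50):
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # authority score: sum of hub scores of nodes pointing to this node
        auth = {n: sum(hub[u] for u, v in edges if v == n) or 1e-9 for n in nodes}
        norm = sum(a * a for a in auth.values()) ** 0.5
        auth = {n: a / norm for n, a in auth.items()}
        # hub score: sum of authority scores of nodes this node points to
        hub = {n: sum(auth[v] for u, v in edges if u == n) or 1e-9 for n in nodes}
        norm = sum(h * h for h in hub.values()) ** 0.5
        hub = {n: h / norm for n, h in hub.items()}
    total = sum(auth.values())
    return {n: a / total for n, a in auth.items()}  # sums to 1

nodes = ["Person", "Author_P", "InProceeding", "Proceeding"]
edges = [("Author_P", "Person"), ("Author_P", "InProceeding"),
         ("InProceeding", "Proceeding")]
pi = hits_authority(edges, nodes)
```

The resulting `pi` plays the role of the HMM initial state probability: database terms that are "referenced" more heavily in the schema start out as more likely matches.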
Person(Person_id, Name); Author_P(Person_id, InProc_id); Author_J(Person_id, InProc_id); Editor(Person_id, InProc_id); InProceeding(InProc_id, Title, Pages, URL, Proc_id); Proceeding(Proc_id, Title, Year, ISBN, URL, Booktitle, Series_id, Publisher_id); Journal(Journal_id, Title, Volume, Number, URL, Year, ISSN); Publisher(Publisher_id, Name); Series(Series_id, Title, URL)

Fig. 1. A fragment of the DBLP schema
The paper is organized as follows: Section 2 introduces a motivating example and Section 3 formally defines the problem. Section 4 presents our approach for matching keywords to database terms and Section 5 describes the query formulation process. Section 6 discusses related work, Section 7 provides an experimental evaluation of KEYRY, and finally Section 8 sketches conclusions and future work.
2 KEYRY at a glance
When a keyword query is issued to KEYRY, it first finds the structures that contain those keywords. A keyword in a query may in principle be found in any database term (i.e., as the name of some schema element or a value in its respective domain). This gives rise to a large number of possible configurations (i.e., combinations of each keyword in a query with a data source element). Since no access to the data source instance is assumed possible, selecting the best top-k configurations is a challenging task. Let us suppose that a user formulates the query "Garcia-Molina journal 2011" over a relational data source obtained by integrating data about researchers and their publications from the DBLP database (see Figure 1, where a fragment of the DBLP schema is shown). The keywords in the query may be associated with different data source elements, thus approximating with different semantics the intended meaning of the user query. For example, the user could be looking for the papers written by Garcia-Molina and published in a journal in 2011. Consequently, the keyword journal should be mapped to the table Journal, 2011 to a value of the attribute Year in the Journal table, and Garcia-Molina to a value of the attribute Name of the table Person. Other semantics are also possible, e.g., 2011 could be a page number or part of the journal's ISSN. Not all the keywords may be mapped to all the database terms: certain mappings are actually more likely than others. Since KEYRY does not have any access to the data instance, we exploit the HMM emission probability to rank the likelihood of each possible mapping between
a keyword and a database term. In our approach we approximate the emission probability by means of similarity based on semantics directly extracted from the source schema (e.g., attribute domains, regular expressions) and external knowledge bases. Moreover, based on known patterns of human behavior [8], we know that keywords related to the same topic are typically close in a query. Consequently, adjacent keywords are typically associated with close database terms, i.e., terms that are part of the same table or belong to tables connected through foreign keys. For example, in the previous query the mapping of the keyword journal to the table Journal increases the likelihood that 2011 is mapped to the domain of an attribute of the table Journal. The HMM transition probability, which we estimate on the basis of heuristic rules applied to the database schema, allows KEYRY to model this aspect. The Keyword Matcher, described in Section 4, is the component in charge of computing the top-k configurations that best approximate the intended meaning of a user query. In the second step, the possible paths joining the data structures in a configuration are computed. Different paths correspond to different interpretations. For example, let us consider the query "Garcia-Molina proceedings 2011" and the possible configuration that maps proceedings to the table Proceeding, 2011 to a domain value of the attribute Year in the Proceeding table, and Garcia-Molina to a value of the attribute Name of the table Person. Two possible paths may be computed for this configuration, one involving the tables InProceeding and Author_P, with the meaning of retrieving all the proceedings where Garcia-Molina appears as an author, and the second involving the table Editor and returning the proceedings where Garcia-Molina was an editor. The path computation is the main task of the Query Generation process described in Section 5.
Different strategies have been used in the literature to select the most prominent path, or to provide an internal ranking based on different criteria, such as the length of the join paths. In KEYRY we compute all the possible paths and rank them on the basis of their length.
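A minimal sketch of this path-computation step, under the assumption that the schema is available as a foreign-key adjacency list (the graph below is hand-built from the Fig. 1 fragment; `join_paths` and `max_len` are illustrative names, not KEYRY's API). It enumerates simple join paths between two tables of a configuration via breadth-first search and ranks them by length, shortest first:

```python
from collections import deque

# Sketch: enumerate simple (cycle-free) join paths between two tables
# over the schema graph, then rank them by length, shortest first.
def join_paths(graph, start, goal, max_len=4):
    paths, queue = [], deque([[start]])
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            paths.append(path)
            continue
        if len(path) > max_len:
            continue
        for nxt in graph.get(path[-1], []):
            if nxt not in path:          # keep paths simple (no cycles)
                queue.append(path + [nxt])
    return sorted(paths, key=len)        # rank by join-path length

# Foreign-key links from the DBLP fragment, written as an undirected
# adjacency list for path-finding purposes.
graph = {
    "Person": ["Author_P", "Editor"],
    "Author_P": ["Person", "InProceeding"],
    "Editor": ["Person", "Proceeding"],
    "InProceeding": ["Author_P", "Proceeding"],
    "Proceeding": ["InProceeding", "Editor"],
}
paths = join_paths(graph, "Person", "Proceeding")
```

On this fragment the two paths found correspond exactly to the two interpretations discussed above: Person-Editor-Proceeding (Garcia-Molina as editor) and Person-Author_P-InProceeding-Proceeding (Garcia-Molina as author), with the shorter editor path ranked first.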
3 Problem statement
The translation of keyword queries into SQL queries may be formalized as follows.

Definition 1. Given a database D composed of n tables containing in total m attributes, the database vocabulary of D, denoted as VD, is the set VD = Vt ∪ Va ∪ Vd, where Vt = {Ri, ∀i = 1, . . . , n} is the set of all relational table labels Ri in D, Va = {Aj, ∀j = 1, . . . , m} is the set of all attribute labels Aj in D, and Vd = {Dj, ∀j = 1, . . . , m} is the set of all domain labels Dj in D.

We refer to each label in the database vocabulary VD as a database term. In the rest of the paper we use the dot notation to compose labels, for example: tablei.attributej.domain is the label for the domain of attribute j in table i. We further divide the database vocabulary into two subsets: the schema vocabulary VSC = Vt ∪ Va represents the set of all schema labels, and the domain vocabulary VDO = Vd is the set of all labels concerning the database instance.

Definition 2. A keyword query Q is an ordered l-tuple of keywords (k1, k2, . . . , kl). Each keyword is a specification given by the user about an element of interest.
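Definition 1 can be illustrated with a short sketch. The schema is given here as a hypothetical dictionary mapping table names to attribute lists (this is not KEYRY's actual input format); the dot notation composes labels as described above:

```python
# Sketch of Definition 1: build the database vocabulary
# VD = Vt (tables) ∪ Va (attributes) ∪ Vd (domains)
# from a toy schema description. Labels are composed with dot
# notation, e.g. "Journal.Year.domain".
def build_vocabulary(schema):
    Vt = set(schema)                                              # table labels
    Va = {f"{t}.{a}" for t, attrs in schema.items() for a in attrs}
    Vd = {f"{t}.{a}.domain" for t, attrs in schema.items() for a in attrs}
    return Vt | Va | Vd

schema = {"Journal": ["Journal_id", "Title", "Year"],
          "Person": ["Person_id", "Name"]}
VD = build_vocabulary(schema)  # 2 tables + 5 attributes + 5 domains
```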
Definition 3. A configuration fc(Q) of a keyword query Q on a database D is an injective function from the keywords in Q to database terms in VD.

In other words, a configuration is a mapping that describes each keyword in the original query in terms of database terms. Saying that fc(Q) is injective, we are making the following assumptions:

– Each keyword cannot have more than one meaning in the same configuration, i.e., it is mapped to only one database term.
– Two keywords cannot be mapped to the same database term in a configuration, based on the fact that overspecified queries are only a small fraction of the queries that are typically met in practice [8].
– Every keyword is relevant to the database content, i.e., keywords always have a corresponding database term, which can be a relational table, an attribute, or a value in a domain.

Furthermore, while modelling the keyword-to-database term mappings, we also assume that every keyword denotes an element of interest to the user, i.e., there are no stop-words or unjustified keywords in a query. In this paper we do not address query cleaning issues. We assume that the keyword queries have already been pre-processed using well-known cleansing techniques. Answering a keyword query over a database D means finding the set of SQL queries that best represent the user's intended meaning. Each SQL query includes all the database terms that belong to the image3 of a specific configuration. Such an SQL query is referred to as an interpretation of the keyword query, since it provides a possible meaning of the keyword query in terms of the database vocabulary. In the current work, we consider only select-project-join (SPJ) interpretations, which are typically the queries of interest in similar works [2, 7]. Nevertheless, interpretations involving aggregations [10] are also in our list of future extensions.

Definition 4. An interpretation of a keyword query Q = {k1, k2, . . .
, kl} on a database D using a configuration fc(Q) is an SQL query:

select A1, A2, . . . , Ao
from R1 JOIN R2 JOIN . . . JOIN Rp
where A′1 = v1 AND A′2 = v2 AND . . . AND A′q = vq

such that ∀ki ∈ Q one of the following is true:

– fc(ki) ∈ Vt
– fc(ki) ∈ Va
– fc(ki) ∈ Vd and ⟨A′i, ki⟩ ∈ {⟨A′1, v1⟩, ⟨A′2, v2⟩, . . . , ⟨A′q, vq⟩}

To eliminate redundancies, we require any use of a database term in an interpretation to be justified either by belonging to the image of the configuration, or by participating in a join path connecting two database terms that belong to the image of the configuration. Note that even with this restriction, due to the multiple join paths in a database
Since a configuration is a function, we can talk about its image.
D, it is still possible to have multiple interpretations of a keyword query Q given a certain configuration fc(Q). We use the notation I(Q, fc(Q), D) to refer to the set of these interpretations, and I(Q, D) for the union of all these sets over all the possible configurations of the query Q.

Definition 5. An answer to a keyword query Q over a relational database D is the set ans(Q) = {t | t ∈ evalD(Q′) ∧ Q′ ∈ I(Q, D)}, where evalD(Q′) denotes the evaluation of the SQL query Q′ on the database D.

Note that since there is no restriction on the number of attributes in the select clause of an answer, the answer set of a keyword query will naturally be heterogeneous. Since each keyword in a query can be mapped to a table name, an attribute name, or an attribute domain, there are 2·Σ_{i=1}^{n} |Ri| + n different database terms it can be mapped to, with |Ri| denoting the arity of the relation Ri and n the number of tables in the database. Based on this, and on the fact that no two keywords can be mapped to the same database term, for a query containing l keywords there are |VD|! / (|VD| − l)! possible configurations. Of course, not all the interpretations generated by these configurations are equally meaningful. Some are more likely to represent the intended keyword query semantics. In the following sections we will show how different kinds of meta-information and inter-dependencies between the mappings of keywords to database terms can be exploited in order to effectively and efficiently identify these meaningful interpretations and rank them higher.
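The counting argument above can be checked with a small worked instance. An injective configuration is an ordered selection of l distinct database terms out of |VD|, i.e., a partial permutation:

```python
from math import factorial

# Worked instance of the counting argument: with |VD| database terms
# and l keywords, an injective configuration is an ordered selection
# of l distinct terms, i.e. |VD|! / (|VD| - l)! possibilities.
def num_configurations(vd_size, l):
    return factorial(vd_size) // factorial(vd_size - l)

# A hypothetical vocabulary of 12 database terms and a 3-keyword query
# such as "Garcia-Molina journal 2011":
count = num_configurations(12, 3)  # 12 * 11 * 10 = 1320
```

Even this toy setting yields 1320 candidate configurations, which motivates the probabilistic ranking developed in the next section.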
4 Keyword Matcher
[TODO: this introduction refers to a component in the tool architecture, while the rest of the section talks about modeling. This part has to be moved somewhere else.] The Keyword Matcher is the main KEYRY component and implements a HMM, where the observations represent the keywords and the states represent the data source elements. The main advantage of this representation is that the HMM takes into account both the likelihood of each single keyword-to-data-source-element association and the likelihood of the match between the whole keyword sequence in the query and the data source structure. In this way, the assignment of a keyword to a data source element may increase or decrease the likelihood that another keyword corresponds to some other data source element. Moreover, the HMM can be trained, i.e., the domain-independent start-up parameters may be specialized for a specific data source thanks to users' feedback.

4.1 Modeling keyword search using a HMM
In Section 3 we defined a configuration as a generic injective function that maps keywords to database terms. We now describe how we compute the configurations in our approach. In a first, intuitive attempt to define the configuration function we can divide the problem of matching a whole query to database terms into smaller sub-tasks. In each sub-task the best match between a single keyword and a database term is found. The final solution to the global problem is then the union of the matches found in the sub-tasks. This approach works well when the keywords in a query are independent
of each other, meaning that they do not influence the match of the other keywords to database terms. Unfortunately, this assumption does not hold in real cases; on the contrary, inter-dependencies among keyword meanings are of fundamental importance when disambiguating the keywords themselves. In order to take these dependencies into account, we model the matching function as a sequential process whose order is determined by the keyword ordering in the query. In each step of the process a single keyword is matched against a database term, taking into account the result of the previous keyword match in the sequence. This process has a finite number of steps, equal to the query length, and is stochastic, since the matching between a keyword and a database term is not deterministic: the same keyword can have different meanings in different queries and hence be matched with different database terms; vice-versa, different database terms can match the same keyword in different queries. This type of process can be modeled effectively using a HMM, that is, a stochastic finite state machine where the states are hidden variables. A HMM models a stochastic process that is not directly observable (it is hidden), but that can be observed indirectly through the observable symbols produced by another stochastic process. The model is composed of a finite number, say N, of states. Assuming a time-discrete model, at each time step a new state is entered based on a transition probability distribution, and an observation is produced according to an emission probability distribution that depends on the current state, where both these distributions are independent of time. At time t = 1 the process starts from an initial state chosen according to an initial state probability distribution. In the rest of the paper we will consider first-order HMMs with discrete observations.
In these models the Markov property is respected, meaning that the transition probability distribution of the states at time t + 1 depends only on the current state at time t and does not depend on the past states at times 1, 2, . . . , t − 1. Moreover, the observations are discrete, meaning that there exists a finite number, say M, of observable symbols, hence the emission probability distributions can be effectively represented using multinomial distributions dependent on the states. The model is formalized as follows:

– states in the model: S = {si}, 1 ≤ i ≤ N
– observation symbols: V = {vj}, 1 ≤ j ≤ M
– transition probability distribution: A = {aij}, 1 ≤ i ≤ N, 1 ≤ j ≤ N, where aij = P(qt+1 = sj | qt = si) and Σj aij = 1, 0 ≤ aij
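The machinery formalized above can be sketched in a few lines. The following is a minimal first-order HMM with discrete observations and standard Viterbi decoding; states stand for database terms and observations for keywords, but all probabilities below are illustrative toy numbers, not KEYRY's trained or heuristic parameters:

```python
# Sketch: Viterbi decoding of a first-order discrete HMM.
# pi: initial state probabilities; A: transition probabilities
# A[r][s] = P(q_{t+1}=s | q_t=r); B: emission probabilities
# B[s][o] = P(observation o | state s).
def viterbi(obs, states, pi, A, B):
    # delta[s] = probability of the best state sequence ending in s
    delta = {s: pi[s] * B[s][obs[0]] for s in states}
    back = []
    for o in obs[1:]:
        prev, delta, ptr = delta, {}, {}
        for s in states:
            best = max(states, key=lambda r: prev[r] * A[r][s])
            delta[s] = prev[best] * A[best][s] * B[s][o]
            ptr[s] = best
        back.append(ptr)
    # backtrack from the most likely final state
    path = [max(states, key=lambda s: delta[s])]
    for ptr in reversed(back):
        path.insert(0, ptr[path[0]])
    return path

states = ["Person.Name.domain", "Journal", "Journal.Year.domain"]
pi = {"Person.Name.domain": 0.4, "Journal": 0.4, "Journal.Year.domain": 0.2}
A = {"Person.Name.domain": {"Person.Name.domain": 0.1, "Journal": 0.6, "Journal.Year.domain": 0.3},
     "Journal":            {"Person.Name.domain": 0.2, "Journal": 0.1, "Journal.Year.domain": 0.7},
     "Journal.Year.domain":{"Person.Name.domain": 0.3, "Journal": 0.5, "Journal.Year.domain": 0.2}}
B = {"Person.Name.domain": {"Garcia-Molina": 0.8, "journal": 0.1, "2011": 0.1},
     "Journal":            {"Garcia-Molina": 0.1, "journal": 0.8, "2011": 0.1},
     "Journal.Year.domain":{"Garcia-Molina": 0.1, "journal": 0.1, "2011": 0.8}}
path = viterbi(["Garcia-Molina", "journal", "2011"], states, pi, A, B)
```

With these toy parameters the decoded state sequence maps the running example "Garcia-Molina journal 2011" to Person.Name.domain, Journal, and Journal.Year.domain, the configuration discussed in Section 2.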