ARTICLE IN PRESS
Information Systems 32 (2007) 560–574 www.elsevier.com/locate/infosys
Enabling soft queries for data retrieval
Hwanjo Yu (a), Seung-won Hwang (b), Kevin Chen-Chuan Chang (c)
(a) Computer Science Department, University of Iowa, Iowa City, IA 52242, USA
(b) Department of Computer Science and Engineering, POSTECH, Pohang, Korea
(c) Computer Science Department, University of Illinois at Urbana-Champaign, Urbana, IL, USA
Received 28 October 2005; received in revised form 26 January 2006; accepted 5 February 2006
Recommended by P. Loucopoulos
Abstract
Data retrieval, i.e., finding relevant data from large databases, has become a serious problem as myriad databases have been brought online on the Web. For instance, querying the for-sale houses in Chicago on realtor.com returns thousands of matching houses. Similarly, querying "digital camera" on froogle.com returns hundreds of thousands of results. Data retrieval is essentially an online ranking problem: ranking data results according to the user's preference effectively and efficiently. This paper proposes a new rank-query framework for effectively incorporating "user-friendly" rank-query formulation into "database (DB)-friendly" rank-query processing, in order to enable "soft" queries on databases. Our framework assumes, as the "back-end," the score-based ranking model for expressive and efficient query processing. On top of the score-based model, as the "front-end," we adopt an SVM-ranking mechanism for providing intuitive and exploratory query formulation. In essence, our framework enables users to formulate queries simply by ordering some sample objects, while the system learns the "DB-friendly" ranking function F from those partial orders. Such learned functions can then be processed and optimized by existing database systems. We demonstrate the efficiency and effectiveness of our framework using real-life user queries and datasets: our results show that the system effectively learns quantitative ranking functions from qualitative user feedback, with efficient online processing.
© 2005 Elsevier B.V. All rights reserved.
Keywords: Soft queries; Data retrieval

1. Introduction
As we move toward a digital world, information abounds everywhere; retrieving desired data thus becomes a ubiquitous challenge. In particular, with the widespread adoption of the Internet, myriad databases have been brought online, providing massive data

Corresponding author. Tel.: +1 319 335 0734.
E-mail addresses:
[email protected] (H. Yu),
[email protected] (S.-w. Hwang),
[email protected] (K.C.-C. Chang).
through searchable query interfaces. (The July 2000 survey of [1] claims that there were 500 billion hidden "pages," or data objects, in 10^5 online sources.) While our databases provide well-maintained, high-quality structured data, with the sheer scale, users face the hurdle of searching and retrieving. This data retrieval problem, that of finding relevant data from large databases, has thus become a clear challenge. (By "retrieval," we intend to stress the relevance-based matching, even for structured "data," much like text retrieval for
0306-4379/$ - see front matter © 2005 Elsevier B.V. All rights reserved. doi:10.1016/j.is.2006.02.001
Fig. 1. Examples of online search facilities for supporting data retrieval.
finding relevant documents.) To illustrate, Fig. 1 shows several example scenarios. Consider user Amy, who is looking for a house in Chicago. She searches realtor.com with a few constraints on city, price, beds, and baths, which returns 3581 matching houses. Similarly, when Amy searches froogle.com for "digital camera", she is again overwhelmed by a total of 746,000 matches. She will have to sift through and sort out all these matches. Or, Amy may realize that she must "narrow" her query; however, at this extreme, which is equally undesirable, she may as well get no hits at all. She will likely manually "oscillate" between these extremes before eventually managing to complete her data retrieval task, if at all.

Relational databases offer little support for such retrieval tasks. Traditional Boolean-based query models like SQL are based on "hard" criteria (e.g., price < $100,000), while users often employ "soft" criteria for their specific senses of "relevance" or "preference." Unlike flat Boolean results, these fuzzy criteria naturally call for ranking, to indicate how well the results match. Such ranking is essential for data retrieval, ordering answers according to their matching "scores." Thus, on one hand, there will not be too many matches, since ranking focuses users on the best matches. On the other hand, neither will there be no hits, since ranking will return even partial matches. While such ranking has been the norm for "text" retrieval [2] (e.g., search engines like Google), it is critically missing in relational database systems for supporting similar "data" retrieval.

To enable such soft queries for data retrieval, we observe two major barriers. First, user-friendliness: the data retrieval system should be "user-friendly," so that ordinary users can easily express their preferences. Note that, unlike traditional data management with mostly "canned transactions" written by application developers, a data retrieval system must accommodate ordinary users who are not able to express their implicit preference by formulating a query or function. Second, DB-friendliness: the system should be "DB-friendly," i.e., compatible with existing relational DBMSs, so that it can be executed and optimized by any DBMS. Note that data retrieval, with many interesting scenarios online, must essentially achieve responsive processing.

While there is existing work on supporting ranking in both the database and machine learning communities (discussed in Section 6), due to their different interests, there have been no efforts toward enabling soft queries for data retrieval. On one hand, the database community has studied rank query processing [3-6]. However, that work clearly lacks support for intuitively formulating the ranking in the first place, to accommodate everyday users (as Section 2 will discuss). On the other hand, the machine learning community has focused on learning or formulating rankings from examples [7,8]. However, such ranking functions are hardly amenable to relational DBMSs for efficient processing.

This paper develops "bridging" techniques between databases and machine learning, to provide systematic solutions for data retrieval. We propose a new framework such that: (1) to achieve user-friendliness, it allows users to qualitatively and intuitively express their preferences by ordering some samples; and (2) to achieve DB-friendliness, it learns a quantitative global ranking function that is amenable to existing relational DBMSs. In summary, our framework seamlessly integrates the front-end machine learner with a back-end processing engine that evaluates the learned functions.
The new contributions of this paper are summarized as follows.

- We develop the duality of the ranking and classification views in Section 3.1, in order to connect the "user-friendly" query formulation (i.e., learning a ranking from relative orderings) with the "DB-friendly" query processing (i.e., processing a ranking from absolute scores).
- We provide an intuitive interpretation of the SVM ranking solution [8], by using the duality and presenting Corollaries 1 and 2 and Remark 1 in Section 3.2.2.
- We develop the top sampling method, which (1) provides an "exploratory" interface to users; (2) further enhances SVM performance for ranking; and (3) is efficiently expressed in SQL and thus facilitates integration with an RDBMS.
- We experimentally show that the top sampling method is efficient and reduces the amount of user feedback needed to achieve high accuracy.

We motivate and describe the architecture of our framework (Section 2) and present the component techniques (Section 3). We demonstrate the efficiency and effectiveness of our framework using real-life queries and data sets (Section 4). We discuss valuable lessons learned from our user study and further challenges in building a data-retrieval-integrated relational system (Section 5). We discuss related work in Section 6.

2. Overview: Bridging rank formulation and processing

This section motivates and introduces our approach. Our goal is to seamlessly integrate user-friendly rank formulation with DB-friendly rank processing. As Section 1 explained, such a "mix" is critical for enabling soft queries for data retrieval.

2.1. DB-friendly rank processing: Score-based model

First, we argue that the score-based ranking model is both amenable and expressive for query processing. To see why, consider a data retrieval scenario, where queries capture preference.

Example 1. Amy, who is looking for a house, prefers those somehow cheap, large, and in a safe area. Assume these "predicates" or "features" are specified (e.g., cheap below), each as a soft predicate returning a matching score.

Predicate cheap(h.price):
  If (h.price > 500,000) Then Return 1.0 - h.price/MAX_PRICE
  Else Return 1.0

To rank results in the order of preference, she may formulate a query combining the features with min as the ranking function. Such a query is expressible in SQL, using the order by clause to order results by the user-specified ranking criteria.

Query Q:
  select h.id, h.address from House h
  where h.city = "Chicago"
  order by min(f1: cheap(h.price), f2: large(h.size), f3: safe(h.zip))

The task in this score-based model is ranking a database of n objects D = {u1, ..., un} (e.g., House in Example 1). For each object u, some m soft predicates f1, ..., fm evaluate u to scores in [0:1], which are then aggregated by some ranking function F, i.e., F(f1, ..., fm)[u] = F(f1[u], ..., fm[u]). All objects are then ranked, highest first, by their ranking scores F(f1, ..., fm)[u], or F[u] for short. For instance, Query Q in Example 1 uses min(f1, f2, f3) as the ranking function.

This view is both (1) amenable and (2) expressive, enabling effective query processing. First, such a ranking function is amenable to existing relational DBMSs and thus already expressible in SQL (e.g., as Query Q in Example 1). Second, it is simple yet expressive, determining a global ordering with a single formula. (Such score-based models have served IR well, e.g., the tf/idf scoring function for ranking. Other emerging ranking models are discussed in Section 6.)

2.2. User-friendly rank formulation: Machine learning approach

While the score-based model is expressive and efficient, formulating such ranking functions is challenging for users. It is far from trivial for a user to articulate how she evaluates each and every object into an absolute numeric score, that is, to express her preference by defining the soft predicates and the function.
Note that, unlike typical relational queries, which are usually formulated by application developers or DB administrators, the common users of data retrieval tasks are ordinary people like Amy. Thus, to accommodate such users, the formulation of rank criteria must be essentially supported; without it, ranking is not usable.

To enable effective rank formulation, we believe the framework should be both intuitive and exploratory. First, preference often stems from relative ordering, without explicit absolute scores. Thus, while scoring is an underlying "computational machinery" to capture a desirable preference, explicit scoring is non-intuitive and overly demanding for most users. To be intuitive, the framework should allow users to specify only relative orderings, or partial orders (but not absolute scores); it is up to the system to infer the underlying ranking function from a few given examples. Second, ranking often requires context knowledge of what objects are available in the database to be ranked. However, data retrieval is inherently exploratory; users are exploring an unfamiliar database for what they want, and thus such context knowledge is often lacking. Thus, the framework should present what is available in the database, and let users focus on only those objects presented. These examples on one hand serve as a "guided" tour of D, and on the other hand provide a sufficient context for user interaction.

Together, both requirements lead us to pursue an interactive "rank-by-examples" paradigm for rank formulation; consequently, the critical ability of "inference by examples" (for finding the implicit ranking criteria) clearly suggests a machine learning approach. With interactive sampling and labeling of training examples, our "learning machine" will infer the desired ranking function. However, unlike a conventional learning problem of classifying objects into groups, our learning machine must learn a global ranking function F that outputs a ranking score for each data object, so that it is adoptable in the score-based ranking model for efficient processing. The learning machine must also learn from partial orders, to provide the intuitive formulation. Additionally, the dynamic nature of online querying poses strict constraints on response time and user intervention: the ranking function must be learned instantly, with minimal user intervention.

2.3. The RankFP framework

Putting it together, we develop the RankFP framework, aiming to integrate a "front-end" for learning-based rank query formulation with a "back-end" for score-based rank query processing. As Fig. 2 illustrates, first, with the "iterative learning" front-end (at the top), our RankFP framework supports users in formulating queries in a process that is both exploratory (the system iteratively shows database sample objects) and intuitive (the user specifies only a partial ordering on the sample). Second, with the score-based rank processing back-end (at the bottom), our framework supports integrated query processing to return ranked answers efficiently. Section 3 will present the techniques for each component.

Fig. 2. Framework RankFP: Rank formulation and processing for data retrieval.

Note that, unlike typical document retrieval tasks, users in data retrieval tasks are often willing
to perform many iterations to further refine the ranking function; a document retrieval task usually ends as soon as the user finds a few satisfying documents. However, users in data retrieval tasks often want to retrieve many possible candidates before they make decisions, because the decision often involves a high cost. For instance, users searching for for-sale houses or digital camcorders do not easily finish their tasks by retrieving just a few good examples.

Rank formulation: The rank formulation module (Fig. 2, top) iteratively interacts with the user to learn the desired ranking function. This process operates in rounds, as Fig. 2 illustrates (at the top) and Fig. 3 shows in detail. In each round, the learning machine selects a sample S of a small number l of objects (for l << |D|; e.g., l = 5 in our study). The user orders these examples by her desired ranking R*; thus she "labels" these examples as training data. The learning machine then constructs a function F from the training examples so far; let R_S^F be the induced ranking over the latest sample S. At convergence, i.e., when R_S^F is sufficiently close to R* (i.e., when F is accurate on S), the learner halts and outputs F as the learned ranking function. (In particular, we measure such convergence with Kendall's tau, the most widely used measure of similarity between two orderings such as R* and R_S^F [9-11].) This "learning-till-convergence" mechanism is simple and elegant, and satisfies our goal of intuitive and exploratory query formulation. (Sections 3.2 and 3.3 will present the techniques for the rank formulation.)

Rank processing: The rank processing module (Fig. 2, bottom) executes the learned function F for online query processing over the entire database. Section 3.1 will present, through Theorem 1, how to connect the learned function to the score-based ranking model for efficient and integrated query processing.

3. The RankFP framework: Enabling rank formulation and processing online

In this section, we present the techniques for realizing the RankFP framework (Fig. 2). First, Section 3.1 develops how we "connect" the score-based ranking view, which is effective for the processing back-end, with the classification view, which is effective for the learning front-end. Second, Section 3.2 investigates SVM as the learning machine (Step 3a in Fig. 3). Finally, Section 3.3 develops techniques to make rank formulation and processing "online," e.g., selective sampling for effective online learning with minimal user intervention (Step 3b in Fig. 3).

Fig. 3. The front-end rank formulation: "learning-till-convergence".
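The convergence test of the learning-till-convergence loop can be sketched as below: a minimal Kendall's tau over two orderings of the same sample. This is an illustrative implementation, not the paper's; the convergence threshold is a hypothetical choice:

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall's tau between two orderings of the same items (best
    first): 1.0 for identical orders, -1.0 for exactly reversed."""
    pos_a = {x: i for i, x in enumerate(rank_a)}
    pos_b = {x: i for i, x in enumerate(rank_b)}
    concordant = discordant = 0
    for x, y in combinations(rank_a, 2):
        # A pair is concordant if both orderings rank x and y the same way.
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0:
            concordant += 1
        else:
            discordant += 1
    n = len(rank_a)
    return (concordant - discordant) / (n * (n - 1) / 2)

# Halt the formulation loop once F's ranking R_S^F over the sample S is
# sufficiently close to the user's ranking R* (threshold is hypothetical).
r_star = ["a", "b", "c", "d", "e"]   # user's ordering of the sample
r_f    = ["a", "b", "c", "e", "d"]   # ordering induced by the learned F
converged = kendall_tau(r_star, r_f) >= 0.8
```

Here only one of the ten pairs disagrees, so tau is 0.8 and the loop would stop; with a stricter threshold the learner would instead select a new sample and continue.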
3.1. Duality of ranking and classification view

As argued in Section 2, the score-based ranking model, viewing a ranking as induced by a ranking function F, is amenable and expressive for query processing. For now, let us assume the ranking function F is linear, i.e., F(f1, ..., fm)[u] = w1*f1[u] + ... + wm*fm[u]. (We will show that this generalizes to nonlinear functions in Section 3.2.) Let w = (w1, ..., wm) be the weight vector (which is what the learner will infer). Also, let f_i = (f1, ..., fm)[u_i] be the feature vector of data object u_i; in our learning framework, these features are simply the attributes of each database tuple (e.g., price, city as in Example 1). We can thus write F_w(f)[u_i] = w . f_i, which maps an object u_i to its score F[u_i] by weighting its various features.

As our hypothesis, suppose there exists such a ranking function F that is consistent with the desired ranking R*. Our ranking problem is thus to induce an order of objects by comparing their scores, and our goal in rank formulation is to find such an F_w (or the weight vector w) such that:

  u_i >=_R* u_j  <=>  F_w(f)[u_i] >= F_w(f)[u_j]   (1)
                 <=>  w . f_i >= w . f_j.          (2)

We denote u_i >=_R u_j, or (u_i, u_j) in R, when u_i is ranked higher than u_j according to an ordering R. To automatically infer the ranking function F_w by applying a machine learning method as the formulation front-end, we rewrite Eq. (2) into the following Eq. (3):

  w . (f_i - f_j) >= 0.   (3)

Now, the learning problem becomes a binary classification problem on pairwise orderings. Let d_ij = f_i - f_j be the feature-difference vector between u_i and u_j. Then it is formulated as the following binary classification problem that determines whether (u_i, u_j) in R*:

  (u_i, u_j) in R*  <=>  F_w(d_ij) >= 0  <=>  w . d_ij >= 0.   (4)

Thus, our rank formulation problem can be stated as follows. Let R* be a ranking over database D. Given training data, or partial orders, as a set {((u_i, u_j), y_ij)}, where u_i in D, u_j in D, and y_ij = +1 if (u_i, u_j) in R* and y_ij = -1 otherwise, our goal is to learn a function F_w that classifies every pair of objects (u_i, u_j) from D with respect to R*, as Eq. (4) defines. We formally develop the duality of the classification and ranking views through the following theorem.

Theorem 1 (duality). Let R = (f_1, f_2, ...) be the global ordering determined by a pairwise function F such that F(d_ij) > 0 for every (f_i, f_j) pair satisfying i < j. Then, F(f_i) > F(f_j) if i < j.

Proof 1. From Eqs. (3) and (4), for all d_ij in R*: F_w(d_ij) > 0 <=> F_w(f_i - f_j) > 0 <=> w . (f_i - f_j) > 0 <=> w . f_i > w . f_j <=> F_w(f_i) > F_w(f_j).

This duality tells us that F, a classifier learned from the pairwise difference vectors, can also be used as the global ranking function generating a score per object, which allows us to seamlessly integrate with existing relational databases.

3.2. Incorporating SVM learning for rank formulation

Our formulation of duality (Theorem 1) enables us to adopt any linear binary classification method (e.g., Perceptron, Winnow, SVM, etc.) [footnote 1] for learning the ranking function F. Among those, Support Vector Machines, or SVMs [12-14], have recently been the most actively developed in the machine learning community, as they demonstrate high generalization performance through the margin maximization property [footnote 2]; that is, they learn an accurate classification function that generalizes well beyond the training data. (Generalization performance denotes the performance of the learned function on "unseen" data.) Applying SVM to Eq. (4) is essentially equivalent to the solution proposed in [8], which learns an ordinal regression function from Eq. (2). In this section, we provide an intuitive interpretation of that solution by using the duality and presenting Corollaries 1 and 2 and Remark 1, which explain how SVM can also improve the generalization of ranking. This justifies that our framework, by adopting SVM as the learning machine, can learn an F_w that is concordant with the given training data (i.e., partial orders from R*) and also generalizes well to rank unseen data with respect to R*. We will also use the analyses in this section to justify our top sampling method in Section 3.3.

[footnote 1] Given a set of (positive and negative) training points, a linear binary classifier will find the weight vector w (and thus the ranking function F_w), which defines a "hyperplane" separating the positive and negative examples.
[footnote 2] SVMs compute the classification boundary with the highest margin, i.e., the distance between the boundary and the closest data points (the support vectors) in the feature space.
3.2.1. SVM classification

Let us first overview SVM classification. Suppose there exists such a function F_w that Eq. (4) holds for some partial ordering R' of R*. Then we can rescale w such that the following Eq. (5) holds for those partial orders:

  For all (u_i, u_j) in R':  w . d_ij >= 1.   (5)
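The rescaling step behind Eq. (5) can be illustrated with a toy numpy check (the difference vectors and weights are hypothetical): dividing w by the smallest positive pairwise margin leaves the induced ranking unchanged, since scaling by a positive constant preserves all score comparisons, while making every training pair satisfy w . d_ij >= 1:

```python
import numpy as np

# Hypothetical difference vectors d_ij for correctly ordered pairs in R'.
D = np.array([[0.4, 0.2],
              [0.8, 0.6],
              [0.4, 0.4]])
w = np.array([0.4, 0.2])

margins = D.dot(w)            # all positive, since Eq. (4) holds on R'
w_scaled = w / margins.min()  # rescale so the tightest pair has margin 1

# Every pair now satisfies Eq. (5), and the ranking induced by w_scaled
# is identical to that induced by w (positive scaling preserves order).
assert np.all(D.dot(w_scaled) >= 1 - 1e-12)
```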