Query Evaluation for Source Selection and Ranking

George A. Mihaila†    Louiqa Raschid‡    María-Esther Vidal§

March 9, 1999
Abstract
The World Wide Web has the potential to become the preferred medium for the dissemination of information in virtually every domain of activity. Standards and formats for data interchange have addressed the issue of interoperability among heterogeneous sources. However, access to data is still hindered by the challenge of locating data relevant to a particular problem. Further, after a set of relevant sources has been identified, one still must decide which source is best suited for a given task and rank the sources accordingly. Sources may cover different domains, and they may differ with respect to a variety of "quality of data" parameters, including completeness, recency of update, and granularity. To solve this problem of source selection and ranking, we maintain metadata about the content and "quality of data" of sources. We present a query language for source selection and ranking that supports both strict and fuzzy matching on "quality of data" metadata. We then discuss techniques for the efficient evaluation of source selection and ranking queries.

1 Introduction
The World Wide Web has the potential to become the preferred medium for the dissemination of information in virtually every domain of activity. While most of this information is textual, we are witnessing an increasing interest in using this medium as a platform for publishing structured and semi-structured data, such as scientific datasets in various disciplines. For example, large collections of data about the environment are publicly available in online repositories [GHC, GPC, WDC, NGD]. In order to facilitate data exchange, standard interchange formats such as XML-Data have been adopted. Numerous data extraction tools, generically known as wrappers, use various techniques to extract information from Web sources and present it using complex-object, relational, or semi-structured data models.

After a set of relevant sources has been identified, one must still rank these sources according to how well they suit a given task. Sources containing relevant data may differ with respect to the domain and quality of data. Criteria for judging the quality of data in a source have been proposed in the literature [HLW98]. In [MRV99], we identified four quality of data (QoD) parameters (completeness, recency, frequency of updates, and granularity) and presented a solution to publish source descriptions using XML and XML-Data. We also proposed a query language for the selection and ranking of sources based on their QoD parameters.

In this paper, we discuss the use of a partially ordered set of source content quality descriptions (scqd's) to assist in source selection and ranking. We also consider the situation where a combination of scqd's is used in a query, and propose techniques to extend the partially ordered set towards a (partial) lattice representing these combinations. We show that there is a trade-off between the compact representation in the partially ordered set and the accuracy of QoD information when scqd's are combined in the (partial) lattice. These issues are discussed using examples.

This research was partially sponsored by the National Science Foundation grant IRI9630102 and the Defense Advanced Research Projects Agency grant 01-5-28838.
† Department of Computer Science, University of Toronto. E-mail: [email protected]
‡ Smith School of Business and UMIACS, University of Maryland. E-mail: [email protected]
§ UMIACS, University of Maryland and Universidad Simón Bolívar, Venezuela. E-mail: [email protected]
1.1 Motivating Example
Consider a collection of data sources containing meteorological data such as temperature, air pressure, rainfall, etc. We assume a schema with the following types: Temperature(time, city, value), Rainfall(time, city, value), etc. Individual data sources will typically contain data only for a subset of these types for different domains. For example, source S1 may have all the temperature data for Toronto since 1990, source S2 may have 80% of all the temperature and rainfall data for Kingston for the current year, and source S3 may have half of the rainfall and temperature data for Canada since 1950. Sources also differ on the time granularity of their measurement. A source may record one measurement every hour, or two measurements each day, etc. Different sources may contain more (or less) recent data, and may be updated at various rates (daily, weekly, twice a month, etc.). Thus, we need to obtain source content and quality of data metadata for sources, to assist the user in selecting and ranking sources best suited for some query. For example, a scientist who wants to study the evolution of temperature in Toronto over the past fifty years should be able to identify source S3 as containing most of the relevant data.

2 Describing, Qualifying and Querying Data Sources
In this section we introduce a metadata model for source content quality descriptions (scqd's). We then present a query language that exploits scqd's to select among a collection of data sources and rank them.
2.1 Model for Source Content Quality Descriptions
We consider the following model. $T_1, T_2, \ldots, T_n$ are relational types; each type $T_i$ has attributes $A_{i1}, A_{i2}, \ldots, A_{ik}$, and every attribute $A_{ij}$ is associated with a domain $D_{ij}$. A source S contains data for a subset of $T_1, T_2, \ldots, T_n$.

A source S may have several source content quality descriptions (scqd's) describing its contents. An scqd is a tuple (t, cd, c, r, f, g), where t is a type and cd is a content description that specifies domains for some of the attributes of t. The parameters c, r, f, g correspond to the following Quality of Data (QoD) parameters: completeness, recency, frequency of updates, and granularity, respectively. The QoD parameters qualify the data in the source described by the cd: c estimates the fraction of the data in the complete type (a possibly virtual relation that contains all the relevant data for the type) that is available in the source; r states how old the data is; f represents the length of the interval between updates; and g represents the sampling granularity of the data.

Example 2.1 The scqd's that describe the sources in Section 1.1 are as follows:

Temperature(time,city,value)
  S1  scqd11: (Temperature, [(city,{Toronto}),(time,YearSince1990)], 1.0, 3 days, , 1 hour)
  S2  scqd21: (Temperature, [(city,{Kingston}),(time,CurrentYear)], 0.8, 2 days, , 12 hours)
  S3  scqd31: (Temperature, [(city,CityInCanada),(time,YearSince1950)], 0.5, 1 day, , 24 hours)

Rainfall(time,city,value)
  S2  scqd22: (Rainfall, [(city,{Kingston}),(time,CurrentYear)], 1, 2 days, , 12 hours)
  S3  scqd32: (Rainfall, [(city,CityInCanada),(time,YearSince1950)], 0.5, 1 day, , 24 hours)
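As an illustration of the model above, the scqd's of Example 2.1 can be held in a small record type. This is only a sketch under stated assumptions: the class name `Scqd`, its field names, the choice of days and hours as units, and the helper `sources_for` are hypothetical and not part of the paper. Enumerated domains are modelled as frozensets, named domains such as YearSince1990 as plain strings, and the frequency slot, which Example 2.1 leaves unspecified, as `None`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Scqd:
    """One source content quality description (t, cd, c, r, f, g)."""
    source: str                         # owning source, e.g. "S1"
    type: str                           # relational type t
    cd: tuple                           # content description: (attribute, domain) pairs
    completeness: Optional[float]       # c: fraction of the complete type
    recency_days: Optional[float]       # r: age of the data (unit assumed: days)
    frequency_days: Optional[float]     # f: update interval; None when not given
    granularity_hours: Optional[float]  # g: sampling granularity (unit assumed: hours)


TORONTO = frozenset({"Toronto"})
KINGSTON = frozenset({"Kingston"})

# The five scqd's of Example 2.1; named domains are kept as opaque strings.
CATALOG = [
    Scqd("S1", "Temperature", (("city", TORONTO), ("time", "YearSince1990")), 1.0, 3, None, 1),
    Scqd("S2", "Temperature", (("city", KINGSTON), ("time", "CurrentYear")), 0.8, 2, None, 12),
    Scqd("S3", "Temperature", (("city", "CityInCanada"), ("time", "YearSince1950")), 0.5, 1, None, 24),
    Scqd("S2", "Rainfall", (("city", KINGSTON), ("time", "CurrentYear")), 1.0, 2, None, 12),
    Scqd("S3", "Rainfall", (("city", "CityInCanada"), ("time", "YearSince1950")), 0.5, 1, None, 24),
]


def sources_for(catalog, type_name):
    """Return the sorted names of the sources that hold an scqd of the given type."""
    return sorted({q.source for q in catalog if q.type == type_name})
```

For instance, `sources_for(CATALOG, "Rainfall")` lists the sources that describe any rainfall data, which is the first filtering step a selection query would perform.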
2.2 Queries for Selecting and Ranking Sources
We propose a query language that exploits QoD parameters to select among a collection of data sources and rank them. The language can express queries with both strict and fuzzy conditions on the QoD dimensions associated with specific content descriptions. Strict conditions are comparison predicates. Fuzzy conditions, on the other hand, are proximity predicates allowing one to imprecisely specify a desired target value for a certain QoD parameter. The evaluation of a query returns a list of sources that support the specified content description and satisfy the strict QoD conditions. The sources are ranked according to the degree to which they satisfy the fuzzy conditions. We illustrate the features of this language by the following query:
Query 2.1 Find the best 5 sources that maintain information for the temperature in Toronto for the current year. Relevant sources must maintain 60% of all the data and the intervals of samples must be close to 1 hour.
    select best 5 s
    from Source s, Scqd q in s
    where q.type = "Temperature" and
          q.cd = [(city,{Toronto}),(year,{1999})] and
          q.completeness > 0.6 and
          q.granularity close to "1 hour";
The above query selects sources that contain an scqd matching the specified cd and whose completeness is better than the cutoff value, and returns an ordered list of the five sources whose granularity comes closest to 1 hour. In order to produce this ordered list, every source that matches the cd and completeness requirements is assigned a score in the interval [0, 1] for its granularity value. The score reflects the degree to which the source granularity x matches the target granularity value C. It is computed by a function $f_C(x)$ that produces 1.0 for an exact match and approaches 0 for values of the QoD parameter far from the target value C. Examples of such functions are $f_C(x) = 1/(1 + |x/C - 1|)$ and $f_C(x) = \exp(-|x/C - 1|)$.

The query language also supports weighted combinations of fuzzy conditions. For example, suppose we are interested in sources that contain data approximately one week old with granularity close to 1 hour, and we care twice as much about the granularity being close to the target value as we care about the recency. We can express this by including explicit weights for each fuzzy condition:

    select best 5 s
    from Source s, Scqd q in s
    where q.type = "Temperature" and
          (2/3)*(q.granularity close to "1 hour") and
          (1/3)*(q.recency close to "1 week");

The combined score is computed from the individual scores according to the following formula, introduced by Fagin in the context of multimedia databases [Fag98]:

$$\mathrm{score} = (w_1 - w_2)\min(x_1) + 2(w_2 - w_3)\min(x_1, x_2) + \cdots + n\,w_n\min(x_1, x_2, \ldots, x_n), \qquad 1 \ge w_1 \ge w_2 \ge \cdots \ge w_n \ge 0,$$

where $w_1, \ldots, w_n$ are weights summing to 1 and $x_i$ is the score for the QoD parameter weighted by $w_i$.
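The scoring scheme described above can be sketched in a few lines of Python. The two proximity functions follow the examples given in the text, and `weighted_score` evaluates the weighted formula attributed to [Fag98] after sorting the conditions by decreasing weight. The function names and the convention of passing scores and weights as parallel lists are assumptions made for this illustration, not part of the query language.

```python
import math


def closeness_rational(x, target):
    """f_C(x) = 1 / (1 + |x/C - 1|): 1.0 on an exact match, tending to 0 far away."""
    return 1.0 / (1.0 + abs(x / target - 1.0))


def closeness_exp(x, target):
    """f_C(x) = exp(-|x/C - 1|): an alternative proximity function."""
    return math.exp(-abs(x / target - 1.0))


def weighted_score(scores, weights):
    """Combine per-parameter scores x_i with weights w_i (summing to 1).

    Implements (w1-w2)*min(x1) + 2*(w2-w3)*min(x1,x2) + ... + n*wn*min(x1..xn),
    with the conditions ordered so that w1 >= w2 >= ... >= wn.
    """
    if len(scores) != len(weights) or not scores:
        raise ValueError("scores and weights must be non-empty and of equal length")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    # Order the conditions by decreasing weight, keeping each score with its weight.
    pairs = sorted(zip(weights, scores), key=lambda p: p[0], reverse=True)
    w = [p[0] for p in pairs] + [0.0]  # w_{n+1} = 0 closes the telescoping sum
    x = [p[1] for p in pairs]
    total = 0.0
    for i in range(len(x)):
        total += (i + 1) * (w[i] - w[i + 1]) * min(x[: i + 1])
    return total


# A source sampled exactly every hour (target 1 hour) whose data is 14 days old
# (target 7 days), weighted 2/3 for granularity and 1/3 for recency.
granularity_score = closeness_rational(1.0, 1.0)  # exact match
recency_score = closeness_rational(14.0, 7.0)     # twice the target age
combined = weighted_score([granularity_score, recency_score], [2 / 3, 1 / 3])
```

With equal weights the formula reduces to the plain minimum of the scores, and with all the weight on one condition it reduces to that condition's score, which is a quick sanity check on any implementation.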
[Figure 1 diagram. Superbucket SB2 contains bucket B21, {(city,{Toronto})}, recency: [S1,S2], and bucket B22, {(city,{Kingston})}, recency: [S2,S1]. Superbucket SB1 contains bucket B11, {(city,{Toronto,Kingston})}, recency: [S3,(S1,S2),S2,S1,(S2,S1)], and bucket B12, {(city,{Montreal})}, recency: [S3]. Arrows labelled "include" connect B11 to B21 and to B22.]

Figure 1: The partially ordered set for the type Pressure

3 Using a Partially Ordered Set of Scqd's
We group the scqd's and the sources into equivalence classes. The source and scqd pair (S, $scqd_i$), with content description $cd_i$ in $scqd_i$, is in the equivalence class $[cd_j]$ if $cd_i$ includes $cd_j$. We say that $cd_i$ includes $cd_j$ when $cd_i = \{(Att_{i1}, Dom_{i1}), \ldots, (Att_{in}, Dom_{in})\}$, $cd_j = \{(Att_{i1}, Dom_{j1}), \ldots, (Att_{in}, Dom_{jn})\}$, and $Dom_{j1} \subseteq Dom_{i1}, \ldots, Dom_{jn} \subseteq Dom_{in}$. An equivalence class is referred to as a bucket.

Using the relation include between the cd's that characterize the buckets, we construct a partially ordered set of these buckets, as seen in Figure 1. The directed arrow between the buckets B11 and B21 indicates that the cd in B11 includes the cd in B21. A superbucket consists of all the buckets at the same layer of the partially ordered set that cannot be ordered by the relation include. The most general content descriptions, which provide a maximal coverage of the domain, and the sources that support them are in the bottom-most superbucket. Note that a content description $cd_i$ is more general than the content description $cd_j$ if $cd_i$ includes $cd_j$. Each bucket (equivalence class) with content description [cd] maintains one ordered set per QoD parameter for the sources that support this cd. In each ordered set, the sources are ordered by (descending) values of that QoD parameter.

To evaluate a query, we first scan the partially ordered set of buckets in a top-down fashion to find the first superbucket that contains a relevant equivalence class (relevant bucket) for the condition that qualifies the domain in the query, i.e., a bucket whose content description matches the cd condition in the query. Once we find the relevant bucket(s), we choose the sources that satisfy the QoD conditions in the query. They will be ordered by descending QoD value, for each parameter. These ordered sets will be used to produce the graded sets needed by the algorithm presented in [Fag98] to evaluate the queries.
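The include relation and the top-down scan can be sketched as follows, using the buckets of Figure 1 as data. Several points are assumptions made for this illustration: the names `Bucket`, `includes` and `find_relevant` are hypothetical, a cd is modelled as a mapping from attribute to a frozenset of values, "matches" is read as equality of content descriptions, the superbuckets are scanned in the order SB2 then SB1, and the pairing of each recency list with its bucket is inferred from the figure.

```python
from dataclasses import dataclass, field


@dataclass
class Bucket:
    """An equivalence class of (source, scqd) pairs sharing a content description."""
    name: str
    cd: dict                                     # attribute -> frozenset of values
    ordered: dict = field(default_factory=dict)  # QoD parameter -> sources, in ranked order


def includes(cd_i, cd_j):
    """True if cd_i includes cd_j: same attributes, and every domain of cd_j
    is a subset of the corresponding domain of cd_i."""
    return cd_i.keys() == cd_j.keys() and all(cd_j[a] <= cd_i[a] for a in cd_i)


def find_relevant(superbuckets, query_cd):
    """Scan the superbuckets in the order given and return the matching buckets
    of the first superbucket that has any (an empty list if none matches)."""
    for layer in superbuckets:
        hits = [b for b in layer if b.cd == query_cd]
        if hits:
            return hits
    return []


def city(*names):
    """Build a one-attribute content description on the attribute 'city'."""
    return {"city": frozenset(names)}


# Buckets and recency orderings as read from Figure 1 (pairing inferred).
B21 = Bucket("B21", city("Toronto"), {"recency": ["S1", "S2"]})
B22 = Bucket("B22", city("Kingston"), {"recency": ["S2", "S1"]})
B11 = Bucket("B11", city("Toronto", "Kingston"),
             {"recency": ["S3", ("S1", "S2"), "S2", "S1", ("S2", "S1")]})
B12 = Bucket("B12", city("Montreal"), {"recency": ["S3"]})
SB2, SB1 = [B21, B22], [B11, B12]

# A query on Toronto finds bucket B21 and its recency-ordered sources.
hits = find_relevant([SB2, SB1], city("Toronto"))
ranked_by_recency = hits[0].ordered["recency"]
```

The ordered list returned for each QoD parameter is what would be handed to the graded-set algorithm of [Fag98]; filtering on strict QoD conditions is omitted here because the figure records only orderings, not the underlying values.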
3.1 Computational Issues in Accessing Combination of Scqd's
Typically, a query will need to be answered by a combination of sources or scqd's. Such queries can be evaluated in several ways, each of which raises some issues of computational complexity.

Example 3.1 Consider the following scqd's for the type Pressure for sources S1, S2 and S3.

  S1: scqd11: (Pressure, {(city,{Toronto})}, , 2 days, , )
      scqd12: (Pressure, {(city,{Kingston})}, , 6 days, , )
  S2: scqd21: (Pressure, {(city,{Toronto})}, , 4 days, , )
      scqd22: (Pressure, {(city,{Kingston})}, , 5 days, , )
  S3: scqd31: (Pressure, {(city,{Montreal})}, , 2 days, , )
      scqd32: (Pressure, {(city,{Toronto,Kingston})}, , 4 days, , )

Consider the following query:
    select best s
    from Source s, Scqd q in s
    where q.type = "Pressure" and
          q.cd = [(city,{Toronto,Kingston,Montreal})] and
          q.recency