Using Quality of Data Metadata for Source Selection and ... - CiteSeerX

Using Quality of Data Metadata for Source Selection and Ranking

George A. Mihaila

Louiqa Raschid

María-Esther Vidal

Department of Computer Science University of Toronto

Smith School of Business and UMIACS University of Maryland

[email protected]

[email protected]

UMIACS, University of Maryland

Universidad Simón Bolívar, Venezuela

ABSTRACT

The World Wide Web has become the preferred medium for the dissemination of information in virtually every domain of activity. Standards and formats for data interchange have addressed the issue of interoperability among heterogeneous sources. However, access to data is still hindered by the challenge of locating data relevant to a particular problem. Further, after a set of relevant sources has been identified, one must still decide which source is best suited for a given task and appropriately rank these sources. WWW sources typically cover different domains, and they may differ considerably with respect to a variety of quality of data (qod) parameters, which include completeness, recency of update, granularity, etc. In order to solve this problem of source selection and ranking using qod parameters, we maintain metadata about source content and quality of data, or scqd metadata. We use a data model for representing scqd metadata similar to those used in a data warehouse environment. We present a query language for source selection and ranking that supports both strict and fuzzy matching of scqd metadata. We then discuss the efficient organization of the scqd metadata to support source selection. We discuss how scqd metadata can be organized in partially ordered sets (po-sets) to support efficient query processing. Some queries can only be answered by a combination of sources. To avoid enumerating an exponential number of possible combinations, we propose a heuristic that gradually extends the associated po-set towards a lattice. We outline the loss of accuracy of the qod metadata incurred by this approach.

1. INTRODUCTION

The World Wide Web has become the preferred medium for the dissemination of information in virtually every domain of activity. While most of this information is textual, we are witnessing an increasing interest in using this medium as a platform for publishing structured and semi-structured data, such as scientific datasets in various disciplines. For example, large collections of data about the environment are publicly available in online repositories [3, 4, 11, 9]. In order to facilitate data exchange, standard interchange formats such as XML and XML-Data have been adopted. Numerous data extraction tools, generically known as wrappers, use various techniques to extract information from Web sources and present it using complex objects, relational, or semistructured data models.

This research was partially sponsored by the National Science Foundation grant IRI9630102 and the Defense Advanced Research Projects Agency grant 01-5-28838.

After a set of relevant sources has been identified, one still must decide which source is best suited for a given task and appropriately rank these sources. WWW sources typically cover different domains, and they may differ considerably with respect to their contents and quality of data (QoD) parameters. Criteria for judging the quality of data in a source have been proposed in the literature [5]. In [8], we identified four quality of data (QoD) parameters (completeness, recency, frequency of updates, and granularity) and presented a solution to publish this metadata using XML and XML-Data. These four parameters are a subset of the rich geo-spatial metadata standards that have been developed and widely accepted, including the ANZLIC (Australian New Zealand Land Information Council) standard [1] and the DIF (Directory Interchange Format) [2].

In order to solve this problem of source selection and ranking using QoD parameters, we maintain metadata about source content and quality of data, or scqd metadata. We present a data model for the source content quality descriptions (scqd's) that are used to express the source metadata. The data model is similar to those used in a data warehousing environment, and is based on a set of dimension attributes, a set of measure attributes, domains, and a set of QoD parameters. We present a query language for the selection and ranking of sources based on their QoD parameters. Then, we present some problems in efficient representation and selection techniques for the scqd's, to assist in online source selection and ranking. We consider the situation where a combination of one or more scqd's (or a combination of sources) is needed to answer a query. We discuss how partially ordered sets (po-sets) can be used to provide a compact representation for combinations of scqd's.
We discuss loss of accuracy of QoD metadata in this compact representation and discuss some performance trade-offs.

1.1 Motivating Example

Consider a collection of data sources containing meteorological data such as temperature, air pressure, rainfall, etc. These correspond to the measure attributes in a data warehouse. We can also consider some dimension attributes, e.g., Time and Location (City). Suppose we visualize the data in several sources using the following relations: Air(time, city, temperature, pressure), Precipitation(time, city, rainfall), etc. These sources record measurements in the form of time series data at some given granularity, say, one tuple every hour.

Individual data sources will typically contain data only for a subset of these types for different domains. Sources also differ on the time granularity of their measurements. A source may record one measurement every hour, or two measurements each day, etc. Finally, for some domain and time granularity, source S1 may have all the temperature and pressure data for Toronto since 1990, while source S2 may have 70% of all the temperature data, 40% of the pressure data and 90% of the rainfall data for New York for the current year, and source S3 may have half of the rainfall and temperature data for Canada since 1950.

Another facet of data quality is its timeliness. Different sources may contain more (or less) recent data, and may be updated at various rates (daily, weekly, twice a month, etc.). In many cases data quality degrades over time, so consumers will seek the most recent data for their applications.

Thus, we need to obtain source content and quality of data metadata for sources, to assist the user in selecting and ranking sources best suited for some query. For example, a scientist who wants to study the evolution of temperature in Toronto over the past fifty years should be able to identify source S3 as containing most of the relevant data.

2. DESCRIBING AND QUERYING SOURCES

We introduce a metadata model for source content quality descriptions (scqd's) and a query language to select among a collection of data sources and rank them.

2.1 Model for Source Content Quality Descriptions

We consider the following model:

SODA is a set of dimension attributes (e.g., city, time).

SOMA is a set of measure attributes (e.g., temperature, pressure).

T1, T2, ..., Tn are relational types; each type Ti has attributes $A_{i1}, A_{i2}, \ldots, A_{ik_i} \in SODA \cup SOMA$.

Every attribute $A_{ij}$ is associated with a domain $dom(A_{ij})$.

A source S contains data for a subset of T1, T2, ..., Tn.

A source S may have several source content quality descriptions (scqd's) describing its contents.

An scqd is a triple (t, cd, qods), where:

t is a type;

cd is a content descriptor that specifies domains for some of the dimension attributes of t;

qods is a set of qod descriptors. A qod descriptor is a tuple (lcd, c, r, f, g, soma), where:

lcd is a content descriptor corresponding to the contents of some source;

the parameters c, r, f, g correspond to the following Quality of Data (QoD) parameters: completeness, recency, frequency of updates and granularity, respectively;

$soma \subseteq SOMA$ is a subset of the measure attributes of t.

The QoD parameters qualify the soma attributes in the source described by the lcd. They are as follows: c estimates the fraction of the data in the complete type (footnote 1) that is available in the source; r states how old the data is; f represents the length of the interval between updates of the data; and g specifies the granularity of some dimension attributes.

Example 1. The scqd's describing the sources in Section 1.1 are as follows (we only show the scqd's qualifying the Air data; similar descriptors can be written for the Precipitation data):

Air(time, city, temperature, pressure)

S1
scqd11: (Air, [(city, CityInCanada), (time, YearSince1990)], {qod111})
qod111: ([(city, {Toronto}), (time, CurrentYear)], 1.0, 3 days, , (time, 1 hour), {temperature, pressure})

S2
scqd21: (Air, [(city, CityInUSA), (time, CurrentYear)], {qod211, qod212})
qod211: ([(city, {NYC}), (time, CurrentYear)], 0.7, 3 days, , (time, 4 hours), {temperature})
qod212: ([(city, {NYC}), (time, CurrentYear)], 0.4, 3 days, , (time, 12 hours), {pressure})

S3
scqd31: (Air, [(city, CityInCanada), (time, YearSince1950)], {qod311})
qod311: ([(city, CityInCanada), (time, YearSince1950)], 0.1, 28 days, , (time, 24 hours), {temperature, pressure})
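As an illustration only, the model of Section 2.1 can be written down as plain records. The following Python sketch encodes the S1 entry of Example 1. The class and field names are hypothetical, the enumerated domains (CITY_IN_CANADA, YEAR_SINCE_1990, CURRENT_YEAR) are stand-ins that the paper names but does not list, and the frequency parameter f is left unset because Example 1 does not give a value for it.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical stand-in domains: the paper names these domains but does not
# enumerate them, and the value chosen for the current year is arbitrary.
CITY_IN_CANADA = frozenset({"Toronto", "Kingston", "Ottawa"})
YEAR_SINCE_1990 = frozenset(range(1990, 2001))
CURRENT_YEAR = frozenset({2000})


@dataclass(frozen=True)
class QodDescriptor:
    lcd: tuple                 # ((dimension attribute, domain), ...)
    completeness: float        # c: fraction of the complete type available
    recency_days: int          # r: age of the data, in days
    frequency: Optional[str]   # f: update interval (not given in Example 1)
    granularity: tuple         # g: (dimension attribute, step in hours)
    soma: frozenset            # measure attributes qualified by this descriptor


@dataclass(frozen=True)
class Scqd:
    type_name: str             # t
    cd: tuple                  # content descriptor of the scqd
    qods: tuple                # qod descriptors attached to the scqd


# The S1 entry of Example 1.
qod111 = QodDescriptor(
    lcd=(("city", frozenset({"Toronto"})), ("time", CURRENT_YEAR)),
    completeness=1.0,
    recency_days=3,
    frequency=None,
    granularity=("time", 1),
    soma=frozenset({"temperature", "pressure"}),
)
scqd11 = Scqd(
    type_name="Air",
    cd=(("city", CITY_IN_CANADA), ("time", YEAR_SINCE_1990)),
    qods=(qod111,),
)
```

Frozen dataclasses and frozensets keep the descriptors hashable, which is convenient if they are later used as grouping keys.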

Sources will publish their metadata information using the WWW. In Figure 1, we encode this information in an XML file, using the WS-XML format [8]. In related research, we propose the WebSemantics system for locating data sources published using this encoding [7].

1 The complete type is a possibly virtual relation that contains all the relevant data for the type.

... (source connection information) ...

Figure 1: Encoding QoD metadata in a WS-XML document

2.2 Queries for Selecting and Ranking Sources

We propose a query language that exploits QoD parameters to select among a collection of data sources and rank them. The language can express queries with both strict and fuzzy conditions on the QoD dimensions associated with specific content descriptions. Strict conditions are comparison predicates. Fuzzy conditions, on the other hand, are proximity predicates allowing one to imprecisely specify a desired target value for a certain QoD parameter. The evaluation of a query returns a list of sources that support the specified content description and satisfy the strict QoD conditions. The sources are ranked according to the degree to which they satisfy the fuzzy conditions. We illustrate the features of this language by the following query:

Query 1. Find the best 2 sources that maintain information for the temperature in Toronto for the current year. Relevant sources must maintain 60% of all the data and the intervals of samples must be close to 1 hour.

select best 2 s
from Source s, Scqd c in s.scqds, Qod q in c.qods
where c.type = "Air" and q.soma >= {"temperature"}
  and q.lcd >= [(city,{Toronto}),(time,CurrentYear)]
  and q.completeness > 0.6 and q.granularity close to "1 hour";

A relevant source for this query is a source that contains an scqd that is greater than the specified cd (footnote 2), and whose completeness is better than the cutoff value. Note the overloaded use of the ">=" operator. In the predicate on q.soma, the ">=" operator means set inclusion. However, in the predicate on q.lcd, it means set inclusion of the domains of the corresponding dimension attributes, e.g., city and time. This query will return an ordered list of sources whose granularity comes closest to 1 hour, i.e., S1 (g = 1 hour); S3 (g = 24 hours). In order to produce this list, all the sources matching the lcd and completeness QoD are assigned a score in the interval [0, 1]. The score reflects the degree to which the source granularity x matches the target granularity value (1 hour). For lack of space, we omit the details of how scores are computed and combined and refer the interested reader to [6].

Suppose that no source covers the domain requested in the query. Consider for example the following query:

Query 2. Find sources that maintain Air data for Toronto and New York for the current year.

select s
from Source s, Scqd c in s.scqds, Qod q in c.qods
where c.type = "Air" and q.lcd >= [(city,{Toronto, NYC}),(time,CurrentYear)];

None of the three sources has data for both cities. Rather than giving an empty answer to this query, we consider combinations of sources as valid answers. In the next section, we discuss the computational issues involved in efficiently searching scqd's and in deriving answers using combinations (unions) of sources.

2 Each cd is represented by a literal $[(attr_1, dom_1), \ldots, (attr_k, dom_k)]$ in the query.

3. EFFICIENT MANIPULATION OF SCQD’S
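As a baseline before any access structure is introduced, the semantics of Query 1 can be evaluated by scanning every qod descriptor: apply the strict predicates, then rank the survivors by the fuzzy predicate. The Python sketch below does this for the descriptors of Example 1. It is illustrative only. The enumerated domains are hypothetical stand-ins, and the closeness function is an assumed scoring rule, since the actual score computation is deferred to [6].

```python
import math

# Hypothetical stand-in domains (the paper names them but does not list them).
CURRENT_YEAR = frozenset({2000})
CITY_IN_CANADA = frozenset({"Toronto", "Kingston", "Ottawa"})
YEAR_SINCE_1950 = frozenset(range(1950, 2001))

# (source, lcd, completeness, granularity in hours, soma), as listed in Example 1.
QODS = [
    ("S1", {"city": frozenset({"Toronto"}), "time": CURRENT_YEAR}, 1.0, 1,
     {"temperature", "pressure"}),
    ("S2", {"city": frozenset({"NYC"}), "time": CURRENT_YEAR}, 0.7, 4,
     {"temperature"}),
    ("S2", {"city": frozenset({"NYC"}), "time": CURRENT_YEAR}, 0.4, 12,
     {"pressure"}),
    ("S3", {"city": CITY_IN_CANADA, "time": YEAR_SINCE_1950}, 0.1, 24,
     {"temperature", "pressure"}),
]


def lcd_covers(lcd, wanted):
    """q.lcd >= wanted: every requested domain is contained in the lcd's domain."""
    return all(attr in lcd and dom <= lcd[attr] for attr, dom in wanted.items())


def closeness(value, target):
    """Assumed fuzzy score in (0, 1]; 1.0 means the value equals the target."""
    return 1.0 / (1.0 + abs(math.log2(value / target)))


def select_best(qods, wanted_lcd, wanted_soma, min_completeness, target_hours, k):
    """Strict filtering followed by fuzzy ranking; returns at most k source names."""
    best_score = {}
    for source, lcd, completeness, hours, soma in qods:
        strict_ok = (wanted_soma <= soma
                     and lcd_covers(lcd, wanted_lcd)
                     and completeness > min_completeness)
        if strict_ok:
            score = closeness(hours, target_hours)
            best_score[source] = max(score, best_score.get(source, 0.0))
    return sorted(best_score, key=best_score.get, reverse=True)[:k]


toronto_now = {"city": frozenset({"Toronto"}), "time": CURRENT_YEAR}
strict = select_best(QODS, toronto_now, {"temperature"}, 0.6, 1, 2)
no_cutoff = select_best(QODS, toronto_now, {"temperature"}, 0.0, 1, 2)
```

With the completeness values as listed in Example 1, the 0.6 cutoff leaves only S1 in `strict`, while `no_cutoff` ranks S1 ahead of S3 by granularity. A scan of this kind touches every descriptor for every query.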

Given the rich metadata model for scqd's, we must provide efficient access structures to search the scqd's and to identify the relevant sources. Then we consider computational issues in combining sources to answer queries.

3.1 Grouping Scqd’s Using Partially Ordered Sets

We group scqd's using lcd, the content description for some source, and the set of measure attributes soma. The scqd's are grouped into maximal compatibility classes, where each class is induced by the equality relation applied to the lcd (or soma) of the scqd. We refer to each class that is induced as a bucket.
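The grouping step amounts to partitioning the descriptors by equality of their lcd. A minimal Python sketch, using the three sources of Example 1 and ignoring soma, is given below. The lcds are written as symbolic tuples, and the function name is hypothetical.

```python
from collections import defaultdict

# (source, scqd, lcd of one qod descriptor), taken from Example 1.
# Domains are kept symbolic here because only equality of lcds matters.
ENTRIES = [
    ("S1", "scqd11", (("city", "{Toronto}"), ("time", "CurrentYear"))),
    ("S2", "scqd21", (("city", "{NYC}"), ("time", "CurrentYear"))),
    ("S2", "scqd21", (("city", "{NYC}"), ("time", "CurrentYear"))),
    ("S3", "scqd31", (("city", "CityInCanada"), ("time", "YearSince1950"))),
]


def group_into_buckets(entries):
    """Map each distinct lcd (the bucket descriptor) to its (source, scqd) pairs."""
    buckets = defaultdict(set)
    for source, scqd, lcd in entries:
        buckets[lcd].add((source, scqd))
    return dict(buckets)


buckets = group_into_buckets(ENTRIES)
```

The two qod descriptors of S2 share one lcd, so they fall into a single bucket whose contents are {(S2, scqd21)}.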

Example 2. In addition to the scqd's presented in Example 1 consider the following:

Air(time, city, temperature, pressure)

S4
scqd41: (Air, [(city, CityInCanada), (time, YearSince1990)], {qod411})
qod411: ([(city, {Toronto}), (time, CurrentYear)], 0.9, 5 days, , (time, 2 hours), {temperature, pressure})

S5
scqd51: (Air, [(city, CityInCanada), (time, CurrentYear)], {qod511, qod512})
qod511: ([(city, {Kingston}), (time, CurrentYear)], 0.7, 7 days, , (time, 4 hours), {temperature})
qod512: ([(city, CityInCanada), (time, CurrentYear)], 0.1, 28 days, , (time, 24 hours), {pressure})

S6
scqd61: (Air, [(city, CityInCanada), (time, CurrentYear)], {qod611, qod612})
qod611: ([(city, {Kingston}), (time, CurrentYear)], 0.8, 2 days, , (time, 12 hours), {temperature})
qod612: ([(city, CityInCanada), (time, CurrentYear)], 0.1, 28 days, , (time, 24 hours), {pressure})

Table 1 presents the buckets for these scqd's induced by the relation equals. To simplify our example, we do not consider the soma at present. The bucket descriptor identifies the relevant scqd metadata for the bucket, and the contents identify the sources and their scqd's. Consider the following query:

Query 3. Find sources containing Air data for Toronto for the current year.

select s
from Source s, Scqd c in s.scqds, Qod q in c.qods
where c.type = "Air" and q.lcd >= [(city,{Toronto}),(time,CurrentYear)]

The relevant sources for this query will have a local content description lcd that is greater than or equal to [(city,{Toronto}),(time,CurrentYear)]. We can identify the bucket B31 (and sources S1 and S4), which exactly matches the lcd. Buckets B21 and B11 are also relevant, although the quality of data in the corresponding sources (see qod311 (S3), qod511 (S5) and qod611 (S6)) is much worse.

While grouping the scqd's into buckets can reduce the search space, the number of buckets can still be very large. This is especially true as we combine sources to answer queries. The expression for the number of buckets will typically be dominated by the number of distinct combinations of values for lcd, the content descriptions for the sources. A further reduction in the search space of buckets can be obtained by partially ordering the buckets into a partially ordered set (po-set).

We consider a relation includes on bucket descriptors. We say that bucket descriptor $bd_i = [(Att_{i1}, Dom_{i1}), \ldots, (Att_{in}, Dom_{in})]$ includes bucket descriptor $bd_j = [(Att_{j1}, Dom_{j1}), \ldots, (Att_{jn}, Dom_{jn})]$ when the domains of the corresponding attributes, matched by subscripts, satisfy $Dom_{j1} \subseteq Dom_{i1}, \ldots, Dom_{jn} \subseteq Dom_{in}$. For example, the bucket descriptor B11 includes the bucket descriptor B31 in Example 2. Using this relation includes, we can construct the po-set of Figure 2 for the buckets in Example 2. The directed arrow between the buckets B11 and B21 indicates that the bucket descriptor B11 includes the bucket descriptor B21. A superbucket consists of all the buckets at the same layer of the po-set that cannot be ordered by using the relation includes. For example, superbucket SB1 has buckets B11 and B12.

Formally, P = (SB, includes) is a partially ordered set on a set of buckets SB under the relation includes. Denote by m the height (footnote 3) of P. The po-set yields an ordered set where SB1, ..., SBm are named superbuckets, and correspond to a partition of SB. Each superbucket contains a set of incomparable buckets. The buckets characterized by the most general bucket descriptors and the sources that support them are in the last superbucket of the sequence (or bottom-most superbucket in Figure 2). Note that a bucket descriptor bdi is more general than the bucket descriptor bdj when bdi includes bdj.

Using the po-set, we can search the superbuckets to find the relevant buckets and sources for some predicate P, e.g., q.lcd >= [(city,{Toronto}),(time,CurrentYear)]. We first scan the po-set in a top-down fashion to find the first superbucket that contains a bucket with a bucket descriptor that is greater than or exactly equals this lcd. This is superbucket SB1 in Figure 2. Then the relevant buckets for P are all the buckets in SB1 whose bucket descriptors are >= [(city,{Toronto}),(time,CurrentYear)], and all the buckets in the upper superbuckets whose bucket descriptors are also >= this lcd. In Figure 2, the relevant buckets for this predicate are B11, B21 and B31.
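The includes relation and the layering into superbuckets can be sketched in a few lines of Python. The sketch below uses the five bucket descriptors of Table 1 with hypothetical enumerated domains. It peels off the most general remaining buckets one layer at a time, and it collects relevant buckets by testing each descriptor layer by layer, which is a simplification of the scan described above. All function names are assumptions made for illustration.

```python
# Hypothetical stand-in domains for the named domains of Table 1.
CANADA = frozenset({"Toronto", "Kingston", "Ottawa"})
THIS_YEAR = frozenset({2000})
SINCE_1950 = frozenset(range(1950, 2001))

BUCKETS = {
    "B11": {"city": CANADA, "time": SINCE_1950},
    "B12": {"city": frozenset({"NYC"}), "time": THIS_YEAR},
    "B21": {"city": CANADA, "time": THIS_YEAR},
    "B31": {"city": frozenset({"Toronto"}), "time": THIS_YEAR},
    "B32": {"city": frozenset({"Kingston"}), "time": THIS_YEAR},
}


def includes(bd_i, bd_j):
    """bd_i includes bd_j when every domain of bd_j is a subset of bd_i's."""
    return bd_i.keys() == bd_j.keys() and all(bd_j[a] <= bd_i[a] for a in bd_i)


def superbuckets(buckets):
    """Partition buckets into layers; layer 1 holds the most general descriptors."""
    remaining = dict(buckets)
    layers = []
    while remaining:
        # A bucket is maximal if no other remaining bucket strictly includes it.
        layer = sorted(
            name for name, bd in remaining.items()
            if not any(other != name and obd != bd and includes(obd, bd)
                       for other, obd in remaining.items())
        )
        layers.append(layer)
        for name in layer:
            del remaining[name]
    return layers


def relevant_buckets(buckets, layers, wanted):
    """Buckets whose descriptor includes the requested lcd, in layer order."""
    return [name for layer in layers for name in layer
            if includes(buckets[name], wanted)]


layers = superbuckets(BUCKETS)
query_lcd = {"city": frozenset({"Toronto"}), "time": THIS_YEAR}
hits = relevant_buckets(BUCKETS, layers, query_lcd)
```

On this data the layers come out as {B11, B12}, {B21} and {B31, B32}, and the buckets relevant to the Toronto predicate are B11, B21 and B31, in agreement with the text.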

3 The height of a po-set is the cardinality of the largest subset SB' of SB such that every two buckets in SB' are comparable in P.

Bucket | Bucket Descriptor (lcd descriptor) | Bucket Contents
B11 | [(city,CityInCanada),(time,YearSince1950)] | {(S3,scqd31)}
B12 | [(city,{NYC}),(time,CurrentYear)] | {(S2,scqd21)}
B21 | [(city,CityInCanada),(time,CurrentYear)] | {(S5,scqd51),(S6,scqd61)}
B31 | [(city,{Toronto}),(time,CurrentYear)] | {(S1,scqd11),(S4,scqd41)}
B32 | [(city,{Kingston}),(time,CurrentYear)] | {(S5,scqd51),(S6,scqd61)}

Table 1: Buckets for the scqd's in Examples 1 and 2

3.2 Computational Issues in Accessing Combination of Scqd’s

In the previous section, we discussed how the po-sets can be used to find relevant buckets. Using those techniques, if a single source cannot be found to be relevant, then the query will fail, and no sources will be selected. However, it is possible that a combination (union) of sources may be relevant. We describe a motivating example. We then discuss naive solutions to this problem. We then propose a heuristic that improves on these solutions.

[Figure 2 shows three layers: superbucket SB3 with buckets B31 and B32, superbucket SB2 with bucket B21, and superbucket SB1 with buckets B11 and B12. Arrows labeled "include" connect buckets in adjacent superbuckets.]

Figure 2: A partially ordered set (po-set) for the buckets in Example 2

Consider the following query:

Query 4. Find the best n sources that maintain information for the temperature in Toronto, Kingston and NYC. Relevant sources must maintain 50% of all the data.

select best n s
from Source s, Scqd c in s.scqds, Qod q in c.qods
where c.type = "Air" and q.soma = {"temperature"}
  and q.lcd = [(city,{Toronto,Kingston,NYC})]
  and q.completeness >= 0.5;

Using the scqd's in Examples 1 and 2, the union of sources S1 (qod111), S2 (qod211) and S6 (qod611), together, can satisfy the query. Note that the combinations of sources S2 (qod211) and S5 (qod512), or S2 (qod211) and S6 (qod612), or S2 (qod211) and S3 (qod311), also contain data for the temperature in Toronto, Kingston and NYC. However, they are not selected because they do not maintain at least 50% of the data.

Although a union of the sources S1, S2 and S6 satisfies the query, the partially ordered set of buckets in Figure 2 does not include a relevant bucket for the lcd [(city,{Toronto,Kingston,NYC})], and the query will fail. With n sources, there are $2^n$ possible relevant combinations of scqd's that must be considered. Thus, the problem is generating some of the relevant combinations without exploring the whole space of combinations.

We first propose a naive solution to the problem of finding combinations of sources. We can use the po-set to find all buckets that are associated in the includes relationship with the lcd [(city,{Toronto,Kingston,NYC})]. We can then generate the power set for these buckets, and determine if any element of the power set produces a relevant bucket descriptor. We consider the power set since we wish to enumerate all combinations of sources, but we wish to minimize the number of sources needed to provide an answer.

In our example, the buckets B31, B32 and B12, corresponding to lcds [(city,{Toronto})], [(city,{Kingston})] and [(city,{NYC})], respectively, will be selected. Note that sources in buckets B11 and B21, with lcds [(city,CityInCanada),(time,YearSince1950)] and [(city,CityInCanada),(time,CurrentYear)], respectively, should also be considered. However, they do not satisfy the requirement on completeness. To simplify our example, we do not consider these buckets. After generating the power set, the relevant sources will be the following combinations (unions) of sources: {S1,S2,S6}; {S1,S2,S5}; {S4,S2,S6} and {S4,S2,S5}.

The disadvantage of the naive solution to the problem is that a large number of descriptors will be derived. This number is exponential in the cardinality of the domains associated with the attributes in the original lcd.

An alternative solution to this problem is to gradually extend the partially ordered set towards a lattice [10], by including buckets for additional scqd's that can be derived from the combination of some given scqd's, and to use this lattice to identify the relevant buckets. A (complete) lattice ensures that for any set of the buckets in superbucket i, there is a bucket with a more general bucket descriptor in superbucket i - 1 and a bucket with a more specific bucket descriptor in superbucket i + 1, if the scqd's and bucket descriptors associated with these buckets exist. The lattice solution would also allow us to find combinations of sources that are relevant for a query, but it could also lead to an exponential explosion in the number of buckets in the lattice.
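The naive enumeration over the power set can be sketched as follows for the three candidate buckets B31, B32 and B12. This Python sketch is illustrative only: the function names are hypothetical, only the city dimension is modelled, and the completeness requirement is assumed to have been applied already when the candidate buckets were chosen.

```python
from itertools import combinations, product

# Candidate buckets: (city domain of the lcd, sources in the bucket), per Table 1.
CANDIDATES = {
    "B31": (frozenset({"Toronto"}), ["S1", "S4"]),
    "B32": (frozenset({"Kingston"}), ["S5", "S6"]),
    "B12": (frozenset({"NYC"}), ["S2"]),
}


def covering_bucket_sets(buckets, wanted):
    """Enumerate subsets of buckets by size; keep minimal ones covering `wanted`."""
    found = []
    names = sorted(buckets)
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            covered = frozenset().union(*(buckets[b][0] for b in subset))
            # Skip supersets of an already accepted, smaller covering set.
            is_minimal = not any(set(prev) <= set(subset) for prev in found)
            if wanted <= covered and is_minimal:
                found.append(subset)
    return found


def source_combinations(buckets, bucket_sets):
    """Pick one source from each bucket of every covering set."""
    combos = set()
    for subset in bucket_sets:
        for choice in product(*(buckets[b][1] for b in subset)):
            combos.add(frozenset(choice))
    return combos


wanted_cities = frozenset({"Toronto", "Kingston", "NYC"})
bucket_sets = covering_bucket_sets(CANDIDATES, wanted_cities)
answers = source_combinations(CANDIDATES, bucket_sets)
```

Here only the full set of three buckets covers the requested cities, and it expands to the four source combinations listed above. The outer loop still visits every non-empty subset of the candidate buckets, which is the exponential cost that motivates the alternatives.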

A second alternative is a heuristic that partially constructs the (complete) lattice, i.e., it only adds some select buckets of the complete lattice to the original po-set of Figure 2. The heuristic is to add a bucket with a most general bucket descriptor, where the most general bucket descriptor includes the bucket descriptors of all the scqd's exported by the sources.

3.3 Finding Best Sources Using the Partially Constructed Lattice

The partially constructed lattice may reduce the complexity of search when a combination of scqd's is required to answer the query. However, use of this heuristic can lead to problems related to loss of accuracy of the scqd's. Consider the following query:

Query 5. Find sources providing temperature data for Toronto and New York. Relevant sources must be ranked according to their recency.

select best s
from Source s, Scqd c in s.scqds, Qod q in c.qods
where c.type = "Air" and q.soma = {"temperature"} and q.lcd [(city,{Toronto,NYC})] and q.recency

REFERENCES

[5] K.-T. Huang, Y. W. Lee, and R. Y. Wang. Quality Information and Knowledge. Prentice-Hall, 1998.
[6] G. Mihaila. Publishing, Locating, and Querying Networked Information Sources. PhD thesis, University of Toronto, 2000.
[7] G. Mihaila, L. Raschid, and A. Tomasic. Equal Time for Data on the Internet with WebSemantics. In Proceedings of the 6th International Conference on Extending Database Technology (EDBT), pages 87-101, Valencia, Spain, March 1998.
[8] G. Mihaila, L. Raschid, and M. E. Vidal. Querying "quality of data" metadata. In Proceedings of the Third IEEE Meta-Data Conference, Bethesda, Maryland, April 1999.
[9] National geophysical data centre (NGDC). http://ngdc.noaa.gov/.
[10] Preparata-Yeh. Introduction to Discrete Structures. Addison-Wesley, 1973.
[11] World data centre for greenhouse gases (WDCGG). http://jcdc.kishou.go.jp/wdcgg.html.
