Knowledge Acquisition from Distributed, Autonomous, Semantically Heterogeneous Data and Knowledge Sources (KADASH)

Proceedings of a Workshop held in Conjunction with the 2005 IEEE International Conference on Data Mining

Edited by
Doina Caragea, Iowa State University, USA
Vasant Honavar, Iowa State University, USA
Ion Muslea, Language Weaver, Inc., USA
Raghu Ramakrishnan, University of Wisconsin-Madison, USA

Houston, USA, November 27, 2005
ISBN 0-9738918-4-X
Knowledge Acquisition from Distributed, Autonomous, Semantically Heterogeneous Data and Knowledge Sources
Proceedings of a Workshop held in Conjunction with the Fifth IEEE International Conference on Data Mining 2005
Houston, Texas, November 27th, 2005
Table of Contents

Workshop Committee ........ 4
Foreword ........ 5
Supporting Query-driven Mining over Autonomous Data Sources ........ 7
    Seung-won Hwang
Combining Document Clusters Generated from Syntactic and Semantic Feature Sets using Tree Combination Methods ........ 16
    Mahmood Hossain, Susan Bridges, Yong Wang, and Julia Hodges
Automatically Extracting Subsequent Response Pages from Web Search Sources ........ 25
    Dheerendranath Mundluru, Zonghuan Wu, Vijay Raghavan, Weiyi Meng, Hongkun Zhao
Collaborative Package-Based Ontology Building and Usage ........ 35
    Jie Bao and Vasant Honavar
OntoQA: Metric-Based Ontology Quality Analysis ........ 45
    Samir Tartir, I. Budak Arpinar, Michael Moore, Amit P. Sheth, Boanerges Aleman-Meza
A Heuristic Query Optimization for Distributed Inference on Life-Scientific Ontologies ........ 54
    Takahiro Kosaka, Susumu Date, Hideo Matsuda, Shinji Shimojo
Knowledge Acquisition from Distributed, Autonomous, Semantically Heterogeneous Data and Knowledge Sources

Organizing Committee
Doina Caragea - Iowa State University
Vasant Honavar - Iowa State University
Ion Muslea - Language Weaver, Inc.
Raghu Ramakrishnan - University of Wisconsin-Madison
Program Committee
Naoki Abe - IBM
Liviu Badea - ICI, Romania
Doina Caragea - Iowa State University
Marie desJardins - University of Maryland, Baltimore County
C. Lee Giles - Pennsylvania State University
Vasant Honavar - Iowa State University
Hillol Kargupta - University of Maryland, Baltimore County
Sally McClean - University of Ulster at Coleraine
Bamshad Mobasher - DePaul University
Ion Muslea - Language Weaver, Inc.
C. David Page Jr. - University of Wisconsin-Madison
Alexandrin Popescul - Ask Jeeves, Inc.
Raghu Ramakrishnan - University of Wisconsin-Madison
Steffen Staab - University of Koblenz
Acknowledgements

Our thanks go to the authors, the program committee members, and IEEE ICDM05 all-workshops chair Pawan Lingras for their contributions, service, and support. Thanks also go to Jie Bao, Facundo Bromberg, Cornelia Caragea, Dae-Ki Kang, Neeraj Koul, Jyotishman Pathak, Oksana Yakhnenko, and Flavian Vasile, students in the AI Lab at ISU, for their help with the paper reviews.
2005 IEEE ICDM Workshop on KADASH
4
Foreword

Recent advances in high performance computing, high speed and high bandwidth communication, massive storage, and software (e.g., web services) that can be remotely invoked on the Internet present unprecedented opportunities for data-driven knowledge acquisition in a broad range of applications in virtually all areas of human endeavor, including collaborative cross-disciplinary discovery in e-science, bioinformatics, e-government, environmental informatics, health informatics, security informatics, e-business, education, and social informatics, among others. Given the explosive growth in the number and diversity of potentially useful information sources in many domains, there is an urgent need for sound approaches to integrative and collaborative analysis and interpretation of distributed, autonomous (and hence, inevitably semantically heterogeneous) data sources.

Machine learning offers some of the most cost-effective approaches to automated or semi-automated knowledge acquisition (discovery of features, correlations, and other complex relationships and hypotheses that describe potentially interesting regularities in large data sets) in many data-rich application domains. However, applying current approaches to machine learning in emerging data-rich application domains presents several challenges in practice:

• Centralized access to data (assumed by most machine learning algorithms) is infeasible because of the large size and/or access restrictions imposed by the autonomous data sources. Hence, there is a need for knowledge acquisition systems that can perform the necessary analysis of data at the locations where the data and the computational resources are available and transmit the results of the analysis (knowledge acquired from the data) to the locations where they are needed.
• Ontological commitments associated with a data source (that is, assumptions concerning the objects that exist in the world, the properties or attributes of the objects, the possible values of attributes, and their intended meaning) are determined by the intended use of the data repository (at design time). In addition, data sources that are created for use in one context often find use in other contexts or applications. Therefore, semantic differences among autonomous data sources are simply unavoidable. Because users often need to analyze data in different contexts from different perspectives, there is no single privileged ontology that can serve all users, or for that matter, even a single user, in every context. Effective use of multiple sources of data in a given context requires reconciliation of such semantic differences from the user's point of view.

• Explicitly associating ontologies with data repositories results in partially specified data, i.e., data that are described in terms of attribute values at different levels of abstraction. For example, the program of study of a student in a data source can be specified as Graduate Program (higher level of abstraction), while the program of study of a different student in the same data source (or even a different data source) can be specified as Doctoral Program (lower level of abstraction).
The workshop brings together researchers in relevant areas of artificial intelligence (machine learning, data mining, knowledge representation, ontologies), information systems (information integration, databases, semantic Web), distributed computing (service-oriented computing) and selected application areas (e.g., bioinformatics, security informatics, environmental informatics) to address several questions such as: • What are some of the research challenges presented by emerging data-rich application domains such as bioinformatics, health informatics, security informatics, social informatics, environmental informatics? • How can we perform knowledge discovery from distributed data (assuming different types of data fragmentation, e.g., horizontal or vertical data fragmentation; different hypothesis classes, e.g., naïve Bayes, decision tree, support vector machine classifiers; different performance criteria, e.g., accuracy versus complexity versus reliability of the model generated, etc.)? • How can we make semantically heterogeneous data sources self-describing (e.g., by explicitly associating ontologies with data sources and mappings between them) in order to help collaborative scientific discovery from autonomous information sources? • How can we represent, manipulate, and reason with ontologies and mappings between ontologies? • How can we learn ontologies from data (e.g., attribute value taxonomies)? • How can we learn mappings between semantically heterogeneous data source schemas and between their associated ontologies? • How can we perform knowledge discovery in the presence of ontologies (e.g., attribute value taxonomies) and partially specified data (data that are described at different levels of abstraction within an ontology)? • How can we achieve online query relaxation when an initial query posed to the data sources fails (i.e., returns no tuples)? 
That is, how do we perform a query-driven mining of the individual sources that will result in knowledge that can be used for query relaxation?
Supporting Query-driven Mining over Autonomous Data Sources

Seung-won Hwang
Department of Computer Science and Engineering
Pohang University of Science and Technology
[email protected]
Abstract

As more and more data sources become accessible, mining relevant results has become a clear challenge. In particular, this paper studies how to support such mining over autonomous sources with Boolean query capabilities, such as most Web e-commerce sites like Amazon.com, which have recently become dominant sources of information. This task is challenging: the notion of relevance is user-specific, in that desirable results may differ across users (and their contexts). Meanwhile, due to the autonomous nature of such sources, it is becoming more and more unreasonable to expect each user to know enough about the underlying data to construct a complete query describing her specific information needs. We thus develop "query completion" techniques to complement an incomplete initial query so as to retrieve user-specific relevant results of a certain desirable size. Such techniques are challenging to develop, as they must adapt to user-specific information needs to be effective. To address this challenge, we develop user-adaptive query completion techniques that dynamically capture the user-specific information needs, based on which we map the criteria into source queries complementing the initial query. Further, as sources often have limited query capabilities, we make this mapping capability-aware so that it generates processable queries. Lastly, we develop cost-aware query scheduling to identify a cost-effective schedule for executing the identified queries. Our experimental results validate the efficiency and effectiveness of our approach.
1. Introduction
As Internet usage explodes, users are provided with access to data sources containing large amounts of data, e.g., realtor.com, which lists all on-sale houses in the US. The problem of mining only the relevant results by
[Figure 1 shows a realtor.com result page for CHAMPAIGN, IL 61821: "There are a total of 434 properties listed in the area selected," with sign-up/sign-in and save-search widgets, search-modification controls (price range, beds, baths), and a sample listing: $84,900, 4 Bed, 2 Bath, 1,809 Sq. Ft., Single Family Property, Subdivision: MEADOW PARK, County: CHAMPAIGN, 45 years old, garage, central air (RE/MAX REALTY-CHAMPAIGN).]
Figure 1. Example initial query.

querying such sources has thus become critical, while it is challenging as the notion of relevance may differ across users and their contexts. For instance, in searching for an ideal house, someone with a large family wants to know about large houses, while another may prefer a cozy small place of her own. While each may formulate a Boolean query to express her specific mining needs, e.g., using a Web interface, such querying is often incomplete, as Example 1 illustrates.

Example 1: Consider user Amy, who is looking for a house in Champaign, IL, that is just the right size for her large family of four, yet within budget. To find such a house, she may submit an initial query Q1 below to an accessible data source, e.g., realtor.com, as Figure 1 illustrates.

select * from house where city = 'Champaign' and price <= … and beds >= 4 and baths >= 2 (Query Q1)

As Figure 1 demonstrates, this query returns only one house, which is too few for this user to make a well-informed decision. Now, knowing that her initial criteria were too good to be true, she may want to relax her query (e.g., by relaxing the acceptable price range), which can easily return too many results and
overwhelm Amy, making it hard for her to isolate the desirable results.
A query completion technique must therefore be able to capture the user-specific information needs and adapt query completion to the specific user.
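The relaxation dilemma in Example 1 (a strict initial query returns one house; crude relaxation floods the user) can be pictured as a loop over user-prioritized relaxation steps. Everything below (the in-memory listings, the predicates, the relaxation order, and the k_min threshold) is an illustrative assumption, not the paper's actual system:

```python
# Illustrative sketch: progressively relax an over-constrained Boolean query
# until the result set reaches a manageable size. The listings, predicates,
# and relaxation order are hypothetical examples.

HOUSES = [
    {"city": "Champaign", "price": 84900,  "beds": 4, "baths": 2},
    {"city": "Champaign", "price": 129000, "beds": 3, "baths": 2},
    {"city": "Champaign", "price": 99500,  "beds": 2, "baths": 1},
    {"city": "Urbana",    "price": 110000, "beds": 4, "baths": 2},
]

def run_query(preds):
    """Evaluate a conjunctive Boolean query over the table."""
    return [h for h in HOUSES if all(p(h) for p in preds.values())]

def complete_query(preds, relax_steps, k_min=2):
    """Relax one predicate at a time until at least k_min results return."""
    results = run_query(preds)
    for attr, relaxed in relax_steps:
        if len(results) >= k_min:
            break
        preds[attr] = relaxed          # e.g., beds >= 4 relaxed to beds >= 2
        results = run_query(preds)
    return results

q1 = {
    "city":  lambda h: h["city"] == "Champaign",
    "price": lambda h: h["price"] <= 90000,
    "beds":  lambda h: h["beds"] >= 4,
    "baths": lambda h: h["baths"] >= 2,
}
# A price-sensitive user relaxes beds before price:
steps = [("beds",  lambda h: h["beds"] >= 2),
         ("price", lambda h: h["price"] <= 130000)]
print(len(complete_query(q1, steps)))
```

Under this ordering the loop first relaxes beds (still too few results within budget) and only then widens the price range, mirroring the user-adaptive choice of which attribute to relax first.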
As Example 1 illustrated, it is becoming more and more unreasonable to expect users to formulate a complete query describing their information needs, as users are often unaware of the underlying data distributions (or are even unsure of their information needs until they explore some results). This paper thus studies how to automatically complement such incomplete initial queries. In particular, we study "query completion" techniques that broaden the query results when the initial query returns too few and narrow them down when it returns too many. However, such query completion is challenging, as these techniques should adapt to user-specific information needs. To illustrate, suppose we relax the initial query in Example 1. First, it is challenging to identify the attribute(s) to relax, e.g., relaxing beds >= 4 to beds >= 2, so as to maximize the effectiveness of the relaxation for the specific user. For a user who is unwilling to negotiate on price, relaxing the acceptable price range to include more expensive alternatives is extremely ineffective, generating only alternatives this user will never consider. Rather, for this user it would be more effective to relax on beds, suggesting a smaller house with a much more appealing price tag. Second, even after we decide which attributes to relax, it is challenging to generate a relaxed query Q′ that (1) can be processed by the underlying source and yet (2) returns a manageable amount of results, which requires knowledge of both the query capability and the data distribution of the underlying source. To illustrate, in Example 1, before considering to relax price

T and K > 1.

Definition 2: Invalid Response Page and Valid Response Page: The response page returned for an invalid query is an Invalid Response Page (IRP). The response page returned for an ideal query is a Valid Response Page (VRP).
CSI takes the URL of a Web search source as input and outputs a set of candidate SPPs, which form a small subset of all hyperlinks found in a VRP sampled from the input Web source. Each candidate SPP is initialized with a score of 1.0. We use the example pages in Figure 3 as a running example to explain the following steps in CSI:

(1) Probe query generation. First, we need to automatically generate three probe queries corresponding to the input Web source. These probe queries include one invalid query (qi) and two distinct ideal queries (qid1, qid2). In our running example, 'AnI248mpossibleq73uery', 'Java', and 'XML' are the three probe queries corresponding to qi, qid1, and qid2. Automatic generation of qi is explained in [3]. In this step, we present the Ideal Query Generator (IQG), a module for automatically generating qid1 and qid2. IQG selects, from a set of candidate queries, the best queries, i.e., those that produce the largest response pages. For general-purpose SEs like Google, every non-stopword can produce reasonably good response pages, so every non-stopword can be considered a candidate query. But this does not work for many specialized SEs (e.g., the local SE at ieee.org). We only consider one-term queries as candidate queries. The initial candidate query terms are non-stopword terms coming from three sources: the cover page of the Web source, the title of the cover page (text between the <title> and </title> tags), and the URL of the cover page. The rationale is that the cover page and title often contain descriptions (e.g., a company's name and its products/services) that are present in several documents indexed by the local SE, while the URL contains very few, highly important terms that are also present in several documents indexed by the local SE.
The cover page may contain too many terms, and not all of them can produce a good response page; checking them all can also be inefficient. Thus we keep only those terms that are most likely to appear in a document. A word frequency list from [1] is used to measure the popularity of terms, thereby deleting the less frequent terms. We need adequate candidate queries to generate the required ideal queries, so we set the size of the candidate query set to twice the number of required ideal queries. If the cover page of a source is too simple, such that enough candidate queries cannot be generated, we add terms with the highest frequencies from the word frequency list [1] to ensure that adequate candidate queries are available. We then send the candidate queries to the source; those candidate queries with the largest response page sizes are kept as ideal queries.

[Figure 3: each panel shows navigation links Home, Services, Downloads, Help.
(a) IRP returned for query 'AnI248mpossibleq73uery': "No results found for AnI248mpossibleq73uery".
(b) VRP returned for query 'Java': suggestions Java Careers, Java apps, See more; results "1. Java Tutorial - Gives an introduction to core Java …", "2. James Gosling - Java Inventor - Gosling was the main architect of Java …"; page links 1, 2, 3, Next.
(c) VRP returned for query 'XML': suggestions Parsers, XML Schema, See more; results "1. XML - Learn more about XML at Global …", "2. XML Spy - XML SPY is a powerful XML tool for …"; page links 1, 2, 3, 4, Next.]

Figure 3. Three example response pages.

(2) SE connection. Let IRP, VRP1, and VRP2 be the response pages returned for the probe queries qi, qid1, and qid2, respectively. As mentioned in Section 2, to fetch these response pages, we use the SE connection component of SELEGO
[6]. In our example, the response pages in Figures 3a, 3b, and 3c correspond to IRP, VRP1, and VRP2.

(3) Initial candidate SPP list generation. The initial list of candidate SPPs (lc) is generated by extracting only those hyperlinks in VRP1 whose captions match the captions of hyperlinks in VRP2. In our example, by matching the captions of hyperlinks in Figures 3b and 3c, we get lc = {Home, Services, Downloads, Help, See more, 2, 3, Next}. The main reason for using captions rather than URLs as the feature for extracting candidate SPPs is that in almost all the Web search sources we encountered, the URLs of the SPPs are query dependent, i.e., the query is included in the URL of the SPP. Hence, the URLs of the SPPs in two different VRPs returned for two different queries will not match and would not be included in the list of candidate SPPs. Therefore, this simple technique of using captions as the feature for identifying candidate SPPs allows us to include some or all of the actual target SPPs among the candidate SPPs. This step also gets rid of all other hyperlinks, such as result links, advertisement links, and suggestion links, which are mostly query dependent and usually form the majority of hyperlinks in a Web page. Therefore, this step usually generates only a few hyperlinks as candidate SPPs, which are further pruned in the next step.

(4) Final candidate SPP list generation. Static links (ls) are those hyperlinks that appear in both the IRP and a VRP. We observed that in most Web sources, Website navigation links and other miscellaneous links, such as Terms of Use, form the static links, which can be used for further pruning the candidate SPPs. In our example, by extracting the common hyperlinks in Figures 3a and 3b, we get ls = {Home, Services, Downloads, Help}. Note that ls forms a subset of lc. Since ls can never contain the actual target SPPs, we remove its members from lc to get the final list of candidate SPPs, i.e., lc = lc - ls. In our example, lc = {Home, Services, Downloads, Help, See more, 2, 3, Next} - {Home, Services, Downloads, Help} = {See more, 2, 3, Next}. We next initialize each candidate SPP in the final list with an initial score (sc) of 1.0, which may be further adjusted by the subsequent steps. The final score of a candidate SPP is its measure of likelihood of being an actual target SPP.
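Steps (3) and (4) reduce to two set operations over hyperlink captions. A minimal sketch, with the running example's hyperlink captions hard-coded as assumed inputs (a real implementation would parse them out of the response pages' HTML):

```python
# Sketch of initial/final candidate SPP list generation (steps 3-4).
# Hyperlinks are represented here simply by their captions.

def candidate_spps(vrp1_links, vrp2_links, irp_links):
    # Step 3: captions appearing in both VRPs are candidate SPPs.
    lc = [cap for cap in vrp1_links if cap in set(vrp2_links)]
    # Step 4: links shared with the IRP are static links; prune them.
    ls = set(vrp1_links) & set(irp_links)
    lc = [cap for cap in lc if cap not in ls]
    # Each surviving candidate starts with a score of 1.0.
    return {cap: 1.0 for cap in lc}

vrp1 = ["Home", "Services", "Downloads", "Help", "See more",
        "Java Tutorial", "James Gosling - Java Inventor", "2", "3", "Next"]
vrp2 = ["Home", "Services", "Downloads", "Help", "See more",
        "XML", "XML Spy", "2", "3", "4", "Next"]
irp  = ["Home", "Services", "Downloads", "Help"]

print(candidate_spps(vrp1, vrp2, irp))
# candidates: See more, 2, 3, Next (each scored 1.0)
```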
4.2. Label matching heuristic (LM)

LM takes the list of candidate SPPs and their corresponding scores as input and outputs the same list with scores that may have been adjusted. LM uses a predetermined list of SPP labels collected from two hundred Web sources. LM matches the caption of each candidate SPP against the collected labels. If the caption of a candidate SPP matches a label exactly, then we decrease its score by a constant parameter α (α ∈ [0, 1]). Candidate SPPs whose scores are decreased are considered more likely to be the target SPPs. Sometimes, if a specific Web source uses entirely new captions for its SPPs, then none of the captions of the candidate SPPs match the predetermined labels; in this case, the scores remain the same. However, subsequent steps can still identify target SPPs, and when a target SPP is identified, its caption is extracted and added to the predetermined list of labels. This way, newly learned labels may be useful when a new Web source uses similar captions for its SPPs.
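As a sketch, the LM heuristic is a single pass over the candidate scores. The label list and the multiplicative α = 0.5 below are illustrative assumptions (the paper leaves α as a tunable parameter, set in Section 6):

```python
# Sketch of the label matching (LM) heuristic: scale a candidate's
# score by alpha when its caption exactly matches a known SPP label.
# KNOWN_SPP_LABELS and alpha = 0.5 are illustrative assumptions.

KNOWN_SPP_LABELS = {"next", "more", "see more", "next page", ">>"}

def lm_heuristic(scores, alpha=0.5):
    for caption in scores:
        if caption.lower() in KNOWN_SPP_LABELS:
            scores[caption] *= alpha   # lower score = more likely an SPP
    return scores

print(lm_heuristic({"See more": 1.0, "2": 1.0, "3": 1.0, "Next": 1.0}))
```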
4.3. Static links based heuristic (SL)

SL takes the list of candidate SPPs and their scores as input and outputs the same list with scores that may have been further adjusted. We use the static links (ls) found in Section 4.1 as a feature to further adjust the scores of the candidate SPPs. For each candidate SPP, we first download its corresponding page and save it in a repository to avoid downloading it again in subsequent steps. We then check whether the downloaded page contains all the static links (ls). If so, we update the score of that candidate SPP by a constant parameter β1 (β1 ∈ [0, 1]); otherwise, the score is unchanged. The motivation behind this heuristic is that all static links that appeared in the first response page (VRP1) also appear in subsequent response pages, and hence the scores of candidate SPPs whose pages contain all the static links are adjusted.
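A sketch of the SL heuristic in the same style; page fetching is replaced here by pre-fetched caption-to-links maps, and β1 = 0.5 is an illustrative assumption:

```python
# Sketch of the static links (SL) heuristic: a candidate SPP whose
# target page contains all static links gets its score scaled by beta1.
# Pages are passed in as caption -> set-of-link-captions maps; the real
# system downloads and caches each candidate's page.

def sl_heuristic(scores, pages, static_links, beta1=0.5):
    for caption, links in pages.items():
        if static_links <= links:      # page contains all static links
            scores[caption] *= beta1
    return scores

static = {"Home", "Services", "Downloads", "Help"}
pages = {
    "Next":     {"Home", "Services", "Downloads", "Help", "2", "3"},
    "See more": {"Home", "Help"},      # e.g., a suggestions page
}
print(sl_heuristic({"Next": 1.0, "See more": 1.0}, pages, static))
```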
4.4. Form widget based heuristic (FW)

Another feature common among the different response pages returned for a particular query by a Web search source is the form widget feature. We use form widgets (the type of form widget) and their names (the value of the 'name' attribute in the form widget's HTML tag) as a feature for further adjusting the scores of the input candidate SPPs. The form widgets our algorithm uses are text box, text area, check box, radio button, submit button, image button, drop-down box, and hidden tag (input tag of type hidden). Figure 4 illustrates the FW functionality. In Figure 4, we first find all the form widgets in the VRP, represented as wlVRP (line 1). If the VRP does not have any form widgets, we check whether at least one candidate SPP page (in the repository) has at least one form widget (lines 2-3). If so, we adjust the score (sc,i) of each candidate SPP that has no form widgets by a constant parameter β2 (β2 ∈ [0, 1]) (lines 4-6), whereas the scores of all candidate SPPs having at least one form widget are unchanged. The motivation is that if the VRP does not have any widgets, it is highly unlikely that subsequent pages will have any new widgets. Therefore, candidate SPPs without widgets have a higher chance of being the target SPPs (hence their scores are reduced) than those having at least
one widget. Note that if the VRP does not have any form widgets and none of the candidate SPPs has at least one form widget, then the scores of all candidate SPPs are unchanged. However, if the VRP does have some form widgets, then for each candidate SPP page we check whether all form widgets in the VRP also appear in the candidate SPP page with the same names. If so, we adjust the score of the candidate SPP by a constant parameter β3 (β3 ∈ [0, 1]), indicating that it is highly likely to be the target SPP (lines 10-12). If not all form widgets in the VRP appear in a candidate SPP page, the score of that candidate SPP is unchanged. The motivation behind this step is that all form widgets that appeared in the first response page (VRP1) also appear in subsequent response pages. Section 6 explains how the parameters α, β1, β2, and β3 are set.

Procedure: fwHeuristic()
1:  let wlVRP be the form widgets in VRP
2:  if |wlVRP| == 0 then
3:    if at least 1 lc,i has at least 1 widget then
4:      for each lc,i having no widgets do
5:        sc,i = sc,i * β2, where β2 ∈ [0, 1]
6:      end for
7:    end if
8:  else
9:    for each lc,i do
10:     if wlVRP appear in wlc,i with the same names then
11:       sc,i = sc,i * β3, where β3 ∈ [0, 1]
12:     end if
13:   end for
14: end if

Figure 4. Form widget heuristic.
4.5. Page structure similarity (PSS)

Another important feature common among the different response pages returned for a particular query by a Web search source is the page structure of the response pages itself. The page structure of a response page is reflected in its tag string, which is the concatenation of all opening tags in the page. According to the observation in Section 1, since all response pages returned by a Web search source are generated by the same program, it is intuitive that the program wraps the content of each response page in similar or identical HTML tags. Therefore, the tag strings of different response pages are considered similar or identical. To capture the similarity between the tag string of the VRP and the tag string of each candidate SPP page, we use a popular approximate string matching algorithm, Levenshtein Distance (LD) [7]. LD in our algorithm is defined as the smallest number of insertions, deletions, and substitutions of tags needed to convert one tag string into another. If
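The tag-string comparison can be sketched as a standard dynamic-programming Levenshtein distance over sequences of opening tags; the tokenization into tag lists and the example tag sequences are assumptions:

```python
# Sketch of the page structure similarity (PSS) step: compute the
# Levenshtein distance between two tag strings, treating each opening
# HTML tag as one symbol. Example tag sequences are illustrative.

def levenshtein(a, b):
    """Minimum insertions, deletions, and substitutions to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (x != y)))   # substitution
        prev = cur
    return prev[-1]

vrp_tags  = ["html", "head", "title", "body", "table", "tr", "td", "a"]
cand_tags = ["html", "head", "title", "body", "table", "tr", "td", "b"]
print(levenshtein(vrp_tags, cand_tags))   # one substitution
```

A small distance between the VRP's tag string and a candidate SPP page's tag string indicates a structurally similar page, and hence a likely target SPP.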