Knowledge Acquisition from Distributed, Autonomous, Semantically Heterogeneous Data and Knowledge Sources (KADASH)

Proceedings of a Workshop held in Conjunction with the 2005 IEEE International Conference on Data Mining

Edited by

Doina Caragea, Iowa State University, USA
Vasant Honavar, Iowa State University, USA
Ion Muslea, Language Weaver, Inc., USA
Raghu Ramakrishnan, University of Wisconsin-Madison, USA

Houston, USA, November 27, 2005

ISBN 0-9738918-4-X

Knowledge Acquisition from Distributed, Autonomous, Semantically Heterogeneous Data and Knowledge Sources. Proceedings of a Workshop held in Conjunction with the Fifth IEEE International Conference on Data Mining (ICDM 2005), Houston, Texas, November 27, 2005

Table of Contents

Workshop Committee .......................................................... 4
Foreword .................................................................... 5
Supporting Query-driven Mining over Autonomous Data Sources ................. 7
    Seung-won Hwang
Combining Document Clusters Generated from Syntactic and Semantic Feature Sets using Tree Combination Methods ................................... 16
    Mahmood Hossain, Susan Bridges, Yong Wang, and Julia Hodges
Automatically Extracting Subsequent Response Pages from Web Search Sources .. 25
    Dheerendranath Mundluru, Zonghuan Wu, Vijay Raghavan, Weiyi Meng, and Hongkun Zhao
Collaborative Package-Based Ontology Building and Usage ..................... 35
    Jie Bao and Vasant Honavar
OntoQA: Metric-Based Ontology Quality Analysis .............................. 45
    Samir Tartir, I. Budak Arpinar, Michael Moore, Amit P. Sheth, and Boanerges Aleman-Meza
A Heuristic Query Optimization for Distributed Inference on Life-Scientific Ontologies ................................................................ 54
    Takahiro Kosaka, Susumu Date, Hideo Matsuda, and Shinji Shimojo

Knowledge Acquisition from Distributed, Autonomous, Semantically Heterogeneous Data and Knowledge Sources

Organizing Committee

Doina Caragea, Iowa State University
Vasant Honavar, Iowa State University
Ion Muslea, Language Weaver, Inc.
Raghu Ramakrishnan, University of Wisconsin-Madison

Program Committee

Naoki Abe, IBM
Liviu Badea, ICI, Romania
Doina Caragea, Iowa State University
Marie desJardins, University of Maryland, Baltimore County
C. Lee Giles, Pennsylvania State University
Vasant Honavar, Iowa State University
Hillol Kargupta, University of Maryland, Baltimore County
Sally McClean, University of Ulster at Coleraine
Bamshad Mobasher, DePaul University
Ion Muslea, Language Weaver, Inc.
C. David Page Jr., University of Wisconsin-Madison
Alexandrin Popescul, Ask Jeeves, Inc.
Raghu Ramakrishnan, University of Wisconsin-Madison
Steffen Staab, University of Koblenz

Acknowledgements

Our thanks go to the authors, the program committee members, and the IEEE ICDM'05 all-workshop chair Pawan Lingras for their contributions, service, and support. Thanks also go to Jie Bao, Facundo Bromberg, Cornelia Caragea, Dae-Ki Kang, Neeraj Koul, Jyotishman Pathak, Oksana Yakhnenko, and Flavian Vasile, students in the AI Lab at ISU, for their help with the paper reviews.


Foreword

Recent advances in high performance computing, high speed and high bandwidth communication, massive storage, and software (e.g., web services) that can be remotely invoked on the Internet present unprecedented opportunities in data-driven knowledge acquisition across a broad range of applications in virtually all areas of human endeavor, including collaborative cross-disciplinary discovery in e-science, bioinformatics, e-government, environmental informatics, health informatics, security informatics, e-business, education, and social informatics, among others. Given the explosive growth in the number and diversity of potentially useful information sources in many domains, there is an urgent need for sound approaches to integrative and collaborative analysis and interpretation of distributed, autonomous (and hence, inevitably semantically heterogeneous) data sources.

Machine learning offers some of the most cost-effective approaches to automated or semi-automated knowledge acquisition (discovery of features, correlations, and other complex relationships and hypotheses that describe potentially interesting regularities in large data sets) in many data-rich application domains. However, applying current machine learning approaches in emerging data-rich application domains presents several challenges in practice:

• Centralized access to data (assumed by most machine learning algorithms) is infeasible because of the large size and/or access restrictions imposed by the autonomous data sources. Hence, there is a need for knowledge acquisition systems that can perform the necessary analysis of data at the locations where the data and the computational resources are available and transmit the results of the analysis (knowledge acquired from the data) to the locations where they are needed.

• Ontological commitments associated with a data source (that is, assumptions concerning the objects that exist in the world, the properties or attributes of the objects, the possible values of attributes, and their intended meaning) are determined by the intended use of the data repository (at design time). In addition, data sources that are created for use in one context often find use in other contexts or applications. Therefore, semantic differences among autonomous data sources are simply unavoidable. Because users often need to analyze data in different contexts from different perspectives, there is no single privileged ontology that can serve all users, or for that matter, even a single user, in every context. Effective use of multiple sources of data in a given context requires reconciliation of such semantic differences from the user's point of view.

• Explicitly associating ontologies with data repositories results in partially specified data, i.e., data that are described in terms of attribute values at different levels of abstraction. For example, the program of study of a student in a data source can be specified as Graduate Program (a higher level of abstraction), while the program of study of a different student in the same data source (or even a different data source) can be specified as Doctoral Program (a lower level of abstraction).


The workshop brings together researchers in relevant areas of artificial intelligence (machine learning, data mining, knowledge representation, ontologies), information systems (information integration, databases, semantic Web), distributed computing (service-oriented computing), and selected application areas (e.g., bioinformatics, security informatics, environmental informatics) to address several questions such as:

• What are some of the research challenges presented by emerging data-rich application domains such as bioinformatics, health informatics, security informatics, social informatics, and environmental informatics?

• How can we perform knowledge discovery from distributed data (assuming different types of data fragmentation, e.g., horizontal or vertical data fragmentation; different hypothesis classes, e.g., naïve Bayes, decision tree, or support vector machine classifiers; different performance criteria, e.g., accuracy versus complexity versus reliability of the model generated, etc.)?

• How can we make semantically heterogeneous data sources self-describing (e.g., by explicitly associating ontologies with data sources and mappings between them) in order to help collaborative scientific discovery from autonomous information sources?

• How can we represent, manipulate, and reason with ontologies and mappings between ontologies?

• How can we learn ontologies from data (e.g., attribute value taxonomies)?

• How can we learn mappings between semantically heterogeneous data source schemas and between their associated ontologies?

• How can we perform knowledge discovery in the presence of ontologies (e.g., attribute value taxonomies) and partially specified data (data that are described at different levels of abstraction within an ontology)?

• How can we achieve online query relaxation when an initial query posed to the data sources fails (i.e., returns no tuples)? That is, how do we perform a query-driven mining of the individual sources that will result in knowledge that can be used for query relaxation?


Supporting Query-driven Mining over Autonomous Data Sources

Seung-won Hwang
Department of Computer Science and Engineering
Pohang University of Science and Technology
[email protected]

Abstract

As more and more data sources become accessible, mining relevant results has become a clear challenge. In particular, this paper studies how to support such mining over autonomous sources with Boolean query capabilities, such as most Web e-commerce sites like Amazon.com, which have recently become dominant sources of information. This task is challenging: the notion of relevance is user-specific, in that desirable results may differ across users (and their contexts). Meanwhile, due to the autonomous nature of such sources, it is becoming more and more unreasonable to expect that each user knows enough about the underlying data to construct a complete query describing her specific information needs. We thus develop "query completion" techniques to complement an incomplete initial query so as to retrieve user-specific relevant results of a desirable size. Such techniques are challenging to develop, as they must adapt to user-specific information needs to be effective. To address this challenge, we develop user-adaptive query completion techniques that dynamically capture the user-specific information needs, based on which we map the criteria into source queries complementing the initial query. Further, as sources often have limited query capabilities, we make this mapping capability-aware so that it generates processable queries. Lastly, we develop cost-aware query scheduling to identify a cost-effective schedule for executing the identified queries. Our experimental results validate the efficiency and effectiveness of our approach.

1 Introduction

As internet usage explodes, users are provided with access to data sources holding large amounts of data, e.g., realtor.com, which lists all houses on sale in the US. The problem of mining only the relevant results by querying such sources has thus become critical, while it is challenging as the notion of relevance may differ across users and their contexts. For instance, in searching for an ideal house, someone with a large family wants to know about large houses, while another may prefer a cozy small place of her own. While each may formulate a Boolean query to express her specific mining needs, e.g., using a Web interface, such querying is often incomplete, as Example 1 illustrates.

[Figure 1. Example initial query: a realtor.com search results page for Champaign, IL, showing the search form (price range, beds, baths) and a returned listing.]

Example 1: Consider user Amy, who is looking for a house in Champaign, IL, that is just the right size for her large family of four, yet within budget. To find such a house, she may submit an initial query Q1 below to an accessible data source, e.g., realtor.com, as Figure 1 illustrates.

    select * from house where city = 'Champaign' and price <= … and beds >= 4 and baths >= 2 (Query Q1)

As Figure 1 demonstrates, this query returns only one house, which is too few for this user to make a well-informed decision. Now, knowing that her initial criteria were too good to be true, she may want to relax her query (e.g., by relaxing the acceptable price range), which can easily return too many results and


overwhelm Amy, making it hard for her to isolate the desirable results among them.

Query completion techniques should thus be able to capture the user-specific information needs and adapt query completion to the specific user.

As Example 1 illustrated, it is becoming more and more unreasonable to expect users to formulate a complete query describing their information needs, as users are often unaware of the underlying data distributions (or are even unsure of their information needs until they explore some results). This paper thus studies how to automatically complement such incomplete initial queries. In particular, we study "query completion" techniques that broaden the query results when the initial query returns too few results and narrow them down when the initial query returns too many. However, such query completion is challenging, as these techniques should be adaptive to user-specific information needs. To illustrate, suppose we relax the initial query in Example 1. First, it is challenging to identify the attribute(s) to relax, e.g., relaxing beds >= 4 to beds >= 2, so as to maximize the effectiveness of the relaxation for the specific user. For a user who is not willing to negotiate on price, relaxing the acceptable range of price to include more expensive alternatives is extremely ineffective, generating only alternatives this user will never consider. Rather, for this user, it would be more effective to relax on beds, to suggest a smaller house with a much more appealing price tag. Second, even after we decide which attributes to relax, it is challenging to generate a relaxed query Q′ that (1) can be processed by the underlying source yet (2) returns a manageable amount of results, which requires knowledge of both the query capability and the data distribution of the underlying source. To illustrate, in Example 1, before considering to relax price

T and K > 1.

Definition 2: Invalid Response Page and Valid Response Page: The response page returned for an invalid query is an Invalid Response Page (IRP). The response page returned for an ideal query is a Valid Response Page (VRP).

CSI takes the URL of a Web search source as input and outputs a set of candidate SPPs, which form a small subset of all hyperlinks found in a VRP sampled from the input Web source. Each candidate SPP is initialized with a score of 1.0. We use the example pages in Figure 3 as a running example to explain the following steps of CSI:

(1) Probe query generation. First, we need to automatically generate three probe queries corresponding to the input Web source. These probe queries include one invalid query (qi) and two distinct ideal queries (qid1, qid2). In our running example, 'AnI248mpossibleq73uery', 'Java', and 'XML' are the three probe queries corresponding to qi, qid1, and qid2. Automatic generation of qi is explained in [3]. In this step, we present the Ideal Query Generator (IQG), a module for automatically generating qid1 and qid2. IQG selects the best queries, i.e., those that produce the largest response pages, from a set of candidate queries. For general purpose SEs like Google, every non-stopword can produce a reasonably good response page, so every non-stopword can be considered a candidate query. But this does not work for many specialized SEs (e.g., the local SE at ieee.org). We only consider one-term queries as candidate queries. The initial candidate query terms are non-stopwords coming from three sources: the cover page of the Web source, the title of the cover page (the text between the <title> and </title> tags), and the URL of the cover page.
The rationale behind this is that the cover page and title often contain several descriptions (e.g., the company's name and its products/services) that are present in several documents indexed by the local SE, while the URL contains a few highly important terms that are also present in several documents indexed by the local SE.

The cover page may contain too many terms, and not all of them can produce a good response page. Also, checking them all can be inefficient. Thus we keep only those terms that are most likely to appear in a document. A word frequency list from [1] is used to measure the popularity of terms, thereby deleting the less frequent terms. We need adequate candidate queries to generate the required ideal queries. We set the size of the candidate query set to two times the number of required ideal queries. If the cover page of a source is too simple, such that enough candidate queries cannot be generated, we add terms with the highest frequencies from the word frequency list [1] to ensure that adequate candidate queries are available. We then send the candidate queries to the source; those candidate queries with the largest response page sizes are kept as ideal queries.

[Figure 3. Three example response pages: (a) the IRP returned for the query 'AnI248mpossibleq73uery', containing the navigation links Home, Services, Downloads, Help and the message "No results found for AnI248mpossibleq73uery"; (b) the VRP returned for the query 'Java', containing the same navigation links, a Suggestions line with a "See more" link, two result links, and the page links 1, 2, 3, Next; (c) the VRP returned for the query 'XML', containing the same navigation links, a Suggestions line with a "See more" link, two result links, and the page links 1, 2, 3, 4, Next.]
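The candidate-query filtering and ideal-query selection described before Figure 3 can be summarized in a short sketch. The following Python code is illustrative only: the function names, the fetch_response_page callback, and the plain word-frequency dictionary are assumptions, not the paper's SELEGO/IQG implementation.

import re
from typing import Callable, Dict, List

def candidate_query_terms(cover_html: str, title: str, url: str,
                          stopwords: set, word_freq: Dict[str, int],
                          needed: int) -> List[str]:
    """Collect non-stopword terms from the cover page, its title, and its URL,
    then keep the terms that are most popular per an external word-frequency list."""
    text = cover_html + " " + title + " " + url.replace("/", " ")
    terms = {t.lower() for t in re.findall(r"[A-Za-z]+", text)}
    terms = {t for t in terms if t not in stopwords}
    # Rank by popularity; unknown terms get frequency 0 and drop to the bottom.
    ranked = sorted(terms, key=lambda t: word_freq.get(t, 0), reverse=True)
    # Keep twice as many candidates as the number of ideal queries we need.
    candidates = ranked[: 2 * needed]
    if len(candidates) < 2 * needed:
        # Cover page too simple: pad with globally frequent terms.
        extra = [w for w in sorted(word_freq, key=word_freq.get, reverse=True)
                 if w not in candidates and w not in stopwords]
        candidates += extra[: 2 * needed - len(candidates)]
    return candidates

def ideal_queries(candidates: List[str],
                  fetch_response_page: Callable[[str], str],
                  needed: int = 2) -> List[str]:
    """Probe the source with each candidate query and keep the queries
    whose response pages are largest."""
    sizes = {q: len(fetch_response_page(q)) for q in candidates}
    return sorted(sizes, key=sizes.get, reverse=True)[:needed]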
(2) SE connection. Let IRP, VRP1, and VRP2 be the response pages returned for the probe queries qi, qid1, and qid2. As mentioned in section 2, to fetch these response pages, we use the SE connection component of SELEGO


[6]. In our example, the response pages in Figures 3a, 3b, and 3c correspond to IRP, VRP1, and VRP2.

(3) Initial candidate SPP list generation. The initial list of candidate SPPs (lc) is generated by extracting only those hyperlinks in VRP1 whose captions match the captions of hyperlinks in VRP2. In our example, by matching the captions of hyperlinks in Figures 3b and 3c, we get lc = {Home, Services, Downloads, Help, See more, 2, 3, Next}. The main reason for using captions and not URLs as the feature for extracting candidate SPPs is that in almost all the Web search sources we encountered, the URLs of the SPPs are mostly query dependent, i.e., the query is included in the URL of the SPP. Hence, the URLs of the SPPs in two different VRPs returned for two different queries will not match and would not be included in the list of candidate SPPs. This simple technique of using captions as the feature to identify candidate SPPs therefore allows us to include some or all of the actual target SPPs among the candidate SPPs. This step also gets rid of all other hyperlinks such as result links, advertisement links, suggestion links, etc., which are mostly query dependent and usually form the majority of hyperlinks in a Web page. Therefore, this step usually yields only a few hyperlinks as candidate SPPs, which are further pruned in the next step.

(4) Final candidate SPP list generation. Static links (ls) are those hyperlinks that appear in both the IRP and a VRP. We observed that in most Web sources, Website navigation links and other miscellaneous links such as Terms of Use form the static links, which can be used for further pruning the candidate SPPs. In our example, by extracting the common hyperlinks in Figures 3a and 3b, we get ls = {Home, Services, Downloads, Help}. It can be seen that ls forms a subset of lc. Since ls can never contain the actual target SPPs, we remove them from lc to get the final list of candidate SPPs, i.e., lc = lc - ls. In our example, lc = {Home, Services, Downloads, Help, See more, 2, 3, Next} - {Home, Services, Downloads, Help} = {See more, 2, 3, Next}. We next initialize each candidate SPP in the final list with an initial score (sc) of 1.0, which may be further adjusted by the subsequent steps. The final score of a candidate SPP measures its likelihood of being an actual target SPP.
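A minimal sketch of steps (3) and (4), assuming hyperlinks have already been extracted as (caption, url) pairs; the helper names and the simple dictionary representation are illustrative, not the paper's actual data structures.

from typing import Dict, List, Tuple

Link = Tuple[str, str]  # (caption, url)

def candidate_spps(vrp1_links: List[Link], vrp2_links: List[Link],
                   irp_links: List[Link]) -> Dict[str, float]:
    """Return candidate SPPs (keyed by caption) with an initial score of 1.0."""
    vrp2_captions = {caption for caption, _ in vrp2_links}
    irp_captions = {caption for caption, _ in irp_links}

    # Step (3): keep VRP1 links whose captions also occur in VRP2.
    initial = {caption for caption, _ in vrp1_links if caption in vrp2_captions}

    # Step (4): remove static links, i.e., links that also appear in the IRP.
    final = initial - irp_captions

    return {caption: 1.0 for caption in final}

# Usage with the running example:
vrp1 = [(c, "#") for c in ["Home", "Services", "Downloads", "Help",
                           "See more", "Java Tutorial", "2", "3", "Next"]]
vrp2 = [(c, "#") for c in ["Home", "Services", "Downloads", "Help",
                           "See more", "XML Spy", "2", "3", "4", "Next"]]
irp = [(c, "#") for c in ["Home", "Services", "Downloads", "Help"]]
print(candidate_spps(vrp1, vrp2, irp))
# e.g. {'See more': 1.0, '2': 1.0, '3': 1.0, 'Next': 1.0}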

4.2. Label matching heuristic (LM)

LM takes the list of candidate SPPs and their corresponding scores as input and outputs the same list with scores that may have been adjusted. LM uses a predetermined list of SPP labels collected from two hundred Web sources. LM matches the caption of each candidate SPP with the collected labels. If the caption of any candidate SPP matches a label exactly, then we decrease its score by a constant parameter α (α ∈ [0, 1]). In this way, candidate SPPs whose scores are decreased are considered more likely to be target SPPs. Sometimes, if a specific Web source uses entirely new captions for its SPPs, then none of the captions of the candidate SPPs match the predetermined labels. In this case, the scores remain the same. However, subsequent steps can still identify the target SPPs, and when a target SPP is identified, its caption is extracted and added to the predetermined list of labels. In this way, newly learned labels may be useful when a new Web source uses similar captions for its SPPs.
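A small sketch of the label-matching adjustment; the label list is hypothetical, the score dictionary continues the previous sketch, and the multiplicative update mirrors the score updates in the FW pseudocode of Figure 4 (the paper only says the score is decreased by α).

def label_matching(scores: dict, known_labels: set, alpha: float = 0.1) -> dict:
    """Decrease the score of every candidate SPP whose caption exactly matches
    a previously collected SPP label (a lower score means a more likely SPP)."""
    return {caption: (score * alpha if caption in known_labels else score)
            for caption, score in scores.items()}

known_labels = {"Next", "More results", ">>", "2"}  # hypothetical predetermined list
scores = {"See more": 1.0, "2": 1.0, "3": 1.0, "Next": 1.0}
print(label_matching(scores, known_labels))
# {'See more': 1.0, '2': 0.1, '3': 1.0, 'Next': 0.1}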

4.3. Static links based heuristic (SL)

SL takes the list of candidate SPPs and their scores as input and outputs the same list with scores that may have been further adjusted. We use the static links (ls) found in section 4.1 as a feature to further adjust the scores of the candidate SPPs. For each candidate SPP, we first download its corresponding page and save it in a repository to avoid downloading it again in subsequent steps. We then check if the downloaded page contains all the static links (ls). If this condition is true, we update the score of that candidate SPP by a constant parameter β1 (β1 ∈ [0, 1]). Otherwise, the score is unchanged. The motivation behind this heuristic is that all static links that appear in the first response page (VRP1) also appear in subsequent response pages, and hence the scores of candidate SPPs whose pages contain all the static links are adjusted.
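A sketch of the static-links check over the same hypothetical score dictionary; download_page and extract_captions (which must return the set of link captions in a page) stand in for the system's SE connection and parsing components, and beta1 mirrors the constant parameter above.

def static_link_heuristic(scores: dict, candidate_urls: dict, static_links: set,
                          download_page, extract_captions, beta1: float = 0.5):
    """Lower the score of candidates whose pages contain all static links,
    and cache the downloaded pages for reuse by later heuristics."""
    adjusted, page_cache = {}, {}
    for caption, score in scores.items():
        page = download_page(candidate_urls[caption])
        page_cache[caption] = page  # saved so later steps need not re-download
        if static_links <= extract_captions(page):  # all static links present?
            score *= beta1
        adjusted[caption] = score
    return adjusted, page_cache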

4.4. Form widget based heuristic (FW)

Another feature common among the different response pages returned for a particular query by a Web search source is the form widget feature. We use form widgets (the type of form widget) and their names (the value of the attribute 'name' in the HTML tag of the form widget) as a feature in further adjusting the scores of the input candidate SPPs. The form widgets that our algorithm uses are text box, text area, check box, radio button, submit button, image button, drop-down box, and hidden tag (input tag of type hidden). Figure 4 illustrates the FW functionality. In Figure 4, we first find all the form widgets in the VRP, represented as wlVRP (line 1). If the VRP does not have any form widgets, then we check if at least one candidate SPP page (in the repository) has at least one form widget (lines 2 – 3). If this condition is true, we adjust the score (sc,i) of each candidate SPP not having any form widgets by a constant parameter β2 (β2 ∈ [0, 1]) (lines 4 – 6), whereas the scores of all candidate SPPs having at least one form widget are unchanged. The motivation behind this is that if the VRP does not have any widgets, then it is highly unlikely that subsequent pages will have any new widgets. Therefore, candidate SPPs not having any widgets have a higher chance of being the target SPPs (hence their scores are reduced) than those having at least


one widget. Note that if the VRP does not have any form widgets and none of the candidate SPPs has at least one form widget, then the scores of all candidate SPPs are unchanged. However, if the VRP does have some form widgets, then for each candidate SPP page we check if all form widgets in the VRP also appear in the candidate SPP page with the same name. If this condition is true, we adjust the score of the candidate SPP by a constant parameter β3 (β3 ∈ [0, 1]), indicating that it is highly likely to be a target SPP (lines 10 – 12). If not all form widgets in the VRP appear in a candidate SPP page, then the score of that candidate SPP is unchanged. The motivation behind this step is that all form widgets that appear in the first response page (VRP1) also appear in subsequent response pages. Section 6 explains how the parameters α, β1, β2, and β3 are set.

Procedure: fwHeuristic()
1:  let wlVRP be the form widgets in VRP
2:  if |wlVRP| == 0 then
3:    if at least 1 lc,i has at least 1 widget then
4:      for each lc,i having no widgets do
5:        sc,i = sc,i * β2, where β2 ∈ [0, 1]
6:      end for
7:    end if
8:  else
9:    for each lc,i do
10:     if all widgets in wlVRP appear in wlc,i with the same name then
11:       sc,i = sc,i * β3, where β3 ∈ [0, 1]
12:     end if
13:   end for
14: end if

Figure 4. Form widget heuristic.

4.5. Page structure similarity (PSS)

Another important feature common among the different response pages returned for a particular query by a Web search source is the page structure of the response pages itself. The page structure of a response page is reflected in its tag string, which is the concatenation of all opening tags in the page. According to the observation in section 1, since all response pages returned by a Web search source are generated by the same program, it is intuitive that the program simply wraps the content of each response page in similar or identical HTML tags. Therefore, the tag strings of different response pages are considered similar or identical. To capture the similarity between the tag string of the VRP and the tag string of each of the candidate SPP pages, we use a popular approximate string matching algorithm, Levenshtein Distance (LD) [7]. LD in our algorithm is defined as the smallest number of insertions, deletions, and substitutions of tags needed to convert one tag string into another. For example, if two tag strings differ in exactly one tag, then the LD between them is 1. For the current work, we use normalized LD (NLD), which was also used in other works such as [8]. NLD between two tag strings t1 and t2 is defined as

    NLD(t1, t2) = LD(t1, t2) / ((length(t1) + length(t2)) / 2)

where LD(t1, t2) gives the LD between t1 and t2 and the length function returns the number of tags in the input tag string. In our algorithm we therefore calculate the NLD between the tag string of the VRP and the tag string of each of the candidate SPP pages. The NLD value obtained between the VRP tag string and a candidate SPP tag string is added to the current score of that candidate SPP to get its final score. We found that the NLD between the VRP and the candidate SPPs that are actual target SPPs is almost always very low compared to the NLD between the VRP and the candidate SPPs that are not actual target SPPs.
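The tag-string comparison can be sketched as follows, assuming tag strings are represented as lists of tag names; the Levenshtein routine is a standard dynamic-programming implementation, not necessarily the one of [7].

from typing import List

def levenshtein(a: List[str], b: List[str]) -> int:
    """Edit distance over tag sequences (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ta in enumerate(a, 1):
        curr = [i] + [0] * len(b)
        for j, tb in enumerate(b, 1):
            curr[j] = min(prev[j] + 1,                 # deletion
                          curr[j - 1] + 1,             # insertion
                          prev[j - 1] + (ta != tb))    # substitution
        prev = curr
    return prev[len(b)]

def nld(t1: List[str], t2: List[str]) -> float:
    """Normalized Levenshtein distance between two tag strings."""
    return levenshtein(t1, t2) / ((len(t1) + len(t2)) / 2)

vrp_tags = ["html", "body", "table", "tr", "td", "a"]
spp_tags = ["html", "body", "table", "tr", "td", "a"]
other_tags = ["html", "body", "div", "p"]
print(nld(vrp_tags, spp_tags))    # 0.0 -> structurally identical
print(nld(vrp_tags, other_tags))  # larger -> structurally different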

5. Wrapper construction

This section discusses the Wrapper Builder component in detail. The Wrapper Builder takes the candidate SPPs and their scores, identified by the different steps in section 4, as input and outputs a single wrapper that can be used to extract any specified response page from the input Web search source for any specified query. In section 5.1, we discuss the features that constitute an SPP wrapper, and in section 5.2 we discuss how an SPP wrapper is constructed when the target SPPs are of any of the types described in section 1.

5.1. Wrapper format description

As noted earlier, SPPs are mostly in the form of hyperlinks and are represented by the corresponding URLs. Even when SPPs are of the submit_spp type, they are represented by URLs constructed as described in section 5.2. URLs representing SPPs almost always contain two parts: (1) a host part and (2) a query part. If www.google.com/search?q=xml&p=11 is an example SPP, then www.google.com/search is the host part while q=xml&p=11 is the query part. A delimiter '?' separates the host and query parts. Similarly, the query part consists of one or more parameters, which are also separated by delimiters. A parameter is a (name, value) pair and is usually represented as 'name=value'; '&' is used as the parameter delimiter. An SPP as described above can be represented by the following regular expression:

    hostpart(D(parameter(d)))

where hostpart is the host part of the SPP, D is the delimiter separating the host and query parts, parameter represents the parameters in the query part, and d is the delimiter separating those parameters.
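A sketch of splitting an SPP URL into its host part and (name, value) parameters using Python's standard urllib; the example URL is the one used above.

from urllib.parse import urlsplit, parse_qsl

def split_spp(url: str):
    """Return the host part and the ordered (name, value) parameters of an SPP URL."""
    parts = urlsplit(url)
    host_part = f"{parts.scheme}://{parts.netloc}{parts.path}" if parts.scheme \
        else f"{parts.netloc}{parts.path}"
    parameters = parse_qsl(parts.query, keep_blank_values=True)
    return host_part, parameters

print(split_spp("http://www.google.com/search?q=xml&p=11"))
# ('http://www.google.com/search', [('q', 'xml'), ('p', '11')])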


Observations. Structured data extraction algorithms [4, 5, 6, 8, 9, 10, 11, 12, 14] are based on the assumption that the data to be extracted from a Web page follow a few regularities in the way they are displayed (e.g., similar tag strings, each record starting on a new line, etc.) to make them comprehensible to users. Similarly, we observed that the URLs of SPPs generated by a Web source in response to a query also follow a few regularities, most likely because the server-side programs that generate response pages use simple and straightforward logic: (1) One feature is that the query submitted to any Web search source is included as a parameter value in the query part of all the SPPs generated for that query. We refer to such a parameter, which has the query as its value, as a query parameter. In the above example SPP, q=xml is the query parameter if the query submitted is xml. (2) Another feature is that in most cases, one or at most two parameters are used to uniquely identify each SPP returned for a particular query by a Web search source. We refer to such a parameter(s), which uniquely identifies an SPP, as a page parameter(s). If we consider two different SPPs generated for a particular query by a Web search source, then the two SPPs differ only in the page parameter(s), i.e., the two SPPs are identical if we ignore the values of the page parameter(s). For example, consider the following three example SPPs that fetch the second, third, and fourth response pages returned for the query 'xml' by Google. The three SPPs are identical if the page parameter values 11, 21, and 31 in the SPPs are ignored. A page parameter in an SPP carries information about the actual number of the corresponding response page.

    www.google.com/search?q=xml&p=11&sl=1
    www.google.com/search?q=xml&p=21&sl=1
    www.google.com/search?q=xml&p=31&sl=1

By plugging a new value into the query parameter and an appropriate value into the page parameter, we can fetch the desired response page for the new query. Therefore, the goal of the Wrapper Builder is to identify both the query and page parameters in at least two candidate SPPs. Once identified, we generalize the corresponding SPPs by replacing the query and page parameter values with two placeholders ($query and $start respectively), thereby obtaining a single generalized SPP. For the above three SPPs, www.google.com/search?q=$query&p=$start&sl=1 is the generalized SPP. By replacing $query and $start with appropriate values, any desired response page can be fetched. However, to replace $start with an appropriate value, we have to first know what the appropriate value is. For this we discover two more values, called the initial value and the incremental value, defined as follows:

Definition 1: An initial value for a generalized SPP G is the minimum of all page parameter values identified in the candidate SPPs that generated G.

Definition 2: An incremental value for a generalized SPP G is the minimum of all pairwise absolute differences between the page parameter values identified in the candidate SPPs that generated G.

For the above three example SPPs, the initial value is 11, i.e., min(11, 21, 31), and the incremental value is 10, i.e., min(abs(11-21), abs(21-31), abs(31-11)). If we assume that the first SPP in the first response page points to the kth (k = 1 or 2) response page, then to fetch the nth response page we replace $start with

    initial value + incremental value * (n - k)

For example, if the first example SPP shown above points to the second response page, then to fetch the fifth response page for the query "health", we replace $query in the above generalized SPP with "health" and $start with 41. It should be noted that in all the response pages we encountered, the initial and incremental values are always constant for any query submitted to a Web source. Also, for cases where there are two page parameters in an SPP, we use $end as a placeholder for the second page parameter value, and its initial and incremental values are found in the same way as described above. We consider SPPs that differ only in one or two page parameters to follow regular query patterns. Rarely, we may encounter SPPs not following a regular query pattern, and we consider such SPPs to follow irregular query patterns. Currently, our system reports that a generalized SPP could not be created when it encounters SPPs following irregular query patterns. Once a generalized SPP and the corresponding initial and incremental values are discovered, we record them in an XML file, which forms our final SPP wrapper.
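The generalization and page-number arithmetic described above can be sketched as follows for the single-page-parameter case; the function names and the assumption that the page and query parameter names are already known are illustrative.

from urllib.parse import urlsplit, parse_qsl, urlencode, urlunsplit

def generalize(spps, page_param, query_param):
    """Build a generalized SPP plus its initial and incremental values
    from concrete SPP URLs that differ only in one page parameter."""
    values = []
    template = None
    for url in spps:
        parts = urlsplit(url)
        params = dict(parse_qsl(parts.query))
        values.append(int(params[page_param]))
        params[page_param] = "$start"    # page parameter placeholder
        params[query_param] = "$query"   # query parameter placeholder
        template = urlunsplit((parts.scheme, parts.netloc, parts.path,
                               urlencode(params, safe="$"), parts.fragment))
    initial = min(values)
    incremental = min(abs(a - b) for a in values for b in values if a != b)
    return template, initial, incremental

def page_url(template, initial, incremental, query, n, k=2):
    """URL of the nth response page, assuming the first SPP points to page k."""
    start = initial + incremental * (n - k)
    return template.replace("$query", query).replace("$start", str(start))

spps = ["http://www.google.com/search?q=xml&p=11&sl=1",
        "http://www.google.com/search?q=xml&p=21&sl=1",
        "http://www.google.com/search?q=xml&p=31&sl=1"]
tpl, init, inc = generalize(spps, page_param="p", query_param="q")
print(page_url(tpl, init, inc, "health", n=5))
# http://www.google.com/search?q=health&p=41&sl=1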

5.2. Constructing the SPP wrapper

In this section, we explain how an SPP wrapper (discussed in the previous section) is constructed when the target SPPs are of any of the types described in section 1. We first check whether the number of input candidate SPPs is greater than one, equal to one, or equal to zero. If the number of candidate SPPs is greater than one, procedure checkmulti_spp (Figure 5) is invoked to check if the SPPs are of multi_spp type. The input to checkmulti_spp is the list of candidate SPPs (SPPlist) and their corresponding scores (Scorelist). In line 1, procedure buildwrapper is invoked, which tries to build a wrapper as described in section 5.1. The main functionality of buildwrapper includes building a generalized SPP by comparing two candidate SPPs at a time and identifying its initial and incremental values. Due to space constraints, pseudocode for buildwrapper is not provided. buildwrapper returns true if it can successfully build a wrapper, implying that the SPPs are of multi_spp type. Otherwise, it returns false, in which case


we invoke procedure checksingle_spp (Figure 6) to check if the SPPs are of single_spp type (line 3). If checksingle_spp returns true, it implies the wrapper is constructed successfully and the SPPs are of single_spp type (line 12). Otherwise, procedure checksubmit_spp (Figure 7) is invoked to check if the SPPs are of submit_spp type (line 5). If checksubmit_spp returns true, it implies the wrapper is constructed successfully and the SPPs are of submit_spp type (line 9). Otherwise, the algorithm is unable to build a wrapper (lines 6 – 7).

Multiple generalized SPP case. In some cases, more than one generalized SPP may be identified for the same source. This happens when different groups of candidate SPPs follow a regular query pattern. In such cases, the system should choose only one of the generalized SPPs as the target generalized SPP. Randomly choosing one of them might lead to selecting a false positive. To decrease the likelihood of selecting a false positive, the system first determines a generalized SPP score for each generalized SPP, defined as

    score(G) = ( Σ lc,i.score over all lc,i ∈ lc.G ) / |lc.G|

where G is a generalized SPP, lc.G is the list of candidate SPPs that generated G, lc,i.score is the score of a candidate SPP that was involved in generating G, and |lc.G| is the total number of candidate SPPs that generated G. The system then chooses the generalized SPP having the lowest generalized SPP score as the target generalized SPP. This functionality for choosing a single generalized SPP is also part of the procedure buildwrapper.

Procedure: checkmulti_spp(SPPlist, Scorelist)
1:  wflag = buildwrapper(SPPlist, Scorelist)
2:  if wflag is false then
3:    wflag = checksingle_spp(SPPlist, Scorelist)
4:    if wflag is false then
5:      wflag = checksubmit_spp()
6:      if wflag is false then
7:        wrapper could not be created
8:      else
9:        wrapper created successfully
10:     end if
11:   else
12:     wrapper created successfully
13:   end if
14: else
15:   wrapper created successfully
16: end if

Figure 5. Checking for multi_spp type.
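A one-line sketch of the generalized SPP score defined before Figure 5: the average score of the candidate SPPs that generated a generalized SPP, with the lowest average chosen as the target.

def generalized_spp_score(candidate_scores: list) -> float:
    """Average score of the candidate SPPs that generated a generalized SPP."""
    return sum(candidate_scores) / len(candidate_scores)

# Choose the generalized SPP with the lowest average candidate score (hypothetical groups).
groups = {"g1": [0.1, 0.1, 0.5], "g2": [1.0, 1.0]}
best = min(groups, key=lambda g: generalized_spp_score(groups[g]))
print(best)  # g1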

Apart from being invoked from checkmulti_spp, procedure checksingle_spp (Figure 6) will also be invoked when the number of input candidate SPPs is equal to one. The input to checksingle_spp is the list of candidate SPPs (SPPlist) and their scores (Scorelist). In line 1, we sort SPPlist in ascending order of the scores of the candidate SPPs. In line 2, we start processing each of the first K (K = 5 in our experiments) candidate SPPs in the sorted SPPlist having the query qid1 in the query part of their URLs. In line 3, we check if the caption of any of the hyperlinks (say ltemp) present in the page pc,i (corresponding to candidate SPP lc,i) matches the caption of lc,i. If this condition is true, we invoke procedure buildwrapper, which tries to build a wrapper using the parameters lc,i and ltemp. If buildwrapper returns false, we repeat the process with the next candidate SPP. Otherwise, checksingle_spp returns true, indicating that a wrapper has been constructed successfully (line 6). If all K candidate SPPs fail to build a wrapper, checksingle_spp returns false, indicating that the target SPPs are not of single_spp type. In this case, procedure checksubmit_spp is invoked to check if the SPPs are of submit_spp type. Note that the parameter K is used only for efficiency purposes, as in practice it is highly unlikely to find a wrapper if it was not found after processing the first K candidate SPPs having the lowest scores. Also, sorting the input candidate SPPs in line 1 decreases the likelihood of finding a generalized SPP that is a false positive, as we return the first identified generalized SPP. The main motivation for finding a hyperlink (in page pc,i) with a caption similar to the caption of lc,i is that in most cases of single_spp type SPPs, the caption of the target SPP is similar in all response pages returned by a Web search source. Due to space constraints, we do not discuss the functionality that handles the case when SPPs are of single_spp type but their captions differ across the response pages returned by a Web search source.

Procedure: checksingle_spp(SPPlist, Scorelist)
1:  sort SPPlist in ascending order of SPP scores
2:  for each lc,i having query qid1 in SPPlist until K do
3:    if lc,i.caption matches a caption of any link (ltemp) in pc,i then
4:      wflag = buildwrapper(lc,i, ltemp)
5:      if wflag is true then
6:        return true
7:      end if
8:    end if
9:  end for
10: return wflag

Figure 6. Checking for single_spp type.

We now describe procedure checksubmit_spp (Figure 7), which may also be invoked when the total number of


input candidate SPPs is equal to zero. In line 1, we build a tag tree [13] of VRP1. A tag tree data model for Web pages consists of tag nodes forming a nested tree structure. A start tag and its optional end tag in a Web page are represented as a single tag node in the tag tree. Any content (text or other tags) between the start and end tags in the Web page is reflected in the sub-tree of the corresponding tag node in the tag tree. The root of the tag tree is HTML, and each tag node can be located by following a path from the root to the node. In line 2, we extract all the form objects, i.e., the tag nodes corresponding to form tags along with their sub-trees. In lines 3 – 8, we process each form object having at least one hidden tag. In line 4, we construct a URL by extracting (name, value) pairs from each hidden tag, i.e., by extracting the values of the attributes name and value. If the constructed URL contains qid1, we add it to list1 (lines 5 – 6). Finally, if list1 has at least one element, procedure processHiddenURL is invoked with list1 as the parameter (lines 9 – 10). Otherwise, we return false, indicating that a wrapper could not be constructed (line 12). processHiddenURL takes a list of hidden URLs (URLs constructed from hidden tags) as input and processes each URL separately. For each hidden URL (hc), it first downloads the page referenced by hc and extracts all hidden URLs in that page as described above. It then tries to build a wrapper by invoking procedure buildwrapper using hc and each of the newly extracted hidden URLs (say hc1) as the parameters. If buildwrapper returns true, then processHiddenURL also returns true, indicating that a wrapper has been constructed successfully. Otherwise, buildwrapper is invoked using a new combination of hc and hc1.

Procedure: checksubmit_spp()
1:  build a tag tree of VRP1
2:  extract all form objects in the tag tree
3:  for each form object having hidden tags do
4:    form a URL by extracting (name, value) pairs from hidden tags
5:    if URL contains query qid1 then
6:      add URL to list1
7:    end if
8:  end for
9:  if |list1| >= 1 then
10:   return processHiddenURL(list1)
11: else
12:   return false
13: end if

Figure 7. Checking for submit_spp type.

6. Experiments

We initially surveyed several Web sources to determine the important heuristics that can be used in our algorithm. After identifying the heuristics, we developed the system by setting initial (intuitive) values for the parameters α, β1, β2, and β3. We then conducted our initial experiments to determine the final values of α, β1, β2, and β3 that reflect the effectiveness of the LM, SL, and FW heuristics in identifying SPPs. We then used the final parameter values to perform our final experiments, evaluating the system on 85 new Web sources. Below, we first explain the results of our initial experiments, followed by our final experiments.

Table 1. Initial experiments summary.

             LM    SL    FW(β2)   FW(β3)
A            42    36    2        43
B            1     13    1        14
Accuracy     97%   73%   66%      75%

Initial experiments: Our initial experiments included a total of 50 Web search sources taken from diverse categories. We measured the accuracy of each of the heuristics, which indicates their effectiveness in identifying target SPPs. Let A be the number of sources for which the target SPP is identified by a heuristic H, and B be the number of sources for which at least one candidate SPP that is not the target SPP is identified by H. Then, the accuracy of H is defined as A/(A+B). It should be noted that apart from A and B there may be some sources among the 50 Web sources for which H does not identify even a single candidate SPP as the target SPP. Table 1, which presents a summary of our initial experiments, shows that LM is the best heuristic, with an accuracy of 97%. FW for the case when β3 is used is the next best, with an accuracy of 75%. SL is third best, with an accuracy of 73%. Though FW for the case when β2 is used was encountered only 2 times, it could identify the target SPP on both occasions, and hence we consider it an important feature. From these observations, we decided to use 0.1 for α, 0.4 for β3, 0.5 for β1, and 0.6 for β2.

Final experiments: Our final experiments included 85 new Web sources (completely different from those used in the initial experiments), including several general purpose SEs (e.g., Google), e-stores, and other specialized sources taken from categories such as health (cancer research), science & technology, newspapers, etc. Such diverse sources were chosen to validate our method's effectiveness comprehensively. It should be noted that sources whose SPPs follow irregular query patterns were not included in our final experiments. Below, we summarize our test results.

Accuracy: Our system achieved an accuracy of 95.2%, i.e., out of 85 sources it failed on only 4. Among the 4 failed cases, one source failed because its IRP always displayed SPPs even though 0 results were returned, since it returns several advertisement links no matter what invalid query is submitted. Due to this, the SPPs are always included in the


static links and hence will be removed from the final candidate SPP list returned by CSI. Two other sources having a single_spp type SPP failed because the system wrongly concluded that their SPPs were of multi_spp type, as a few candidate SPPs followed regular query patterns. The last source, which had a single_spp type SPP, failed because a wrong generalized SPP was chosen, since the corresponding candidate SPP had a score lower than the score of the candidate SPP that was the actual target SPP.

Effectiveness of the CSI module: The total number of unique hyperlinks identified across all 85 Web pages was 4228, while the total number of candidate SPPs identified by CSI was only 734. Therefore, on average 49.74 unique links were identified per Web page, while on average only 8.63 candidate SPPs were identified. This shows that CSI plays a critical role in keeping the system efficient, as the subsequent steps have to process only a few candidate SPPs.

SPP types on the Web: We found that multi_spp and single_spp type SPPs were employed by most Web sources. multi_spp type SPPs were used by 47 sources, while single_spp type SPPs were used by 31 sources. SPPs of submit_spp type were used by only 2 sources, and 5 sources did not use any SPPs at all, i.e., the result records were always displayed in the first response page.

Execution time: The average time taken for our system to build a wrapper was 20.5 seconds. Once a wrapper is created for a Web source, any specified response page can be fetched from that source in a fraction of a second. All experiments were conducted on a Pentium 4 3.1 GHz laptop with 512 MB RAM and T-1 Internet access. We consider the proposed method efficient enough to be used in practice.

7. Conclusions and future work

This paper proposed an effective and efficient solution for automatically fetching any specified response page from Web search sources. This is an important task for information integration systems, as Web search sources most often split their results among several response pages and return only the first response page. The proposed approach first identifies certain important hyperlinks present in a response page sampled from an input Web search source and then further analyzes them using four independent heuristics. Finally, a wrapper is built to automatically extract any specified response page from the input Web search source. Experimental results showed that the proposed method is highly effective (95.2% accuracy) and efficient. In the immediate future, we plan to: (1) Handle cases where SPPs follow irregular query patterns. For this, we will be using a machine learning approach, where we use a feature vector for each candidate SPP. We will use the output obtained from each heuristic, after processing the candidate SPP, as a feature in the feature vector. (2) Handle cases where SPPs are JavaScript enabled (e.g., preventcancer.org), i.e., clicking an SPP invokes a JavaScript method before sending the request to the server. (3) Use the visual position of SPPs to identify them. For example, in some cases SPPs appear both at the top and bottom of the Web page, and such visual information can be utilized for identifying them. (4) Perform large-scale experiments.

References

[1] W. Meng, C. Yu, and K. Liu. Building Efficient and Effective Metasearch Engines. ACM Computing Surveys, 34(1), 48-89, March 2002.
[2] S. Raghavan and H. Garcia-Molina. Crawling the Hidden Web. VLDB, 129-138, September 2001.
[3] Z. Wu, D. Mundluru, and V. Raghavan. Automatically Detecting Boolean Operations Supported by Search Engines, towards Search Engine Query Language Discovery. Workshop on Web-based Support Systems, 171-178, September 2004.
[4] H. Zhao, W. Meng, Z. Wu, V. Raghavan, and C. Yu. Fully Automatic Wrapper Generation for Search Engines. WWW, 66-75, May 2005.
[5] L. Chen, H. Jamil, and N. Wang. Automatic Composite Wrapper Generation for Semi-Structured Biological Data Based on Table Structure Identification. SIGMOD Record, 33(2), 58-64, 2004.
[6] Z. Wu, V. Raghavan, H. Qian, V. Rama, W. Meng, H. He, and C. Yu. Towards Automatic Incorporation of Search Engines into a Large Scale Metasearch Engine. Web Intelligence, 658-661, October 2003.
[7] P. Hall and G. Dowling. Approximate String Matching. ACM Computing Surveys, 12(4), 381-402, 1980.
[8] B. Liu, R. Grossman, and Y. Zhai. Mining Data Records in Web Pages. SIGKDD, 601-606, August 2003.
[9] B. Liu and K. Chang. Editorial: Special Issue on Web Content Mining. SIGKDD Explorations, 6(2), 1-4, December 2004.
[10] I. Muslea, S. Minton, and C. Knoblock. A Hierarchical Approach to Wrapper Induction. Agents, 190-197, 1999.
[11] L. Buttler, L. Liu, and C. Pu. A Fully Automated Object Extraction System for the World Wide Web. ICDCS, 361-370, 2001.
[12] C. Chang and S. Liu. IEPAD: Information Extraction Based on Pattern Discovery. WWW, 681-688, 2001.
[13] S. Chakrabarti. Mining the Web: Discovering Knowledge from Hypertext Data. Morgan Kaufmann, San Francisco, CA, 2002.
[14] R. Doorenbos, O. Etzioni, and D. Weld. A Scalable Comparison-Shopping Agent for the World-Wide Web. Agents, 39-48, 1997.


Collaborative Package-based Ontology Building and Usage

Jie Bao and Vasant Honavar
Artificial Intelligence Research Laboratory
Computer Science Department
Iowa State University
Ames, IA 50010, USA
Email: {baojie, honavar}@cs.iastate.edu

Abstract

With the increased complexity of ontologies, building and using large-scale ontologies requires cooperative and scalable mechanisms. In this paper, we propose a modular ontology approach to this challenge. A large ontology is divided into multiple smaller modules called packages, which are sets of highly related terms and term relations; packages can be nested in other packages, forming a package hierarchy. The visibility of a term is controlled in order to realize partial knowledge hiding in the process of knowledge sharing. The whole ontology is composed of a set of packages. A concrete package-based ontology language, the package-based partial-order (hierarchy) ontology (PPO), is studied to illustrate the features of such modular ontologies. We also present a modular ontology editor for building PPOs and show its applications on practical ontologies.

1 Introduction

The Semantic Web [3] aims to support seamless and flexible access and use of semantically heterogeneous, networked data, knowledge, and services. The success of the Semantic Web relies on the availability of a large collection of domain- or application-specific ontologies and mappings between ontologies to allow integration of data [16]. An increasing need for sharing information and services between autonomous organizations has led to major efforts aimed at the construction of ontologies in many domains, e.g., the Gene Ontology (GO, www.geneontology.org). With the rapid proliferation of ontologies, their complexity (e.g., size, structure) and the complexity of the process of building ontologies are also increasing. A typical ontology used today contains thousands of terms. For example, GO contains 2 × 10^5 terms and the Gramineae Taxonomy Ontology contains 7 × 10^5 terms. Large-scale ontologies bring new challenges for the building, adaptation, use and reuse of such ontologies. There is a great need for special mechanisms and tools for well-controlled and efficient management of large ontologies.

By its very nature, ontology construction is a collaborative process which involves direct cooperation among individuals or groups of domain experts, or indirect cooperation through the reuse or adaptation of previously published, autonomously developed, and very likely semantically heterogeneous ontologies. In order for an ontology to be broadly useful to a certain community, it needs to capture the knowledge based on the collective expertise of multiple experts and research groups. Typically, a large ontology is built and curated by a community. Each member of the community contributes only a part of the ontology, and the final ontology is compiled from the contributed pieces.

The mechanisms and tools should be scalable to handle large-scale ontologies for storage, editing, browsing, visualization, reasoning and reuse. While a small or moderate-sized ontology can be fully loaded into an ontology tool (e.g., an editor or a reasoner), it is usually very inefficient or even impossible to fully load a large ontology into such a tool. We need tools that are able to process large ontologies with limited time and space (e.g., memory) resources. Hence, there is an acute need for approaches and tools that facilitate collaborative and scalable design and reuse of ontologies.

Against this background, this paper describes a modular ontology approach to this problem and studies the package-based organization and development of ontologies. In particular, we study the package-based partial-order ontology language and tools for modular hierarchies, and illustrate their usefulness with an example from life science.


2 Desiderata of Modular Ontology

Typically, during the construction of a large ontology, multiple relatively autonomous groups will contribute the parts of the ontology that pertain to their domains of expertise or responsibility. For example, an ontology about a university contains information about buildings, people, departments, institutions, teaching activities, research activities, finance, administration and so on. The ontology is built by multiple groups of people, each from a relatively autonomous unit, such as different departments, laboratories or offices. The ontology in question should be a semantically coherent integration of the constituent parts developed by the individual groups. We enumerate below some desiderata of such a collaborative large-scale ontology building process.

Local Terminology: Terms used in different ontologies should be given unique identifiers. This is necessary to avoid name conflicts when merging ontologies and to avoid unwanted interactions among modules. For example, TurkeyStudy means the study of a bird in the animal science department ontology, while TurkeyStudy means the study of a country in the geography department ontology. Manual processing of such name conflicts does not scale up with the increase in size, number, and complexity of ontologies.

Localized Semantics: Ontology construction requires different groups to adapt or reuse ontologies developed by other groups. However, unrestricted use of entities and relationships from different ontologies can result in serious semantic conflicts, especially when the ontologies in question represent local views of the ontology producers. For example, the computer science department may have no qualifying exam for PhD students; therefore, the CS ontology assumes that all PhDStudents are PhDCandidates, while in the electrical engineering department, which requires a qualifying exam for all PhD students, PhDStudents include both PhDCandidates and ProPhDs.

Partial Reuse: In collaborative design of ontologies, it often makes sense to reuse parts of existing ontologies. However, lack of modularity and localized semantics in ontologies forces an all-or-nothing choice. For example, a department library may want to reuse only the relevant part of the Congress Library Catalog (CLC) ontology in creating its own ontology. Nevertheless, current ontology languages such as OWL do not support such partial importing of an ontology. Modular ontologies will facilitate more flexible and efficient reuse of existing ontologies. For example, the computer science department library may reuse just the Q76 component of the CLC ontology.

Knowledge Hiding: In many applications, the provider of an ontology may not wish, because of copyright considerations or privacy or security concerns, to make the entire ontology visible to the outside, while being willing to expose certain parts of the ontology to certain subsets of users. For example, the SSN information of university employees should be hidden from general access, while visible to a particular component, such as the payroll office ontology. In other cases, an ontology component may only provide a limited query interface, while the details of the component are not important or necessary to users and other ontology components. For example, the department ontology may share course grading information with the registration office, but keep detailed homework and exam grading records for local access only.

Ontology Evolution: Ontology construction is usually an iterative process. A small change in one part of an ontology may be propagated in an unintended and hence undesirable manner across the entire ontology. For example, a term can be obsoleted (such as a graduated student name or a canceled course name) but still be referred to by other parts. A concept definition can be changed without informing other parts; for instance, the computer science department may introduce the PhD qualifying exam this year and change its definition of PhDStudent to PhDCandidate ⊔ ProPhD, while some other parts still hold to its old definition PhDStudent ≡ PhDCandidate.

Distinction between Organizational and Semantic Structure: Two kinds of knowledge structure can be recognized in a knowledge base. One is the organizational structure, which arranges terms for better usage and understanding. For example, we can have a complete, huge English dictionary, or instead we can have domain-specific dictionaries, such as a computer science dictionary and a life science dictionary. The second kind of structure is the semantic structure, which has to do with how the meanings of terms are related. For example, 'Mouse' is an 'Animal', or 'Mouse' is part of a 'PC'. In the university ontology example, we may compile the university people directory from multiple department directories (the organizational structure), and retain the classification of the people, such as 'Alice' is a 'Student' (the semantic structure).

Proposed Approach: Current ontology languages, like OWL, while they offer some degree of modularization by restricting ontology segments to separate XML namespaces, fail to fully support modularity, localized semantics, and knowledge hiding. This state of affairs in ontology languages is reminiscent of the early programming languages and first attempts at software engineering, when


uncontrolled use of global variables, spaghetti code, and the absence of well-defined modules led to unwanted and uncontrolled interactions between code fragments. In this paper, we argue for package-based ontology languages to overcome these limitations. A package is an ontology module with a clearly defined access interface. Mapping between packages is performed by views, which define a set of visible terms on the referred packages. Semantics are localized by hiding the semantic details of a package behind appropriate interfaces (special views). Packages provide an attractive compromise between the need for knowledge sharing and the need for knowledge hiding in the collaborative design and use of ontologies. The structured organization of ontology entities in packages brings to ontology design and reuse the same benefits as those provided by packages in software design and reuse in software engineering.

3 Package-based Ontologies

This section explores the basic definitions of package-based ontologies.

3.1 General Ontology

In knowledge engineering, an ontology is the shared specification of a conceptualization [13]. Here we first give a general definition of an ontology.

Definition 1 (Ontology) An ontology is a tuple (S, R), where S = SI ∪ SC ∪ SP is the set of terms, with SI, SC, SP the sets of individuals, concepts, and properties, respectively, and R ⊆ S × S is the set of associations among the terms. An interpretation of an ontology is a tuple (∆, Φ), where ∆ is the domain of all possible individuals, a subset of ∆ is the interpretation of a concept, and Φ ⊆ ∆^n, n ≥ 2, is the domain of all properties. The interpretation of a term t is denoted as t^I.

For example, consider an ontology (S, R) where S = {Alice, Bob, Student, People, Knows} and R = {Student ⊑ People, Alice isa Student, Bob isa Student}. Then SI = {Alice, Bob} is the individual set, SC = {Student, People} is the concept set, and SP = {Knows} is the property set. In an interpretation of the ontology, concepts are subsets of the domain, e.g., Student^I ⊂ ∆; individuals are elements of the domain, e.g., Alice^I ∈ ∆, Bob^I ∈ ∆; and properties are subsets of Cartesian products over the domain, e.g., Knows^I ⊂ People^I × People^I. The association Student ⊑ People is interpreted as Student^I ⊆ People^I, and Alice isa Student as Alice^I ∈ Student^I.
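To make Definition 1 concrete, the following is a minimal Python sketch (mine, not part of the paper; all identifiers are illustrative) that encodes the example ontology above and checks one possible interpretation of it:

from collections import namedtuple

# A toy encoding of Definition 1: terms partitioned into individuals,
# concepts, and properties, plus a set of associations among terms.
Ontology = namedtuple("Ontology", ["individuals", "concepts", "properties", "associations"])

example = Ontology(
    individuals={"Alice", "Bob"},
    concepts={"Student", "People"},
    properties={"Knows"},
    associations={("Student", "subsumedBy", "People"),
                  ("Alice", "isa", "Student"),
                  ("Bob", "isa", "Student")},
)

# One possible interpretation: concepts as subsets of a domain of individuals.
interpretation = {"Student": {"Alice", "Bob"}, "People": {"Alice", "Bob", "Carl"}}
assert interpretation["Student"] <= interpretation["People"]  # Student subsumed by People
assert "Alice" in interpretation["Student"]                   # Alice isa Student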

A specific ontology language, such as the description logic SHIQ(D), can be obtained as an extension of this general ontology definition. In Section 4, we will study the partial-order ontology as a concrete example.

3.2 Package and Package Hierarchy

In this research, we introduce the package [2] as the basic module of an ontology. In such a modular ontology representation, an ontology is divided into smaller components called packages; each package contains a set of highly related terms and their relations; packages can be nested in other packages, forming a package hierarchy; and the visibility of a term is controlled by scope limitation modifiers, such as public, private, and protected. The whole ontology is composed of a set of packages. We first give the definition of a package.

Definition 2 (Package) A package P = (SP, RP) of an ontology O = (S, R) is a fragment of O such that SP ⊆ S and RP ⊆ R. The set of all possible packages is denoted as ∆P. A term T ∈ SP is called a member of P, which can also be denoted as T ∈ P. P is called the home package of T, denoted P = HP(T). The global package Pglobal is a special package which is the default package for any term without package membership.

For example, suppose an ontology O has two packages Pstudent and Pfaculty. Pstudent has the terms Alice, Bob, Student, and People; Pfaculty has the terms Carl, Dell, and Faculty. Alice is a member of Pstudent, and Pstudent is the home package of Alice: Pstudent = HP(Alice). Both Pstudent and Pfaculty are elements of the package domain ∆P.

A package can be declared as a sub-package of another package. Therefore, the whole ontology will have an organizational hierarchy, in addition to the semantic hierarchy. Formally, we define package nesting as follows.

Definition 3 (Package Nesting) A package P1 can be nested in another package P2, denoted as P1 ∈N P2. All package nesting relations in an ontology form the organizational hierarchy of the ontology. Transitive nesting ∈*N is defined as:
• P1 ∈N P2 → P1 ∈*N P2
• P1 ∈*N P2 and P2 ∈*N P3 → P1 ∈*N P3

One advantage of the package hierarchy is that both the concepts and individuals of a module can be structured in an organizational hierarchy, while their semantics


could have a different hierarchy or no hierarchy at all. For example, the Pstudent package may include two sub-packages, Pcs-student and Pee-student, developed by the computer science department and the electrical engineering department, respectively. Part of the contents of these packages is shown in Figure 1. We can see that the concept semantic hierarchy is different from the package organizational hierarchy, and that individuals have an organizational hierarchy but no semantic hierarchy.
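A minimal sketch (mine, not the authors' implementation; the package and term names below are illustrative) of Definitions 2 and 3: packages as sets of member terms, home-package lookup, and transitive nesting over the organizational hierarchy:

# Toy packages following the Pstudent / Pfaculty example above.
packages = {
    "P_student":    {"Alice", "Bob", "Student", "People"},
    "P_faculty":    {"Carl", "Dell", "Faculty"},
    "P_cs_student": {"CsStudent"},
    "P_ee_student": {"EeStudent"},
}
nested_in = {"P_cs_student": "P_student", "P_ee_student": "P_student"}

def home_package(term):
    """HP(t): the package in which a term is declared (Pglobal by default)."""
    for p, terms in packages.items():
        if term in terms:
            return p
    return "P_global"

def transitively_nested(p1, p2):
    """Check p1 is (transitively) nested in p2 by following the nesting chain."""
    p = nested_in.get(p1)
    while p is not None:
        if p == p2:
            return True
        p = nested_in.get(p)
    return False

assert home_package("Alice") == "P_student"
assert transitively_nested("P_cs_student", "P_student")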

Figure 2. Package-based Ontology: The ontology has three packages P1, P2, P3. P3 is nested in P2. A public term in P1 is visible to P2, while a private term in P1 is only visible in the home package (P1).

Figure 1. Organizational Hierarchy and Semantic Hierarchy

When SLMs are also included in the package definition, a package is defined as (SP, RP, SLMP), where for any t ∈ SP there is one and only one SLM ∈ SLMP for t. The scope of a term t is the set of packages from which t is visible, i.e., scope(t) = {p | SLM(p, t) = TRUE}.

3.3 Term Scope Limitation

A term defined in a package can also be associated with a scope limitation modifier (SLM). The SLM of a term controls the visibility of the term to other packages. For example, a term with SLM 'public' can be accessed from any package (see Figure 2). Formally, an SLM is defined as follows.

Definition 4 (SLM) The scope limitation modifier of a term t in a package P is a boolean function V(p, t), where p is another package, and p can access t iff V(p, t) = TRUE. We denote t ∈V P. We define three default SLMs as:
• public(p, t) := TRUE, meaning term t is accessible everywhere.
• protected(p, t) := (t ∈ p) or p ∈*N HP(t), meaning t is visible to its home package and all its descendant packages in the organizational hierarchy.
• private(p, t) := (t ∈ p), meaning t is visible only to its home package.

We can also define other types of SLMs as needed. For example, friend(P, t) := TRUE will grant a particular package P access to t.
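The default SLMs of Definition 4 can be read as simple boolean predicates. The following is a minimal Python sketch (mine, not the paper's; the package names, the SSN example, and the friend_of reading are my assumptions), including the horizon computation discussed below:

# Toy data: home package of each term, and the nesting relation.
home_of = {"SSN": "P_cs_student", "Student": "P_student"}
nested_in = {"P_cs_student": "P_student"}

def is_in_subtree(p, ancestor):
    # True if p is the ancestor itself or transitively nested in it.
    while p is not None:
        if p == ancestor:
            return True
        p = nested_in.get(p)
    return False

def public(p, t):
    return True

def private(p, t):
    return p == home_of[t]

def protected(p, t):
    return is_in_subtree(p, home_of[t])

def friend_of(friend_pkg):
    # One possible reading of friend(P, t): visible to that package and the home package.
    return lambda p, t: p == friend_pkg or p == home_of[t]

# horizon(p): the set of terms visible to package p.
slm = {"SSN": friend_of("P_registration"), "Student": public}
def horizon(p):
    return {t for t, allowed in slm.items() if allowed(p, t)}

assert horizon("P_registration") == {"SSN", "Student"}
assert horizon("P_faculty") == {"Student"}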

The horizon of a package p is the set of all terms that are visible to p, i.e., horizon(p) = {t | SLM(p, t) = TRUE}.

Scope limitation can serve multiple purposes, such as:
• Hide private or copyrighted information. For example, the Pcs-student package may include SSN information which is not shared except with particular friend packages (such as Pregistration).
• Restrict extensibility and reduce unwanted coupling. When different packages define the ontology at different levels of abstraction, a high-level package may publicize its 'open' terms (the terms that need further specialization) to low-level packages, and keep 'final' terms (the terms that do not need extension) local, thereby reducing the possibility of unintended coupling. For example, in the design stage of the university ontology, to build the administration hierarchy of university units, the high-level package Padm may make 'President' a final term and 'DepartmentHead' an open term; the lower-level package Pcs-adm may extend 'DepartmentHead' with 'CsDepartmentHead' as a final term, and further keep 'CsLabDirector' as an open term.


of the package with reduced complexity and focused topics. For example, the publication ontology module of the computer science department (as a part of the research activities ontology module) contains the BibTeX information for all publications over the years in the department. As the module (as package Pcs-pub) may contain thousands of terms, it may also provide specialized interfaces for different topics, such as software engineering (with interface Vcs-pub-se) or artificial intelligence (with interface Vcs-pub-ai).

• Hide details to improve usability. For example, Pstudent defines the property CourseGrading as public but keeps HomeworkGrading visible only to sub-packages and Pregistration.

3.4 Views and Interfaces

Views and interfaces provide a flexible way to connect and reuse packages. A view is a set of visible terms from one or more packages. An interface is a view over only one package. Formally we have

• Interfaces can help in ontology evolution. In the design of an ontology, the details of an ontology module may change over time. However, with a well-defined interface, such changes can be hidden from other modules or users through the separation between change-prone terms and stable terms. For example, the Pcs-student package may contain an interface Vcs-current-student, which contains the current students in the department. While the current student list is ever-changing and the definition of 'current student' might change (e.g., to exclude minor-degree students), the interface can be kept the same.

Definition 5 (View and Interface) A view V for a set of packages P1, ..., Pn, denoted as P1, ..., Pn :: V, is a term set V ⊆ {t | ∀p, SLM(p, t) = TRUE, t ∈ Pi, 1 ≤ i ≤ n}. A view is an interface if n = 1. The default interface of a package P is the public term set of the package, i.e., {t | t ∈public P}.
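A minimal sketch (mine, not the authors'; package and term names are illustrative) of Definition 5: a view collects universally visible terms from one or more packages, and the default interface of a package is its public term set:

# Each package maps its terms to an SLM label; only 'public' terms are
# visible to every package and thus eligible for views in this sketch.
packages = {
    "P_cs_student": {"Name": "public", "Advisor": "public", "SSN": "private"},
    "P_cs_faculty": {"Name": "public", "Salary": "private"},
}

def view(*pkgs):
    """Maximal view over the given packages: all terms visible everywhere."""
    return {t for p in pkgs for t, slm in packages[p].items() if slm == "public"}

def default_interface(pkg):
    """The default interface of a package is its public term set."""
    return view(pkg)

print(sorted(view("P_cs_student", "P_cs_faculty")))  # e.g. a V_cs_people-style view
print(sorted(default_interface("P_cs_student")))     # ['Advisor', 'Name']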


Figure 3. View and Interface: (a) A package with multiple interfaces. (b) A view can be built upon multiple packages and can be referred to by multiple modules.

We can benefit from views and interfaces in the following ways:
• Views can help to integrate ontology components (Figure 3(b)). The terms defined in a view can come from different packages, therefore providing a way to combine ontology pieces. For example, a Vcs-people view can be defined to unify information from both Pcs-student and Pcs-faculty.
• Views can be used to 'customize' packages, i.e., to obtain polymorphism (Figure 3(a)). Packages can have multiple views for different purposes. For example, the terms in package Pcs-student can be included in the view Vcs-people (the people in the computer science department) or Pstudent (students in the university) for different usages.
• Packages can be partially reused with interfaces. An interface contains a selected set of terms defined in a package, therefore enabling the reuse

The aforementioned definitions of package-based ontology are summarized in Table 1. ∆S is the ontology term domain, which is the set of all possible names in the ontology. ∆P is the domain of all possible packages.

4 Package-based Partial-order Ontology (PPO)

In this section, we study a concrete package-based ontology language to show how to extend an existing ontology language into a modular ontology language. We choose the partial-order ontology (PO) as the example. PO is relatively simple but representative and widely applicable, since it can be used as the theoretical model for the most popular ontologies, i.e., hierarchies.

4.1 Partial-order Ontology

First, we give the standard definitions of a partial order and a partial-order ontology.

Definition 6 (Partial-Order) A partial order ≤ over a set S is a relation over S × S such that:
• ≤ is transitive: x ≤ y and y ≤ z → x ≤ z
• ≤ is reflexive: x ≤ x
• ≤ is anti-symmetric: x ≤ y and y ≤ x → x = y


Table 1. Syntax and semantics of Package-based Ontology

Constructor        | Syntax                 | Semantics
Package            | P                      | P^I ∈ ∆P
Global Package     | Pglobal                | Pglobal^I ∈ ∆P
Membership         | t ∈ P or member(t, P)  | member ⊆ ∆S × ∆P
Home Package       | HP(t)                  | HP(t)^I = p, where t ∈ p
Nesting            | ∈N                     | (∈N)^I ⊆ ∆P × ∆P
Transitive nesting | ∈*N                    | ∈N → ∈*N, ∈*N = (∈N)+
SLM                | SLM(p, t)              | p ∈ ∆P can access t ∈ ∆S iff SLM(p, t) = TRUE
                   | public(p, t)           | ∀p, public(p, t) := TRUE
                   | private(p, t)          | ∀p, private(p, t) := (p = HP(t))
                   | protected(p, t)        | ∀p, protected(p, t) := (p = HP(t) or p ∈*N HP(t))
View               | P1, ..., Pn :: V       | V^I ⊆ {t | ∀p, SLM(p, t) = TRUE, t ∈ Pi, 1 ≤ i ≤ n}
Interface          | P :: F                 | F^I ⊆ {t | ∀p, SLM(p, t) = TRUE, t ∈ P}

Definition 7 (Partial-Order Ontology) A partial-order ontology over a set S has:
• a set of partial-order axioms x ≤ y, where x, y ∈ S;
• a set of equivalence axioms x = y or nonequivalence axioms x ≠ y.

Each item in S is called a term. A partial-order ontology can be mapped to a graph with each term as a node and each partial-order axiom as a directed edge. Equivalent terms share the same node. In particular, a PO ontology can be a hierarchy, i.e., a Directed Acyclic Graph (DAG), which has no loops. A DAG is a tree if every node except one (the root, which has no incoming edge) has one and only one incoming edge. For example, we can describe professions with a hierarchy such as:

Faculty
    Professor
    ...
Student
    PhDStudent
    MasterStudent
    Undergraduate
Staff
    Secretary

where indentation denotes a partial-order axiom, e.g., Professor ≤ Faculty.
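As an illustration (my sketch, not part of the paper), the profession hierarchy above can be stored as a set of partial-order axioms and queried by graph reachability, which is the basic reasoning task for such ontologies mentioned later in the paper:

# Each pair (x, y) encodes a partial-order axiom x ≤ y from the hierarchy above.
axioms = {
    ("Professor", "Faculty"),
    ("PhDStudent", "Student"),
    ("MasterStudent", "Student"),
    ("Undergraduate", "Student"),
    ("Secretary", "Staff"),
}

def leq(x, y):
    """Decide x ≤ y by reachability over the axiom graph (reflexive, transitive)."""
    if x == y:
        return True
    stack, seen = [x], set()
    while stack:
        node = stack.pop()
        for (a, b) in axioms:
            if a == node and b not in seen:
                if b == y:
                    return True
                seen.add(b)
                stack.append(b)
    return False

assert leq("Professor", "Faculty")
assert not leq("Faculty", "Professor")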

4.2 Package-based Partial-order Ontology

A modular partial-order ontology has one or many packages. Each package in the ontology is a partial-order ontology component.

Definition 8 (Package-based PO ontology) A package-based partial-order ontology is a set of related packages {P1, ..., Pn}, where Pi = (Si, Ri), Si is a term set, and Ri is a set of partial-order, equivalence, or nonequivalence axioms.

Packages in a PPO can have different roles (see Figure 4), such as:
• Branch packages. A big hierarchy may be divided into smaller branches, each of which is a sub-package. For example, to build the administration hierarchy, we may create high-level topic packages, such as Padm, and then define nested packages such as Pcs-adm or Pee-adm. Here, packages at different levels capture different levels of abstraction of concepts.
• Aspect packages. An ontology may contain hierarchies for different aspects of the domain. For example, in addition to the administration hierarchy, we may also describe people with the profession hierarchy.

4.3 Building PPO

We developed the INDUS DAG Editor, a modular ontology editor, to enable the building and deployment of PPO. The editor enables the ontology developer to create a community-shared ontology server with


Figure 5. The INDUS DAG Editor, a collaborative editor for package-based partial-order ontologies

• The editor supports multi-relation hierarchies. For example, the ontology can have both is-a and part-of hierarchies while some terms are shared by both hierarchies. Such a structure is widely used in life science ontologies.
• The editor supports multiple users concurrently editing an ontology. A locking mechanism is provided to avoid conflicts and abnormalities. Modules of the ontology can be developed by different authors and assembled at later stages.

Figure 4. Example of Package-based Partial-Order Ontology

database storage, which supports concurrent browsing and editing of the ontology. The editor supports team collaboration with features such as:
• The ontology is stored on a relational database server; a user can connect to the server, check out one or more packages, edit them, and check them back into the database when finished editing.

• User profile management is provided. Authors of the ontology have different levels of privileges (such as ontology admin and package admin) over modules in that ontology. The author of a package can grant other users access to certain terms, thereby controlling the extensibility of that package.
• The editor provides a handy graphical user interface (GUI) for editing and browsing a DAG.
• The editor can import and export ontologies in standard formats, such as OBO (Open Biomedical Ontologies) and OWL (Web Ontology Language).

In Figure 5, we show a snapshot of the editor loaded with the Gene Ontology (GO). The three namespaces of


GO are modeled as three top-level packages. Each package can be viewed in either an is-a view or a part-of view. The neighboring terms of a selected term in both hierarchies are shown in the smaller browsing panels. A user can open one package instead of the whole ontology to reduce the memory requirement.

4.4 Design Principles of PPO

In the practice of package-based ontology building, we have gained some experience with the design of modular hierarchies.

Moderate size. A package should be neither too small nor too big. Typically a package is developed by an autonomous author, and the author should have good knowledge of the package. A large package is beyond the control of a single author, and is therefore unsafe with respect to consistency and inefficient for loading and reasoning. We are working on a package decomposition method to limit the size of a single package. A very small package is also inefficient, being a piecemeal module.

Overlap the package hierarchy and the semantic hierarchy when possible. It is best to keep the package hierarchy consistent with the semantic hierarchy (such as the is-a hierarchy), since this is semantically clearer and safer. High-level packages should contain general concepts, while low-level packages should contain specialized concepts.

Reuse 'patterns'. If a 'pattern' is repeated in different packages, it is better to extract it as a reusable package. For example, if the hierarchy DepartmentHead - LabDirector - ResearchAssistant repeats in many departments, it can be extracted into a common template package for all departments.

Term scope limitation can be changed in the released ontology. The use of term SLMs in the design stage and the release stage can differ. In the design stage, we focus on avoiding name conflicts, reducing unwanted coupling, lowering memory demand, and good evolvability, while in the release stage, we focus on partial reusability and hiding critical information. For example, we mentioned in Section 3.3 that Padm may make 'President' a private final term to restrict its extensibility. We can change it to a public term in the released ontology.

Package-based ontologies can be reduced to normal ontologies. If partial reuse and information hiding are not concerns, a package-based ontology can be reduced to a 'normal' ontology without packages in the released version. For example, even if we edit GO with a package-based structure, to release it in a generally accepted format, the released ontology can contain only the semantic structure of the ontology, but

no package organization or scope limitation. This is a way to reconcile modular ontology building with current non-modular ontology usage.
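The release-time reduction just described can be sketched as follows (my illustration, not the paper's tooling; the dictionary layout and names are assumptions): the flattened ontology keeps only terms and semantic axioms, dropping package boundaries and SLMs.

# Toy packaged ontology: each package has terms, axioms, and SLM annotations.
packaged_ontology = {
    "P_adm":    {"terms": {"President", "DepartmentHead", "Employee"},
                 "axioms": {("DepartmentHead", "Employee")},
                 "slm": {"President": "private"}},
    "P_cs_adm": {"terms": {"CsDepartmentHead"},
                 "axioms": {("CsDepartmentHead", "DepartmentHead")},
                 "slm": {}},
}

def flatten(pkgs):
    terms, axioms = set(), set()
    for p in pkgs.values():
        terms |= p["terms"]      # package boundaries disappear
        axioms |= p["axioms"]    # semantic structure is preserved
    return {"terms": terms, "axioms": axioms}   # no SLMs, no organization

released = flatten(packaged_ontology)
print(sorted(released["axioms"]))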

5 Related Work

Distributed Logics: A number of distributed logic systems have been studied in recent years. Examples include Local Model Semantics [10] and Distributed First Order Logic (DFOL) [11], which emphasize local semantics and the compatibility relations among local models. Inspired by DFOL, Borgida et al. [4] extend description logics to obtain a distributed description logic (DDL) system. A DDL system consists of a set of distributed TBoxes and ABoxes connected by "bridge rules". Bridge rules are unidirectional, thereby ensuring that there is no "back-flow" of information among modules connected by a bridge rule. The authors in [15] extend local model semantics and harmonize local models via agreement on vocabulary provenance. Grau et al. [12] explore using E-connections to extend OWL; this approach is straightforward to implement on existing tableau OWL reasoners. Serafini et al. [17, 18] define a sound and complete distributed tableau-based reasoning procedure which is built as an extension of the standard description logic tableau. [18] also describes the design and implementation principles of a distributed reasoning system, called DRAGO (Distributed Reasoning Architecture for a Galaxy of Ontologies), that implements such a distributed decision procedure.

Modular Ontologies: Two approaches to the integration of separate ontologies have been developed based on DFOL and DDL. The Modular Ontology approach [19] offers a way to exploit modularity in reasoning. It also defines an architecture that supports local reasoning by compiling implied subsumption relations, and offers a way to maintain the semantic integrity of an ontology when it undergoes local changes. In the "view-based" approach to integrating ontologies, all external concept definitions are expressed in the form of queries. However, the A-Box is missing in the query definition, and the mapping between modules is unidirectional, making it difficult to preserve local semantics. Fikes et al. [9] mention the integration of modular ontologies in the Ontolingua system, which restricts symbol access to public or private. The major difference between our approach and theirs is that we use packages not only as modular ontology units, but also in organizational hierarchies, thereby enabling the hierarchical management of modules in collaborative ontology building. The scope limitation modifier idea is


an extension of the symbol access restriction, but it is more flexible and expressive.

Contextual Ontology: Contextual logic, a formalism based on DDL, emphasizes localized semantics in ontologies. A contextual ontology keeps contents local and maps the content to other ontologies via explicit bridge rules. Bouquet et al. [5] proposed CTXML, which includes a hierarchy-based ontology description and a context mapping syntax. They further [6] combined CTXML and OWL into Context OWL (C-OWL), a syntax for bridge rules over OWL ontologies. Our approach offers several improvements over C-OWL by introducing scope limitation modifiers (SLMs). Bridge rules can be viewed as special cases of queries, and SLMs offer a controllable way to keep content local by definition. Serafini et al. [14] give a survey of existing ontology mapping languages, such as DDL, C-OWL, OIS [7], DLII [8], and E-connections, by translating them into distributed first order logic. None of those approaches provides scope limitation, which limits their support for partial ontology reuse and for avoiding unintended coupling.

6 Conclusion

Modularity in ontologies benefits ontology engineering in several ways. It simplifies the maintenance and evolution of ontologies and of mappings between ontologies; it enables flexible partial reuse of a large ontology; it offers a more efficient organization for an ontology; and it improves processing time when querying an ontology. In this research, we introduce the package as the basic module of ontologies. In such a modular ontology representation, an ontology is divided into smaller components called packages, and each package contains a set of highly related terms and their relations; packages can be nested in other packages, forming a package hierarchy; the visibility of a term is controlled by scope limitation modifiers such as public, private, and protected. The whole ontology is composed of a set of packages. We have shown, as an example, a package-based partial-order ontology and described the implementation of a collaborative ontology building tool for such ontologies. The major contributions of the paper are:

• The formal definition of a package-based ontology as a modular ontology approach for large-scale ontology building and use.
• The proposal of scope limitation for ontology terms to support partial knowledge hiding in knowledge sharing. Such a design helps to avoid unintended coupling and to keep private information hidden.
• The implementation of a modular ontology editor for collaborative ontology building with good scalability.

We will continue this work in several directions:
• Extend the language expressiveness. We have proposed a preliminary framework to extend description logics (DL) with packages [1]. For example, the well-known DL wine ontology1 can be divided into several packages, such as a Region package, a Food package, and a Wine package (inside Food). This will help to partially reuse the ontology, for example, to reuse the Region ontology in other applications. In the definition of grapes, a fine-grained concept like CabernetFranceGrape, which is only of interest to wineries and not to other domains, can be declared as 'protected' in Wine. We will study the syntax and semantics of package-based description logics as well as the reasoning tasks for such ontologies.
• The distributed reasoning algorithm. Reasoning with package-based ontologies is a special case of distributed ontology reasoning. The native reasoning ability is provided by local modules, and the overall reasoning process is based on these native reasoners. We will study reasoning algorithms for PPO. Since the basic reasoning task in a single DAG can be reduced to a graph reachability problem or the transitive closure computation problem, the reasoning task of PPO will be a distributed version of such problems. Also, because modular ontologies have special properties compared to general graphs (for example, limited depth and limited branching factor), we may find optimized algorithms with better performance in time complexity and/or space complexity.
• Improve the modular ontology editor. We will extend the existing tool to more expressive languages such as OWL and connect the modular ontology reasoner with the editor to ensure the consistency of the ontology.

Acknowledgment


This research is supported in part by grants from the National Science Foundation (0219699) and the National Institutes of Health (GM 066387) to Vasant Honavar.


1 http://www.w3.org/TR/2003/CR-owl-guide20030818/wine#


References
[1] J. Bao and V. Honavar. Ontology language extensions to support localized semantics, modular reasoning, and collaborative ontology design and ontology reuse. Technical report, TR-341, Computer Science, Iowa State University, 2004.
[2] J. Bao and V. Honavar. Ontology language extensions to support localized semantics, modular reasoning, collaborative ontology design and reuse. In 3rd International Semantic Web Conference (ISWC2004), Poster Track, 7-11 November 2004, Hiroshima, Japan, 2004.
[3] T. Berners-Lee, J. Hendler, and O. Lassila. The semantic web. Scientific American, May 2001.
[4] A. Borgida and L. Serafini. Distributed description logics: Directed domain correspondences in federated information sources, 2002.
[5] P. Bouquet, A. Dona, L. Serafini, and S. Zanobini. Conceptualized local ontologies specification via CTXML. In Working Notes of the AAAI-02 Workshop on Meaning Negotiation, Edmonton, Canada.
[6] P. Bouquet, F. Giunchiglia, and F. van Harmelen. C-OWL: Contextualizing ontologies. In Second International Semantic Web Conference, volume 2870 of Lecture Notes in Computer Science, pages 164-179. Springer Verlag, 2003.
[7] D. Calvanese, G. D. Giacomo, and M. Lenzerini. A framework for ontology integration. In The Emerging Semantic Web, 2001.
[8] D. Calvanese, G. D. Giacomo, and M. Lenzerini. Description logics for information integration. In Computational Logic: Logic Programming and Beyond, pages 41-60, 2002.
[9] R. Fikes, A. Farquhar, and J. Rice. Tools for assembling modular ontologies in Ontolingua. In AAAI/IAAI, pages 436-441, 1997.
[10] C. Ghidini and F. Giunchiglia. Local model semantics, or contextual reasoning = locality + compatibility. Artificial Intelligence, 127(2):221-259, 2001.
[11] C. Ghidini and L. Serafini. Distributed First Order Logics. In Frontiers of Combining Systems 2, Studies in Logic and Computation, pages 121-140. Research Studies Press, 1998.
[12] B. C. Grau, B. Parsia, and E. Sirin. Working with multiple ontologies on the semantic web. In International Semantic Web Conference, pages 620-634, 2004.
[13] T. R. Gruber. A translation approach to portable ontology specifications. Knowl. Acquis., 5(2):199-220, 1993.
[14] L. Serafini, H. Stuckenschmidt, and H. Wache. A formal investigation of mapping language for terminological knowledge. In 19th IJCAI, 2005.
[15] Y. Qu and Z. Gao. Interpreting distributed ontologies. In Alternate Track Papers and Posters of the 13th International Conference on World Wide Web, pages 270-271. ACM Press, 2004.

[16] J. Reinoso-Castillo, A. Silvescu, D. Caragea, J. Pathak, and V. Honavar. Information extraction and integration from heterogeneous, distributed, autonomous information sources: A federated, querycentric approach. In IEEE International Conference on Information Integration and Reuse, 2003. [17] L. Serafini and A. Tamilin. Local tableaux for reasoning in distributed description logics. In Description Logics, 2004. [18] L. Serafini and A. Tamilin. Drago: Distributed reasoning architecture for the semantic web. In ESWC, pages 361–376, 2005. [19] H. Stuckenschmidt and M. Klein. Modularization of ontologies - wonderweb: Ontology infrastructure for the semantic web.


OntoQA: Metric-Based Ontology Quality Analysis Samir Tartir, I. Budak Arpinar, Michael Moore, Amit P. Sheth, Boanerges Aleman-Meza LSDIS Lab Department of Computer Science University of Georgia Athens, GA 30602 USA {tartir, budak, moore, amit, boanerg}@cs.uga.edu

Abstract

As the Semantic Web gains importance for sharing knowledge on the Internet, many ontologies have been developed and published in different domains. When trying to reuse existing ontologies in their applications, users are faced with the problem of determining whether an ontology is suitable for their needs. In this paper, we introduce OntoQA, an approach that analyzes ontology schemas and their populations (i.e., knowledgebases) and describes them through a well-defined set of metrics. These metrics can highlight key characteristics of an ontology schema as well as its population and enable users to make an informed decision quickly. We present an evaluation of several ontologies using these metrics to demonstrate their applicability.

1. Introduction

The Semantic Web envisions making web content machine processable, not just readable or consumable by human beings [3]. This is accomplished by the use of ontologies, which involve agreed-upon terms and their relationships in different domains (e.g., the gene ontology (GO) and other ontologies at Open Biology Ontologies1 in biology, as well as general-purpose ontologies such as SWETO, the Semantic Web Technology Evaluation Ontology [1], and TAP [6]). Different users can agree on the use of a common ontology in RDF(S) (Resource Description Framework) [15, 16] or OWL (Web Ontology Language) [12] in order to annotate their content or resolve their differences through interactions and negotiations (i.e., emergent semantics). An ontology describes a hierarchy of concepts usually related by subsumption relationships. In more sophisticated cases, suitable axioms are added in order to express other relationships between concepts and to constrain their intended interpretation [5]. After an ontology is constructed, it is usually populated by instances either manually, semi-automatically, or mostly automatically. For example, the IMPs architecture [22] and SCORE [21] facilitate the retrieval, crawling, extraction, disambiguation, restructuring, integration, and

formalization of task-relevant ontological knowledge from semi-structured and structured sources on the web. Assessing the quality of an ontology is important for several reasons: it allows the ontology developer to automatically recognize areas that might need more work, it allows the ontology user to know what parts of the ontology might cause problems, and it allows him/her to compare different ontologies when only one is going to be used. In our view, the quality of ontologies can be assessed in different dimensions. For example, quality metrics can be used to evaluate the success of a schema in modeling a real-world domain such as computer science researchers and their publications (Quality 1 in Figure 1). The depth, breadth, and height balance of the schema inheritance tree can play a role in such a quality assessment. Additionally, the quality of a populated ontology (i.e., KB) can be measured to check whether it is a rich and accurate representation of real-world entities and relations (Quality 2 in Figure 1). Finally, the quality of the KB can be measured to see if the instances and relations agree with the schema (Quality 3 in Figure 1). We propose a method to evaluate the quality of an ontology along the different dimensions mentioned above. This method can be used by ontology users before adopting an ontology as a source of information, or by ontology developers to evaluate their work in building the ontology.

Fig. 1. Different dimensions to evaluate ontology quality

1 http://obo.sourceforge.net


Our contributions in this paper are the following:
• Categorizing the quality metrics of ontologies into three groups: schema, knowledgebase (KB), and class metrics. These metrics serve as a means to evaluate the quality of a single ontology or to compare ontologies when more than one candidate fits certain requirements.
• Providing metrics to quantitatively assess the quality in each group.
• A tool for quality analysis, along with experimental results.

The rest of the paper is organized as follows: Section 2 provides the motivation for our work. Section 3 details the model we base our work on. Section 4 presents the metrics of our model. Section 5 discusses the implementation and presents experimental results. Finally, Section 6 discusses previous related work and compares that work to our approach.

2. Motivation

The motivation for our work began during our work on the SWETO ontology [1]. SWETO is intended to be a broad, general-purpose ontology covering multiple domains and populated with real data from heterogeneous sources. Not limiting SWETO to a single domain enabled us to harvest facts from open-source, non-copyrighted Web sources to populate it with approximately one million facts. We wanted it to serve as a test-bed for advanced semantic applications, such as the discovery of semantic associations and semantic entity disambiguation in our own research, as well as to make it available to the research community for scalability and performance testing of techniques at the RDF/S level. Semantic associations [2] are the paths (entities and relationships) that connect two different entities. The nature of SWETO requires the careful design of the schema and the extraction of data from a large number of distinct resources to cover the different schema classes in a way that represents the real world. SWETO includes some geographical data represented by classes of cities, states, and countries. It also contains information about logistic and financial aspects of terrorism. The publications domain is included in SWETO by adding classes representing Researchers, Scientific Publications, Journals, Conferences, and Books. SWETO also includes information about business organizations such as Companies and Banks. The extraction process was done mostly automatically in several phases and resulted in hundreds of thousands of instances in the KB. Some of the sources

used were the CIA World Factbook2, which includes rich geographical information, and conference web sites. After each phase of the extraction process, there was a need to evaluate the quality of the extracted data and to decide on the targets for the next extraction phase. Some of the issues that needed additional attention included the abundance of instances in some parts of the schema while other parts had no instances, and the fact that instances of some classes use only some of the relationships defined in the schema while ignoring the others, as will be shown in the experimental results in Section 5. These problems can result in a lack of rich semantic associations in the SWETO KB; for example, the relationship found between two persons may be restricted to co-authorship of a certain publication, while a more interesting relationship that could establish business interests between these two persons was not captured because it was not extracted. The discovery of these and similar problems is a difficult process because of the large number of classes in the schema and the large number of instances that belong to these classes. The set of metrics presented here can be used to describe an ontology's schema and KB to provide the ontology designer with information they can use to further enhance the ontology. These metrics can be used not only during the development of ontologies but also by a user looking for an ontology to suit his/her needs, to compare between different existing ontologies.

3. Model

OntoQA describes different metrics of an ontology using the vocabulary defined in an RDF-S or OWL document and the instances defined in an RDF file, requiring no further information for any metric (with the exception of the metric that requires information about the expected number of instances for each class). The model considers how classes are organized in the schema and how instances are distributed across the schema. The model used in the definition of the metrics is based on [8]. It formally defines the schema and KB structures.

Ontology structure (Schema). An ontology schema is a 6-tuple O := {C, P, A, H^C, prop, att}, consisting of two disjoint sets C and P whose elements are called concepts and relationships, respectively; a concept hierarchy H^C, a directed, transitive relation H^C ⊆ C × C also called the concept taxonomy, where H^C(C1, C2) means that C1 is a sub-concept of C2; and a function prop: P → C × C that relates concepts non-taxonomically (the function

2 http://www.cia.gov/cia/publications/factbook


dom: P → C with dom(P) := π1(prop(P)) gives the domain of P, and range: P → C with range(P) := π2(prop(P)) gives its range; for prop(P) = (C1, C2) one may also write P(C1, C2)). A specific kind of relationship is the attribute set A. The function att: A → C relates concepts with literal values (this means range(A) := STRING).

Knowledgebase (metadata) structure. A metadata structure is a 6-tuple MD := {O, I, L, inst, instr, instl}, consisting of an ontology O; a set I whose elements are called instance identifiers (correspondingly, C, P, and I are disjoint); a set of literal values L; a function inst: C → 2^I called concept instantiation (for inst(C) = I one may also write C(I)); and a function instr: P → 2^(I×I) called relation instantiation (for instr(P) = {(I1, I2)} one may also write P(I1, I2)). The attribute instantiation is described via the function instl: A → 2^(I×L), which relates instances with literal values.
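A minimal sketch (mine, not the authors' implementation) of how the 6-tuples O and MD could be encoded for experimentation; the dictionary keys and the toy SWETO-like fragment below are my assumptions, not OntoQA's actual data structures:

# Toy ontology schema O: concepts, concept hierarchy H^C, relationships (prop), attributes (att).
schema = {
    "concepts": {"Person", "Researcher", "Publication"},
    "subclass_of": {("Researcher", "Person")},                     # H^C(C1, C2)
    "relationships": {"authorOf": ("Researcher", "Publication")},  # prop
    "attributes": {"name": "Person", "title": "Publication"},      # att
}

# Toy knowledgebase MD: instances (inst), relation instances (instr), attribute values (instl).
kb = {
    "instances": {"alice": "Researcher", "p1": "Publication"},
    "relation_instances": {("alice", "authorOf", "p1")},
    "attribute_values": {("alice", "name", "Alice")},
}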

4. The Metrics

The metrics we propose are not 'gold standard' measures of ontologies. Instead, they are intended to evaluate certain aspects of ontologies and their potential for knowledge representation. Rather than describing an ontology as merely effective or ineffective, the metrics describe a certain aspect of the ontology because, in most cases, the way the ontology is built is largely dependent on the domain for which it is designed. Ontologies modeling human activities (e.g., travel or terrorism) will have distinctly different characteristics from those modeling the natural (or physical) world (e.g., genomes or complex carbohydrates). We divide the metrics into two related categories: schema metrics and instance metrics. The first category evaluates ontology design and its potential for rich knowledge representation. The second category evaluates the placement of instance data within the ontology and the effective usage of the ontology to represent the knowledge modeled in the ontology.

4.1. Schema Metrics

The schema metrics address the design of the ontology. Although we cannot know whether the ontology design correctly models the domain knowledge, we can provide metrics that indicate the richness, width, depth, and inheritance of an ontology schema.

Relationship Richness: This metric reflects the diversity of relations and the placement of relations in the ontology. An ontology that contains many relations other than class-subclass relations is richer than a taxonomy with only class-subclass relationships.

Formally, the relationship richness (RR) of a schema is defined as the ratio of the number of relationships (P) defined in the schema, divided by the sum of the number of subclasses (SC) (which is the same as the number of inheritance relationships) plus the number of relationships.

RR = |P| / (|SC| + |P|)

The result of the formula will be a percentage representing how much of the connections between classes are rich relationships, compared to all of the possible connections that can include rich relationships and inheritance relationships. For example, if an ontology has an RR close to zero, that would indicate that most of the relationships are class-subclass (i.e., IS-A) relationships. In contrast, an ontology with an RR close to one would indicate that most of the relationships are other than class-subclass.

Attribute Richness: The number of attributes (slots) that are defined for each class can indicate both the quality of ontology design and the amount of information pertaining to instance data. In general, we assume that the more slots that are defined, the more knowledge the ontology conveys. Formally, the attribute richness (AR) is defined as the average number of attributes (slots) per class. It is computed as the number of attributes for all classes (att) divided by the number of classes (C).

AR = |att| / |C|

The result will be a real number representing the average number of attributes per class, which gives insight into how much knowledge about classes is in the schema. An ontology with a high value for the AR indicates that each class has a high number of attributes on average, while a lower value might indicate that less information is provided about each class.

Inheritance Richness: This measure describes the distribution of information across different levels of the ontology's inheritance tree, or the fan-out of parent classes. This is a good indication of how well knowledge is grouped into different categories and subcategories in the ontology. This measure can distinguish a horizontal ontology from a vertical ontology or an ontology with different levels of specialization. A horizontal (or flat) ontology is an ontology that has a small number of inheritance levels, and each class has a relatively large number of subclasses. In contrast, a vertical ontology contains a large number of inheritance levels where classes have a small number of subclasses. This metric


can be measured for the whole schema or for a subtree of the schema. Formally, the inheritance richness of the schema (IRS) is defined as the average number of subclasses per class. The number of subclasses of a class Ci is |H^C(C1, Ci)|, i.e., the number of classes C1 with H^C(C1, Ci), so

IRS = ( Σ_{Ci ∈ C} |H^C(C1, Ci)| ) / |C|
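As a concrete illustration (my sketch, not the authors' Java prototype), the three schema metrics above can be computed from a small dictionary encoding of a schema; the dictionary layout mirrors the sketch given after the Section 3 model and is an assumption:

def relationship_richness(schema):
    # RR = |P| / (|SC| + |P|)
    p = len(schema["relationships"])
    sc = len(schema["subclass_of"])
    return p / (sc + p) if (sc + p) else 0.0

def attribute_richness(schema):
    # AR = |att| / |C|
    return len(schema["attributes"]) / len(schema["concepts"])

def inheritance_richness(schema):
    # IRS = average number of subclasses per class
    return len(schema["subclass_of"]) / len(schema["concepts"])

toy = {"concepts": {"Person", "Researcher", "Publication"},
       "subclass_of": {("Researcher", "Person")},
       "relationships": {"authorOf"},
       "attributes": {"name", "title"}}
print(relationship_richness(toy), attribute_richness(toy), inheritance_richness(toy))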

The result of the IRS formula will be a real number representing the average number of subclasses per class. An ontology with a low IRS would be of a vertical nature, which might reflect a very detailed type of knowledge, while an ontology with a high IRS would be of a horizontal nature, which means that the ontology represents a wide range of general knowledge.

4.2. Instance Metrics

The way data is placed within an ontology is also a very important measure of ontology quality. The placement of instance data and the distribution of the data can indicate the effectiveness of the ontology design and the amount of knowledge represented by the ontology. Instance metrics are grouped into two categories: KB metrics, which describe the KB as a whole, and class metrics, which describe the way each class defined in the schema is being utilized in the KB.

4.2.1. Knowledgebase Metrics

Class Richness: This metric is related to how instances are distributed across classes. The number of classes that have instances is compared with the total number of classes, giving a general idea of how well the instances cover the classes defined in the schema. Formally, the class richness (CR) of a KB is defined as the ratio of the number of classes used in the KB (C') divided by the number of classes defined in the ontology schema (C).

CR = |C'| / |C|

The result will be a percentage indicating how rich in classes the KB is. Thus, if the KB has a very low CR, then the KB does not have data that exemplifies all the knowledge in the schema. On the other hand, a KB that has a very high CR (close to 100%) would indicate that the data in the KB represents most of the knowledge in the schema.
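A minimal sketch of class richness (mine, not the authors' implementation; the mapping of each instance to a single class is an assumption about the KB encoding):

def class_richness(schema_concepts, instance_of):
    """CR: fraction of schema classes that have at least one instance."""
    used = set(instance_of.values())          # C': classes actually populated
    return len(used & set(schema_concepts)) / len(schema_concepts)

concepts = {"Person", "Researcher", "Publication", "Organization"}
instance_of = {"alice": "Researcher", "p1": "Publication"}
print(class_richness(concepts, instance_of))  # 0.5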

Average Population (or average distribution of instances across all classes): This measure is an indication of the number of instances compared to the number of classes. It can be useful if the ontology developer is not sure if enough instances were extracted compared to the number of classes. Formally, the average population (P) of classes in a KB is defined as the number of instances of the KB (I) divided by the number of classes defined in the ontology schema (C).

P = |I| / |C|

The result will be a real number that shows how well the data extraction process performed in populating the KB. For example, if the average number of instances per class is low, this number, read in conjunction with the previous metric, would indicate that the instances extracted into the KB might be insufficient to represent all of the knowledge in the schema. Keep in mind that some of the schema classes might have a very low or a very high number of instances by the nature of what they represent.

Cohesion: If instances and the relationships among them are considered as a graph, where nodes represent instances and edges represent the relationships between them, this metric is the number of separate connected components of that graph. It can be used to indicate which areas need more instances in order to enable instances to be more closely connected. This metric can help when 'islands' form in the KB as a result of extracting data from separate sources that do not have common knowledge. Formally, the cohesion (Coh) of a KB is defined as the number of separate connected components (SCC) of the graph representing the KB.

Coh = SCC

The result will be an integer representing the number of separate components. For example, a more useful throughput of semantic-association discovery algorithms might be expected from an ontology with a Coh of 1, as this would indicate that all data in the KB is connected, and it will be possible to use a semantic association discovery algorithm without worrying about leaving out part of the KB.

4.2.2. Class Metrics

Importance: The percentage of instances that belong to classes in the subtree rooted at the current class, with respect to the total number of instances. This metric can also be called instance distribution, as it refers to the distribution of instances over classes. This metric is


important in that it will help in identifying which areas of the schema are in focus when the instances are extracted and inform the user of the suitability of his/her intended use. It will also help direct the ontology developer or data extractor towards where s/he should focus on getting data if the intention is to get a consistent coverage of all classes in the schema. Although this measure might not be exact, it can be used to give a clear idea on what parts of the ontology are considered focal and what parts are on the edges. Formally, the importance (Imp) of a class Ci is defined as the number of instances that belong to the subtree rooted at Ci in the KB (Ci(I)) compared to the total number of instances in the KB (I).

Imp = |Ci(I)| / |I|
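The following sketch (mine, not the paper's tool) computes the importance of a class by collecting its subtree over the subclass relation and counting the instances that fall in it; the traversal and data layout are my assumptions:

def subtree(cls, subclass_of):
    """All classes in the subtree rooted at cls (including cls)."""
    nodes, frontier = {cls}, [cls]
    while frontier:
        c = frontier.pop()
        for (sub, sup) in subclass_of:
            if sup == c and sub not in nodes:
                nodes.add(sub)
                frontier.append(sub)
    return nodes

def importance(cls, subclass_of, instance_of):
    """Imp = |Ci(I)| / |I| over the subtree rooted at cls."""
    tree = subtree(cls, subclass_of)
    in_tree = sum(1 for c in instance_of.values() if c in tree)
    return in_tree / len(instance_of)

subclass_of = {("Researcher", "Person")}
instance_of = {"alice": "Researcher", "bob": "Person", "p1": "Publication"}
print(importance("Person", subclass_of, instance_of))  # 2/3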

The result of the Imp formula will be a percentage representing the importance of the current class.

Fullness: This metric details the KB average population metric mentioned above. It would mainly be used by an ontology developer interested in knowing how well the data extraction performed with respect to the expected number of instances of each class. This is helpful in directing the extraction process to any resources that will add instances belonging to classes that are not full. Formally, the fullness (F) of a class Ci is defined as the actual number of instances that belong to the subtree rooted at Ci (Ci(I)) compared to the expected number of instances that belong to the subtree rooted at Ci (Ci'(I)).

F = |Ci(I)| / |Ci'(I)|

The result of the formula will be a percentage representing the actual coverage of instances compared to the expected coverage. In most cases, this measure is an indication of how well the instance extraction process performed. For example, a KB where most classes have a low F would require more data extraction. On the other hand, a KB where most classes are almost full would indicate that it reflects more closely the knowledge encoded in the schema.

Inheritance Richness: This measure details the schema IRS metric mentioned above and describes the distribution of information in the current class subtree, per class. This measure is a good indication of how well knowledge is grouped into different categories and subcategories under this class. Formally, the inheritance richness (IRC) of a class Ci is defined as the average number of subclasses per class in the subtree. The number of subclasses of a class Ci is |H^C(C1, Ci)|, and the number of nodes in the subtree is |C'|:

IRC = ( Σ_{Ci ∈ C'} |H^C(C1, Ci)| ) / |C'|

The result of the formula will be a real number representing the average number of classes per schema level. The interpretation of the results of this metric depends highly on the nature of the ontology. Classes in an ontology that represents a very specific domain will have low IRC values, while classes in an ontology that represents a wide domain will usually have higher IRC values.

Relationship Richness: This is an important metric reflecting how many of the properties defined for each class in the schema are actually being used at the instance level. It is a good indication of how well the extraction process performed in the utilization of information defined at the schema level. Formally, the relationship richness (RRC) of a class Ci is defined as the number of relationships that are being used by instances Ii that belong to Ci (P(Ii, Ij)) compared to the number of relationships that are defined for Ci at the schema level (P(Ci, Cj)).

RRC = |P(Ii, Ij), Ii ∈ Ci(I)| / |P(Ci, Cj)|
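A minimal sketch of class relationship richness (mine, not the authors' implementation; the triple and schema encodings are assumptions):

def class_relationship_richness(cls, schema_relations, instance_of, triples):
    """RRC: relations actually used by instances of cls over relations defined for cls."""
    defined = {r for r, (dom, rng) in schema_relations.items() if dom == cls}
    used = {r for (s, r, o) in triples
            if instance_of.get(s) == cls and r in defined}
    return len(used) / len(defined) if defined else 0.0

schema_relations = {"authorOf": ("Researcher", "Publication"),
                    "memberOf": ("Researcher", "Organization")}
instance_of = {"alice": "Researcher", "p1": "Publication"}
triples = {("alice", "authorOf", "p1")}
print(class_relationship_richness("Researcher", schema_relations, instance_of, triples))  # 0.5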

The result of the RRC formula will be a percentage representing how well the KB utilizes the knowledge defined in the schema regarding the class in focus. For example, if most classes have low RRC values, this would mean that instances are using only a few of the class relationships in the schema, in contrast to another ontology where instances have relationships that span most of the relationships available at the class level in the schema.

Connectivity: This metric is intended to give an indication of the number of relationships that instances of a class have with instances of other classes. This measure works in tandem with the importance metric mentioned above to create a better understanding of how central a role some classes play. For more detail, instances within a class can be grouped based on the number of relationships they have with other instances. Formally, the connectivity (Cn) of a class Ci is defined as the number of instances of other classes that are connected to instances of that class (Ij).

Cn = |{Ij : P(Ii, Ij) ∧ Ii ∈ Ci(I)}|

The result of the formula will be an integer representing the popularity of instances of the class. A class with a high Cn plays a central role in the ontology compared to a class with a lower value. This measure can be used to understand the nature of the ontology by


indicating which classes play a central role compared to other classes.

Readability: This metric indicates the existence of human-readable descriptions in the ontology, such as comments, labels, or captions. This metric can be a good indicator when the ontology is going to be queried and the results listed to users. Formally, the readability (Rd) of a class Ci is defined as the sum of the number of attributes that are comments and the number of attributes that are labels for the class.

Table 1 shows that TAP is the most general ontology, due to the large value of its inheritance richness (fan-out), and the richest of the three, with the largest number of classes. GlycO, on the other hand, is clearly domain-specific, as indicated by its small number of subclasses per class and its small number of instances. With a relatively high number of subclasses per class and a large number of instances, SWETO is somewhere in the middle, and it can be classified as a moderately general-purpose ontology.

Rd = |{A : A = rdfs:comment}| + |{A : A = rdfs:label}|

5.1. Class importance

The result of the formula will be an integer representing the availability of human-readable information for the instances of the current class.

Using the class importance metric to compare the above three ontologies clearly shows how they are intended to be used. Figure 2 shows the most important classes in each ontology.

5. Implementation and Experiments

Table 1. Summary of SWETO and TAP


We implemented the metrics presented above in a Java-based prototype. The system first calculates the ontology schema metrics, where the schema is defined in an RDFS or OWL file, and then uses the given RDF file to compute the instance metrics. Our implementation uses the Sesame RDF store [4] to load data for the ontology schema and KB. For data storage, both Sesame and Jena were considered; Sesame was selected because it was able to handle larger data sizes than the Jena data store [9]. The main obstacle in experimenting with our model was the lack of ontologies that offer their schema and have a KB of a large size (>1 MB) reflecting the intended use of the schema. Results of running the application on the following ontologies are discussed below:
1. SWETO. SWETO is our general-purpose ontology that covers domains including publications, affiliations, geography, and terrorism.
2. TAP [6]. TAP is Stanford's general-purpose ontology. It is divided into 43 domains, some of which are publications, sports, and geography.
3. GlycO [20]. GlycO is another ontology under development in the LSDIS Lab, for Glycan Expression. Its goal is to develop a suite of databases, in addition to computational tools, that facilitate efficient acquisition, description, analysis, sharing, and dissemination of the data contained therein.

Ontology | Classes | Instances | Inheritance Richness
SWETO    | 44      | 813,217   | 4.00
TAP      | 3,229   | 70,850    | 5.36
GlycO    | 352     | 2,034     | 1.56


Fig. 2. Class importance in (a) SWETO, (b) TAP, and (c) GlycO

From the figure it can be clearly seen that classes related to publications are the dominant classes in SWETO, while, with the exception of the Musician


class, TAP gives consistent importance to most of its classes, covering the different domains it includes. The nature of the GlycO ontology is reflected in its most important classes. The importance of the "N-glycan_residue", the "alpha-D-mannopyranosyl_residue", and other classes shows the narrow domain GlycO is intended for, although the "glycan_moiety" class is the most important class, covering about 90% of the instances in the KB.

Company) and geographic information (City and State). In a similar manner, TAP continues to show that it covers different domains, and its most connected classes cover the education domain (CMUCourse and CMUSCS_ResearchArea), the entertainment domain (TV and Movie), and other domains as well. GlycO's specific-purpose nature is evident from the Glycan-related classes that are most connected.

5.3. Class readability

5.2. Class connectivity

Computer_Science_ Researcher

CMU_RAD N-glycan_beta-DManp

Publication

Thing

Movie N-glycan_beta-DGlcpNAc

Scientific_Publicatio n

State

UnitedStatesPreside nt

Bank

Airport

ACM_Third_level_Cl assification

City

ACM_Second_level _Classification

Terrorist_Attack

120 100 80 60 40 20 0

Terrorist

Class Readability

Terrorist_Organizati on

Company

Scientific_Publicatio n

Computer_Science_ Researcher

ACM_Top_level_Cla ssification

State

ACM_Subject_Desc riptors

City

Airport

ACM_Second_level _Classification

Bank

Terrorist_Attack

9 8 7 6 5 4 3 2 1 0

ACM_Third_level_Cl assification

Class Connectivity

Class readability is a useful metric when there is an intention to frequently use an ontology by humans. Figure 4 shows the most readable classes in each of the three ontologies.
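One plausible way to approximate such connectivity numbers is to count, for each class, the relationship triples linking its instances to instances of other classes; the exact metric is defined earlier in the paper, so the rdflib sketch below (placeholder file name) is an illustration rather than the reference implementation:

from collections import Counter
from rdflib import Graph, URIRef
from rdflib.namespace import RDF

g = Graph()
g.parse("ontology.owl", format="xml")   # placeholder file name

# Map each instance to the classes it is typed with.
types = {}
for s, c in g.subject_objects(RDF.type):
    types.setdefault(s, set()).add(c)

connectivity = Counter()
for s, p, o in g:
    # Count object-property links between instances with no class in common.
    if p == RDF.type or not isinstance(o, URIRef):
        continue
    if s in types and o in types and not (types[s] & types[o]):
        for c in types[s]:
            connectivity[c] += 1

for cls, n in connectivity.most_common(10):
    print(cls, n)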

Fig. 3. Class connectivity in (a) SWETO (b) TAP and (c) GlycO

Figure 3 shows that SWETO also includes good information about domains other than publications, including the terrorism domain (Terrorist_Attack and Terrorist_Organization), the business domain (Bank and Company), and geographic information (City and State). In a similar manner, TAP continues to show that it covers different domains: its most connected classes cover the education domain (CMUCourse and CMUSCS_ResearchArea), the entertainment domain (TV and Movie), and other domains as well. GlycO's specific-purpose nature is evident from the glycan-related classes that are most connected.

5.3. Class readability

Class readability is a useful metric when an ontology is intended to be used frequently by humans. Figure 4 shows the most readable classes in each of the three ontologies.
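Readability can be estimated by counting human-oriented annotations (rdfs:label and rdfs:comment). Whether these are counted on the class itself or on its instances depends on the definition given earlier in the paper, so the rdflib sketch below (placeholder file name) simply reports both:

from rdflib import Graph
from rdflib.namespace import RDF, RDFS

g = Graph()
g.parse("ontology.owl", format="xml")   # placeholder file name

def annotation_count(node):
    """Number of rdfs:label / rdfs:comment triples attached to a node."""
    return (sum(1 for _ in g.objects(node, RDFS.label)) +
            sum(1 for _ in g.objects(node, RDFS.comment)))

for cls in set(g.objects(None, RDF.type)):
    on_class = annotation_count(cls)
    on_instances = sum(annotation_count(i) for i in g.subjects(RDF.type, cls))
    print(cls, "class annotations:", on_class, "instance annotations:", on_instances)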


Fig. 4. Class readability in (a) SWETO (b) TAP and (c) GlycO

To different degrees, all three ontologies include readable information. SWETO does not provide human-readable information for most of its classes, which can be a concern if the ontology is going to be used directly by humans. On the other hand, both TAP and GlycO can be considered human-friendly, as they provide descriptive information for most of their classes.

6. Related Work

In recent years, increasing attention has been given to ontology design and quality. In [7], the authors propose a complex framework consisting of 160 characteristics spread across five dimensions: content of the ontology, language, development methodology, building tools, and usage costs. Unfortunately, the use of the OntoMetric tool introduced in the paper is not clearly defined, and the large number of characteristics makes the model difficult to understand. [10] provides a seven-step guide for developing an ontology. The steps range from deciding what to include in the ontology, to building a good class hierarchy, to creating class slots (attributes), and finally to populating the KB of the ontology. This guide is intended for developers and would not help users evaluate an existing ontology. [13] uses a logic model to detect unsatisfiable concepts and inconsistencies in OWL ontologies. The approach is intended to be used by ontology designers to evaluate the quality of their work and to indicate possible problems. In [19] the authors propose a model for evaluating ontology schemas. The model contains two sets of features: quantifiable and non-quantifiable. It crawls the web (causing some delay, especially when the user already has specific ontologies to evaluate), searches for suitable ontologies, and then returns the ontology schemas' features to allow the user to select the most suitable ontology for the application. It does not consider the quality of ontologies' KBs, which can provide more insight into the way an ontology is actually used. [11] defines a framework for comparing ontology schemas. It compares CYC, Dahlgren's, Generalized Upper Model, GENSIM, KIF, PLINIUS, Sowa's, TOVE, UMLS, and WordNet. The framework defines characteristics that can be used to compare these ontologies, divided into the following groups: design process, taxonomy, internal concept structure and relations between concepts, axioms, inference mechanism, applications, and contribution. The authors' goal was to review current ontology schema design techniques by manually inspecting ontologies and classifying them into different design categories. In [18], the authors introduce an environment for ontology development called DODDLE-R, which consists of two parts: a pre-processing part that generates a prototype ontology, and a quality improvement part that refines that ontology. The quality improvement part focuses on fixing problems related to concept drift, where the positions of particular concepts change depending on the domain. This approach can be helpful for experts building an ontology from scratch, but it does not serve users who are not design experts and who only want an ontology that fits their needs.

Table 2 below summarizes the approaches discussed above. It lists the target audience ('D' = developers, 'E' = end users), whether the approach is automatic or manual, whether it considers the schema only ('S') or both the schema and the KB ('S+KB'), and whether the approach lets the user specify the ontologies to analyze or crawls for them.

Table 2. Summary of current ontology quality management approaches

Approach     Target   Auto/Man   S/KB    Ontology
OntoMetric   D        Manual     S       Input
OntoDev      D        Manual     S+KB    Input
Swoop        D        Auto       S       Input
Charac       D+E      Auto       S       Crawled
Survey       D        Manual     S       Input
Doddle-R     D        Manual     S       Input
OntoQA       D+E      Auto       S+KB    Input

8. Conclusions and Future Work

In this paper, we show how OntoQA can be used to describe ontologies in a way that enables the user or ontology developer to determine the quality of an ontology. We envision future releases of OntoQA allowing the calculation of domain-dependent metrics that make use of standard ontologies in a certain domain. We are also planning to make OntoQA a web-enabled tool where users can enter the paths of their ontology files and use our application to measure the quality of the ontology.

Acknowledgments

This work is funded by NSF-ITR-IDM Award #0325464 titled 'SemDIS: Discovering Complex Relationships in the Semantic Web' and NSF-ITR-IDM Award #0219649 titled 'Semantic Association Identification Knowledge Discovery for National Security Applications.'

References

[1] Aleman-Meza, B., Halaschek, C., Sheth, A., Arpinar, I. B., Sannapareddy, G. SWETO: Large-Scale Semantic Web Test-bed. Proceedings of the 16th International Conference on Software Engineering & Knowledge Engineering (SEKE2004): Workshop on Ontology in Action, Banff, Canada, June 21-24, 2004, pp. 490-493.
[2] Anyanwu, K. and Sheth, A. ρ-Queries: Enabling Querying for Semantic Associations on the Semantic Web. Proceedings of the 12th Intl. WWW Conference, Budapest, Hungary, 2003.
[3] Berners-Lee, T., Hendler, J. and Lassila, O. The Semantic Web: A new form of Web content that is meaningful to computers will unleash a revolution of new possibilities. Scientific American, May 2001.
[4] Broekstra, J., Kampman, A. and van Harmelen, F. Sesame: A Generic Architecture for Storing and Querying RDF and RDF Schema. Proceedings of the 1st International Semantic Web Conference (ISWC 2002), June 9-12, 2002, Sardinia, Italy.
[5] Guarino, N. Formal Ontology and Information Systems. Proceedings of FOIS'98, Trento, Italy, June 6-8, 1998. Amsterdam, IOS Press, pp. 3-15.
[6] Guha, R. and McCool, R. TAP: A Semantic Web Test-bed. Journal of Web Semantics, 1(1):81-87, 2003.
[7] Lozano-Tello, A. and Gomez-Perez, A. ONTOMETRIC: A Method to Choose the Appropriate Ontology. Journal of Database Management, 15:1-18, 2004.
[8] Maedche, A. and Zacharias, V. Clustering Ontology-based Metadata in the Semantic Web. Proceedings of the 6th European Conference, PKDD 2002, Helsinki, Finland, August 19-23, 2002.
[9] McBride, B. Jena: Implementing the RDF Model and Syntax Specification. Proceedings of the Second International Workshop on the Semantic Web (SemWeb 2001), May 2001.
[10] Noy, N. and McGuinness, D. Ontology Development 101: A Guide to Creating Your First Ontology. Stanford Knowledge Systems Laboratory Technical Report KSL-01-05 and Stanford Medical Informatics Technical Report SMI-2001-0880, March 2001.
[11] Noy, N. and Hafner, C. The State of the Art in Ontology Design: A Survey and Comparative Review. AI Magazine, 18(3):53-74, 1997.
[12] OWL: Web Ontology Language Overview. W3C Recommendation, February 2004. (http://www.w3.org/TR/owl-features/)

[13] Parsia, B., Sirin, E. and Kalyanpur, A. Debugging OWL Ontologies. Proceedings of WWW 2005, May 10-14, 2005, Chiba, Japan.
[14] Patel, C., Supekar, K., Lee, Y. and Park, E. K. OntoKhoj: A Semantic Web Portal for Ontology Searching, Ranking and Classification. Proceedings of the 5th International Workshop on Web Information and Data Management (WIDM'03), November 7-8, 2003, New Orleans, Louisiana, USA.
[15] RDF Schema. W3C Recommendation, February 2004. (http://www.w3.org/TR/rdf-schema/)
[16] RDF: Resource Description Framework. W3C Recommendation, February 2004. (http://www.w3.org/TR/2004/REC-rdf-primer-20040210)
[17] Sheth, A. and Avant, D. Semantic Visualization: Interfaces for Exploring and Exploiting Ontology, Knowledgebase, Heterogeneous Content and Complex Relationships. NASA Virtual Iron Bird Workshop, March 31 and April 2, 2004, California.
[18] Sugiura, N., Shigeta, Y., Fukuta, N., Izumi, N. and Yamaguchi, T. Towards On-the-fly Ontology Construction: Focusing on Ontology Quality Improvement. Proceedings of the First European Semantic Web Symposium (ESWS 2004), Heraklion, Crete, Greece, May 10-12, 2004.
[19] Supekar, K., Patel, C. and Lee, Y. Characterizing Quality of Knowledge on the Semantic Web. Proceedings of the AAAI Florida AI Research Symposium (FLAIRS-2004), May 17-19, 2004, Miami Beach, Florida.
[20] Sheth, A. et al. Semantic Web Technology in Support of Bioinformatics for Glycan Expression. W3C Workshop on Semantic Web for Life Sciences, October 27-28, 2004, Cambridge, Massachusetts, USA.
[21] Sheth, A., Bertram, C., Avant, D., Hammond, B., Kochut, K., Warke, Y. Managing Semantic Content for the Web. IEEE Internet Computing, 6(4), July 2002.
[22] Crow, L. and Shadbolt, N. Extracting Focused Knowledge From The Semantic Web. International Journal of Human-Computer Studies, 54(1):155-184, January 2001.


A Heuristic Query Optimization for Distributed Inference on Life-Scientific Ontologies

Takahiro Kosaka, Susumu Date, Hideo Matsuda
Graduate School of Information Science and Technology, Osaka University
1-5 Yamadaoka, Suita, Osaka 565-0871, Japan
{tak-k, sdate, matsuda}@ist.osaka-u.ac.jp

Shinji Shimojo
Cybermedia Center, Osaka University
5-1 Mihogaoka, Ibaraki, Osaka 567-0047, Japan

Abstract

This paper describes a method of applying heuristics to optimize queries in distributed inference on life-scientific ontologies. An actual scenario in drug discovery illustrates two requirements for this inference: composition of ontologies by message passing, and global query optimization among computational nodes. In our proposed algorithm, each node involved in the inference optimizes its queries to the other nodes in terms of memory usage, aiming to eliminate combinatorial explosion. This optimization, which is based on greedy heuristics, exploits information on the number of expected answers to a set of queries. A performance study shows that the proposed algorithm reduces the memory usage needed for distributed inference, while also suggesting that further investigation is needed to reduce computation time.

1 Introduction

Recently, ontologies have been gaining wide acceptance as a formalized way to describe knowledge about various domains of the real world. Especially in the field of life sciences, a number of ontologies (e.g., Open Biomedical Ontologies [2]) have been carefully designed and published on the Web. While these ontologies are developed separately by different organizations, they have a strong potential to be integrated into a single large knowledge base for life sciences. Such a knowledge base would enable researchers to derive implicit life-scientific knowledge comprehensively across the domains of these ontologies. One promising application of such a life-scientific knowledge base is drug discovery, which demands multi-domain knowledge about life.


In particular, the authors have been focusing on a scenario of ontology-based in silico screening [17]. In silico screening is a method of searching a compound library for drug candidates that bind to disease-related biomolecules, in particular proteins. Schuffenhauer et al. [17] propose a screening method based on an interaction ontology, which relates a compound ontology to a protein ontology. Given the tendency of homologous proteins to bind structurally similar compounds, this ontology-based in silico screening searches for candidate compounds for a novel target protein through the following three steps (Figure 1).

1. Protein homology search. Search a protein ontology for proteins homologous to the novel target protein.

2. Interaction analysis. Search the interaction ontology for compounds that are known to chemically bind to the above homologous proteins.

3. Compound similarity search. Search a compound ontology for compounds that are structurally similar to those found in step 2. The retrieved compounds are considered candidates that can bind to the novel target.

Through these three steps, this in silico screening derives implicit relationships between proteins and chemical compounds. The three steps can be described as an inference over the compound, interaction, and protein ontologies using the following rule.


(1)



Figure 1. An ontology-based in silico screening [17].
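Read as a query pipeline, the three steps chain into one another: homologs of the target seed the interaction lookup, whose compounds seed the similarity search. A minimal sketch, with hypothetical query interfaces standing in for the three ontologies (not the authors' system), is:

def screen_candidates(target_protein, protein_onto, interaction_onto, compound_onto):
    """Ontology-based in silico screening, following the three steps of Figure 1.
    The three *_onto arguments are hypothetical query interfaces to the protein,
    interaction, and compound ontologies."""
    # Step 1: protein homology search.
    homologs = protein_onto.homologous_proteins(target_protein)
    # Step 2: interaction analysis -- compounds known to bind the homologs.
    known_binders = {c for p in homologs
                       for c in interaction_onto.interacting_compounds(p)}
    # Step 3: compound similarity search.
    return {c2 for c1 in known_binders
               for c2 in compound_onto.similar_compounds(c1)}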

The description of this rule is simple. However, we need to take into account the following two premises in developing a knowledge base system that executes inference with this rule.

1. These ontologies are autonomous and distributed on the Web.

2. These ontologies may be black boxes and not downloadable.

The second premise can be explained, for instance, in a commercial context: it is conceivable that a company providing a chemical ontology does not allow its customers to download the whole ontology, but only grants them access via a simplified querying interface. Consequently, these premises lead to the necessity of a common communication protocol for subsystems provided by the developers of life-scientific ontologies.

Considering the above premises, the need for a communication protocol, and our application scenario, we need to satisfy at least the following two requirements in designing such a protocol.

1. Composition of ontologies by message passing

2. Global query optimization

1.2.1 Composition of ontologies by message passing

Since ontologies cannot always be downloaded, inference on these ontologies needs to be performed by composing ontologies through message passing among computational nodes (subsystems) provided by each ontology developer. This inference cannot be achieved by an existing stand-alone system such as Racer [14]. Moreover, the scheme of this inference is quite different from classical distributed reasoning, in which one large ontology is divided into partitions to improve performance. In our study, we need to realize inference on ontologies that are accessed as black boxes, i.e., whose structure is not given. In this paper, we call this message-passing-based inference on black-boxed ontologies distributed inference.

1.2.2 Global query optimization

In addition to the above ontology composition by message passing, query optimization needs to be addressed globally. That is, the set of aggregated nodes as a whole must optimize the computational time and space needed for the inference. Specifically, the authors focus on the combinatorial explosion of individuals expressed in the ontologies. Consider the following situation in our application scenario.

Suppose the compound ontology contains n_c compounds, the interaction ontology contains n_i interactions, and the protein ontology contains n_p proteins. In this situation, assuming that no node knows how many individuals are included in the other nodes' ontologies (since these ontologies are black-boxed), a node that performs inference with rule (1) needs to compute conjunctions over n_c × n_i × n_p combinations in the worst case (Figure 2). In fact, if the node evaluates rule (1) from left to right (as, e.g., Jena [1] does), it runs into exactly this worst case. Even though current reasoners are highly optimized, this problem cannot be addressed by their internal optimization alone.
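The general idea behind such a greedy, cardinality-driven ordering can be sketched as follows; the subgoal names and the ask_count/ask_answers interfaces are illustrative assumptions, not the authors' protocol or exact algorithm:

from collections import namedtuple

# A subgoal such as interacts(C1, P2): a predicate name plus argument variables.
Subgoal = namedtuple("Subgoal", ["predicate", "args"])

def greedy_evaluate(subgoals, nodes):
    """Evaluate a conjunction of subgoals, always expanding the one with the
    fewest expected answers first.  `nodes` maps a predicate name to a remote
    node offering ask_count(goal, bindings) and ask_answers(goal, binding)
    (hypothetical interfaces)."""
    bindings = [dict()]          # partial variable bindings accumulated so far
    remaining = list(subgoals)
    while remaining and bindings:
        # Estimated answer counts reported by the responsible nodes.
        costs = {sg: nodes[sg.predicate].ask_count(sg, bindings) for sg in remaining}
        goal = min(costs, key=costs.get)
        remaining.remove(goal)
        # Join: extend every partial binding with the answers for this subgoal.
        bindings = [dict(b, **answer)
                    for b in bindings
                    for answer in nodes[goal.predicate].ask_answers(goal, b)]
    return bindings

# Example: the three subgoals of the screening rule, with illustrative names.
rule_body = [Subgoal("homologous", ("P1", "P2")),
             Subgoal("interacts",  ("C1", "P2")),
             Subgoal("similar_to", ("C1", "C2"))]
# answers = greedy_evaluate(rule_body, nodes)   # `nodes` supplied by each ontology provider

Evaluating the cheapest subgoal first keeps the intermediate binding sets small, which is what bounds memory usage in the distributed setting.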

In order to discuss the composition of ontologies by message passing and the global query optimization, it is first necessary to give a more concrete definition of distributed inference. For this definition, the authors build on Adjiman et al.'s definition of distributed inference in propositional logic [7]. Our application scenario shows that propositional logic is not sufficient for retrieving individuals such as compounds. In this paper, the authors therefore extend Adjiman et al.'s definition from propositional logic to Horn logic. With respect to query optimization, this study is conducted under the assumption that there is no centralized control over these nodes. Under this assumption, approaches to global query optimization are limited to


screening rule

interacts(Compound1,Protein2)