An Investigation into the Feasibility of the Semantic Web

Zhengxiang Pan, Abir Qasem, and Jeff Heflin
Lehigh University
Technical Report LU-CSE-06-025

Copyright © 2006, American Association for Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

This report is an expanded version of a paper in the AAAI-2006 proceedings. In it, we investigate the challenges that must be addressed for the Semantic Web to become a feasible enterprise, focusing specifically on its query answering capability. We put forward that the two key challenges we face are heterogeneity and scalability. We propose a flexible and decentralized framework for addressing the heterogeneity problem, and demonstrate that sufficient reasoning is possible over a large dataset by taking advantage of database technologies and making some tradeoff decisions. As a proof of concept, we collect a significant portion of the available Semantic Web data, use our framework to resolve some of its heterogeneity, and reason over the data as one big knowledge base. In addition to demonstrating the feasibility of a "real" Semantic Web, our experiments provide some interesting insights into how it is evolving and the types of queries that can be answered.

Introduction

The Semantic Web envisions a web of ontologies and associated data, which will augment the present Web by adding formal semantics to its documents. By transforming the Web from a collection of documents into a collection of semantically rich data sources, the Semantic Web promises the unprecedented benefit of a global knowledge infrastructure. In addition, the idea of using knowledge representation in a Web-friendly manner makes the emerging Semantic Web an ideal test bed for AI applications. However, there are three key questions that have to be addressed before the Semantic Web becomes a feasible enterprise:
1. Who will annotate the data?
2. Without any centralized control, how will all this data be connected to one another?
3. Will the existing AI techniques be sufficient to process this huge amount of data?
The first question has received significant attention in recent years. The proliferation of tools and services that can automatically generate Semantic Web data, and advances in ontology learning techniques (Maedche & Staab 2001), are indicators that we are making good progress towards overcoming the annotation problem of the Semantic Web.

However, the latter two questions have, in our opinion, not received the attention they deserve. The second question stems from the inherent autonomous nature of the Web. While this autonomy definitely fuels the Web's phenomenal growth, it also increases the heterogeneity of its information sources, and this heterogeneity is equally likely to occur in the Semantic Web. Without a simple (hence easily adaptable) mechanism to address this heterogeneity, it will be difficult to realize the full potential of the Semantic Web. The third question revolves around the scalability of AI systems. At the time of writing this paper, the number of Resource Description Framework (RDF) documents cached by Google was about 2.8 million. The Semantic Web search engine Swoogle (Ding et al. 2005) currently boasts an index of 850K Semantic Web documents and has more than doubled its index in less than a year. Even in its infancy, the Semantic Web's data (45 million triples as per Swoogle's index of November 2005) is too large for typical reasoning systems, most of which implement the reasoning process in main memory. A size of more than 45M triples places the existing Semantic Web well beyond the reach of many advanced AI systems.

Given that the Semantic Web is starting to reach critical mass, we believe the time is right to begin exploring the last two questions. In this paper we attempt to reason over this heterogeneous and massive Semantic Web, focusing specifically on its query answering capability. We should note that there are several other possible applications of the Semantic Web; in recent years it has been touted as an infrastructure for automatic web service composition and for autonomous web agents that perform transactions on behalf of a user. However, we believe answering a query by means of logical reasoning lies at the core of the original Semantic Web vision. Furthermore, query answering complements the other applications; for example, an agent working for a user will perform better if it has access to richer query answering facilities. To the best of our knowledge this is the first attempt to reason with a substantial amount of authentic Semantic Web data. Specifically, we make three contributions.

First, we provide a decentralized framework to integrate data sources that use different vocabularies. Second, we propose a mechanism to realize this framework. Finally, we describe a system that uses this mechanism to integrate existing Semantic Web data, and demonstrate that our system is capable of performing scalable reasoning on the real Semantic Web.

The paper is organized as follows. First, we discuss the heterogeneity problem on the Semantic Web and introduce our framework for addressing it. Then we describe a mechanism that allows us to work within this framework. After that we describe our Semantic Web knowledge base system. We then describe our experience in working with the real Semantic Web: specifically, the challenges faced in loading, aligning and querying this heterogeneous and massive data set, and how we addressed some of them. We finally discuss related work, draw some conclusions, and suggest directions for future research.

Heterogeneity and the Semantic Web

An initial inspection of the data revealed that many terms with different URIs¹ have identical local names. In fact, this type of synonymy amounts to about 50% of the terms in our data. Among those terms, each local name is defined on average by 4.2 different ontologies. Our initial investigation supports our hypothesis that the Semantic Web is, and will remain, a heterogeneous medium, although a more thorough analysis is still needed. Another trend in the Semantic Web is the proliferation of data that commits to simple ontologies like FOAF (Brickley & Miller 2006). FOAF data currently makes up the bulk of Semantic Web documents. The simplicity of the ontology facilitates automatic generation of FOAF data and is the main reason for this proliferation. Although a large collection of data with relatively simple semantics can still support some very interesting applications (Mika 2005; Halaschek-Wiener et al. 2005), much more can be achieved if we can somehow add a layer of semantics on top of the FOAF ontology and then exploit the vast FOAF data with more advanced reasoning.

The trends described above require us to establish alignments between terms of different ontologies. Our approach is to use axioms of OWL ontologies to describe a map for a pair of ontologies. The axioms relate concepts from one ontology to concepts of the other. An application can consult this map to retrieve information from sources that may otherwise have appeared to provide differing information. In our framework, a map is an ontology that imports the two ontologies whose terms it aligns. Obviously, we will not have mappings between all pairs of ontologies. However, it should be possible to compose a map from existing maps by traversing a "semantic connection" of maps. In this sense we now have a web of ontologies as opposed to a web of documents. Figure 1 depicts one such scenario: the ontologies are labeled O plus a number, the data sources S plus a number, and a map between two ontologies M plus the numbers of the mapped ontologies. We depict the "annotated with" relationship between an ontology and a data source using solid arrows, and import relationships between two ontologies using dotted arrows.

¹ A URI consists of a namespace and a local name.

Figure 1: Maps in the Semantic Web

There is a direct map M12 between O1 and O2. Therefore, for a query asked in terms of O1, we can access S2 and S3 in addition to S1. There is no direct map between O1 and O3, but by combining M12 and M23 we can retrieve answers from S1, S2, S3, S4 and S5. Furthermore, there is no data source that is explicitly annotated with O5, yet we may still be able to provide some answers to a query posed in terms of O5. We do this by again exploiting the extension relationship between ontologies: since O1 extends O5, and M12 extends O1, M12 implicitly extends O5.

In the next section we describe a mechanism for realizing this framework. We use ontology perspectives (Heflin & Pan 2004), a formal way to allow the same set of resources to be viewed from different contexts, using different assumptions and background information. Before ending this section, however, we would like to note several points about our mapping framework:
1. There is a well-established body of research in the area of automated ontology alignment. This is not our focus. Instead, we investigate the application of these alignments to provide an integrated view of the Semantic Web data.
2. The maps in our framework are basically OWL ontologies that express alignments using OWL axioms. Anyone can create a mapping ontology between two or more ontologies and publish it for use by all. Users can then select a mapping (or have one selected for them) to use in order to integrate resources. A map can have any degree of alignment (all terms, some terms, or just one term). In the true spirit of the Web, we get incremental scalability: the more maps we have, the better the integration.
We believe our framework addresses the need to integrate resources in a pragmatic way on the Semantic Web, and that it is an important first step towards achieving the promised unprecedented benefit of "total" information access.
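To make the composition of maps concrete, consider a hedged illustration (the class names are ours, not taken from any actual ontology). Suppose the maps contain the axioms

$$M_{12}:\ O_1{:}Author \equiv O_2{:}Writer \qquad M_{23}:\ O_2{:}Writer \sqsubseteq O_3{:}Person$$

A perspective that includes both maps entails O1:Author ⊑ O3:Person, even though no map between O1 and O3 was ever published; a query about persons can therefore draw answers from sources annotated with O1.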

Ontology Perspectives

Heflin (2001) has observed that there may not be a universal model of all the resources and ontologies on the Web; in fact, it is extremely unlikely that one could even exist. Instead, we must allow for different viewpoints and contexts, which are supported by different ontologies. He defines perspectives, which allow the same set of resources to be viewed from different contexts, using different assumptions and background information. A model-theoretic description of perspectives is presented in (Heflin & Pan 2004). In this section, we set aside the versioning issues in that paper and show how these simplified perspectives work in our framework.

Each perspective will be based on an ontology, hereafter called the basis ontology or base of the perspective. By providing a set of terms and a standard set of axioms, an ontology provides a shared context; resources that commit to the same ontology have implicitly agreed to share that context. When it makes sense, we also want to maximize integration by including resources that commit to different ontologies.

We now provide formal definitions to describe our model of the Semantic Web. We need structured representations of the data associated with information resources, and these resources must be able to commit to ontologies which provide formal definitions of terms. We describe this by defining ontologies and resources using a logical language and by providing a model-theoretic semantics for these structures. These definitions improve upon those given by Heflin (2001).

Let D be the domain of discourse, i.e., the set of objects that will be described. Let R be a subset of D that is the set of information resources, such as web pages, databases, and sensors. In order to maintain the generality of this theory, we use first-order logic as the representation language. Due to the expressivity of FOL, the theory still applies to the languages commonly used for Semantic Web systems, such as description logics, as well as proposed Horn logic extensions. We assume that we have a first-order language LS with a set of non-logical symbols S. The predicate symbols of S are SP ⊂ S, the variable symbols are SX ⊂ S, and the constant symbols are SC ⊂ S. For simplicity, we do not discuss function symbols, since an n-ary function symbol can be represented by an (n+1)-ary predicate. The well-formed formulas of LS are defined in the usual recursive way. We use L(V) to refer to the infinite set of well-formed first-order logic formulas that can be constructed using only the non-logical symbols V. We define interpretation and satisfaction in the standard way. Now, we define the concept of a semantic web space, which is a collection of ontologies and information resources.

Definition 1 A semantic web space W is a two-tuple ⟨O, R⟩, where O is a set of ontologies and R is a set of resources. We assume that W is based on a logical language LS.

Definition 2 Given a logical language LS, an ontology O in O is a three-tuple ⟨V, A, E⟩, where
1. V ⊆ S (the vocabulary V is a subset of the non-logical symbols of LS)
2. V ⊃ SX and V ⊃ SC (V contains all of the variable and constant symbols of LS)
3. A ⊆ L(S) (the axioms A are a subset of the well-formed formulas of LS)
4. E ⊂ O is the set of ontologies extended by O.

As a result of this definition, an ontology defines a logical language that is a subset of the language LS, and defines a core set of axioms for this language. An ontology can extend another, which means that it adds new vocabulary and/or axioms. For convenience, we define the concept of an ancestor ontology. An ancestor of an ontology is an ontology extended either directly or indirectly by it.² If O2 is an ancestor of O1, we write O2 ∈ anc(O1). The formal definition of an ancestor is:

Definition 3 (Ancestor function) Given two ontologies O1 = ⟨V1, A1, E1⟩ and O2 = ⟨V2, A2, E2⟩, O2 ∈ anc(O1) iff O2 ∈ E1 or there exists an Oi = ⟨Vi, Ai, Ei⟩ such that Oi ∈ E1 and O2 ∈ anc(Oi).

We will now provide meaning for ontologies by defining models of ontologies. Recall that in logic, a model of a theory is an interpretation that satisfies every formula in the theory. Thus, we must first determine the structure of our interpretations. One possibility is to use interpretations of LV, the language formed using only the non-logical symbols in V. However, this would limit our ability to compare interpretations of different ontologies, so we instead use interpretations of LS. As such, each interpretation contains mappings for any predicate symbol in any ontology, and thus a single interpretation could be a model of two ontologies with distinct vocabularies. For ontology extension to have its intuitive meaning, all models of an ontology should also be models of every ontology extended by it.

Definition 4 Let O = ⟨V, A, E⟩ be an ontology and I be an interpretation of LS. Then I is a model of O iff I satisfies every formula in A and I is a model of every ontology in E.

Thus an ontology attempts to describe a set of possibilities by using axioms to limit its models. Some subset of these models are those intended by the ontology, and are called the intended models of the ontology. Note that unlike a first-order logic theory, an ontology can have many intended models because it can be used to describe many different states of affairs (Guarino 1998). The axioms of an ontology limit its models by restricting the relations that directly or indirectly correspond to predicates in its vocabulary, while allowing models that provide for any possibility for predicate symbols in other domains.

Definition 5 (Knowledge Function) Let K : R → P(L(S)) be a function that maps each information resource into a set of well-formed formulas.

² Extension is sometimes referred to as inclusion or importing. The semantics of our usage are clarified in Definition 4.

We call K the knowledge function because it provides a set of formulas for a resource. For example, these could be the first-order logic expressions that correspond to the RDF syntax of the resource. Now we need to associate an ontology with each resource.

Definition 6 (Commitment Function) Let R be the set of information resources and O be the set of ontologies. The commitment function C : R → O maps resources to ontologies.

We call this the commitment function because it returns the ontology that a particular resource commits to. When a resource commits to an ontology, it agrees to the meanings ascribed to the symbols by that ontology. We now wish to define the semantics of a resource. When a resource commits to an ontology, it has agreed to the terminology and definitions of the ontology. Thus every interpretation that is a model of a resource must also be a model of the ontology for that resource.

Definition 7 Let r be a resource where C(r) = O and I be an interpretation of LS. I is a model of r iff I is a model of O and I satisfies every formula in K(r).

Since OWL is essentially a description logic, all OWL axioms can be expressed in LS. Unfortunately, OWL does not distinguish between resources and ontologies; instead all OWL documents are ontologies.³ We reconcile this situation by introducing a rule: any OWL document that contains a description of a class or a property is an ontology; otherwise, the document is a resource. Having made this distinction, some language constructs can be associated with the ontology definition (Definition 2). From an OWL DL ontology, we can construct an ontology O = ⟨V, A, E⟩:
1. for each symbol s in rdf:ID, rdf:about and rdf:resource, s ∈ V
2. for each axiom a in class axioms and property axioms, a ∈ A
3. for each Oi in a triple ⟨O owl:imports Oi⟩, Oi ∈ E
Note that if the subject of owl:imports is a resource document in our definition, then ⟨r owl:imports O⟩ → C(r) = O. In the case that a resource document imports multiple ontologies, C(r) = Ounion, where Ounion is a virtual ontology Ounion = ⟨∅, ∅, {Oi | ⟨r owl:imports Oi⟩}⟩.

RDF and RDFS have a model-theoretic semantics (Hayes 2003), but do not have any explicit semantics for inter-document links. One possible way to assign such a semantics is to treat each RDF schema as an ontology and each reference to a namespace as an implicit commitment or extension relationship. We can now define an ontology perspective model of a semantic web space. In essence, the following definitions define the semantics of the document links we have discussed by determining the logical consequences of a semantic web space based on the extension and commitment relations.

³ This decision was not endorsed by all members of the working group that created OWL.
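A small worked illustration of Definitions 5 through 7 (the resource and the individual are hypothetical):

$$K(r) = \{Person(alice),\ name(alice, \text{"Alice"})\}, \qquad C(r) = O_{foaf}$$

By Definition 7, an interpretation I is a model of r iff I is a model of O_foaf (and hence, by Definition 4, of every ontology O_foaf extends) and I satisfies both formulas in K(r).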

Definition 8 (Ontology Perspective Model) An interpretation I is an ontology perspective model of a semantic web space W = ⟨O, R⟩ based on O = ⟨V, A, E⟩ (written I ⊨O W) iff:
1. I is a model of O
2. for each r ∈ R such that C(r) = O or C(r) ∈ anc(O), I is a model of r.

Note that the first part of the definition is needed because there may be no resources in R that commit to O, and without this condition O would not have any impact on the models of the perspective. We now turn our attention to how one can reason with ontology perspectives.

Definition 9 (Ontology Perspective Entailment) Let W = ⟨O, R⟩ be a semantic web space and O be an ontology. A formula φ is a logical consequence of the ontology perspective of W based on O (written W ⊨O φ) iff for every interpretation I such that I ⊨O W, I ⊨ φ.

It would also be useful to define a first-order logic theory that is equivalent to the definition above.

Definition 10 (Ontology Perspective Theory) Let W = ⟨O, R⟩ be a semantic web space where O = {O1, O2, ..., On} and Oi = ⟨Vi, Ai, Ei⟩. An ontology perspective theory for W based on Oi is:

$$\Phi = A_i \;\cup \bigcup_{\{j \,\mid\, O_j \in anc(O_i)\}} A_j \;\cup \bigcup_{\{r \in R \,\mid\, C(r) = O_i \,\vee\, C(r) \in anc(O_i)\}} K(r)$$

These theories describe a perspective based on a selected ontology. Each theory contains the axioms of its basis ontology, the axioms of its ancestors, and the assertions of all resources that commit to the basis ontology or one of its ancestors.

Theorem 1 Let W = ⟨O, R⟩ be a semantic web space, O = ⟨V, A, E⟩ be an ontology, and Φ be an ontology perspective theory for W based on O. Then Φ ⊨ φ iff W ⊨O φ.

PROOF. (Sketch) We can show that the sets of models are equivalent. The models of Definition 8, part 1, are exactly the models of $A_i \cup \bigcup_{\{j \mid O_j \in anc(O_i)\}} A_j$ (from Definition 4). The union of these axioms corresponds to the conjunction of the conditions, and when simplified is equivalent to the theory specified in Definition 10.

The implication of Definition 10 is that we do not need special-purpose reasoners to perform ontology perspective reasoning. Since the entailments of an ontology perspective theory are equivalent to ontology perspective entailments (by Theorem 1), one can create the corresponding FOL theory and use an FOL theorem prover to reason about ontology perspective entailments. Furthermore, if we restrict the logic used in defining the resources and ontologies, then we can use the more efficient reasoners that correspond to these restricted logics.
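To make Definitions 8 through 10 concrete, consider a tiny hypothetical space:

$$\mathbf{O} = \{O_1, O_2\}, \quad E_1 = \{O_2\}, \quad E_2 = \emptyset, \quad C(r_1) = O_1, \quad C(r_2) = O_2$$

The perspective theory based on O1 is Φ1 = A1 ∪ A2 ∪ K(r1) ∪ K(r2), since O2 ∈ anc(O1). The perspective theory based on O2, by contrast, is only Φ2 = A2 ∪ K(r2): r1 commits to a descendant of O2 and is therefore excluded. The same space can thus yield different answers depending on the perspective chosen.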

The Semantic Web can be used to locate documents for people or to answer specific questions based on the content of the Web. These uses represent the document retrieval and knowledge base views of the Web, respectively. In order to resolve a number of problems faced by the Semantic Web, we have extensively discussed means of subdividing it. Theoretically, each of these perspectives represents a single model of the world, and could be considered a knowledge base. Thus, the answer to a Semantic Web query must be relative to a specific perspective.

Definition 11 A Semantic Web query is a pair ⟨Oi, λ⟩ where Oi is the base ontology of the perspective and λ is a formula with existentially quantified variables. An answer to the query ⟨Oi, λ⟩ is θλ iff W ⊨Oi θλ, where θ is a substitution for the variables in λ.

In order to relate our framework to the perspectives, we now define the minimum mapping ontology between Oi and Oj as Om, where Om extends Oi and Oj and contains the axioms that map their vocabularies. Let us take an example to see how our definitions apply to real ontologies. Imagine we have two simple ontologies: O1 defines the concept Car and O2 defines the concept Automobile. Two data sources commit to them respectively: r1 states {ezz3290 ∈ O1:Car} and r2 states {dfg2134 ∈ O2:Automobile}. A map between O1 and O2 would be an ontology Om12 that imports O1 and O2 and has the axiom {O1:Car ≡ O2:Automobile}. Suppose we pose the query formula Car(x). If the query's perspective is based on either O1 or O2, there would be one or zero answers. However, by choosing Om12 as the perspective, the query retrieves both instances.

We argue that our perspectives have at least two advantages over traditional KR approaches. First, the occurrence of inconsistency is reduced compared to a global view, since only a relevant subset of the Semantic Web is involved in processing a query. Even if two ontologies have conflicting axioms, inconsistency need not arise in most perspectives, only in those based on their common descendants. Second, the integration of information resources is more flexible compared to local views, since any two data sources can be included in the same perspective as long as the ontologies they commit to are both extended by a third ontology. This third ontology is likely to be the mapping ontology that serves as the perspective.
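Returning to the Car/Automobile example, the answers under Definition 11 work out as follows (a sketch using the names above):

$$\langle O_{m12}, Car(x)\rangle: \quad \theta_1 = \{x \mapsto ezz3290\}, \quad \theta_2 = \{x \mapsto dfg2134\}$$

The second answer holds because the perspective theory based on Om12 contains both K(r2) = {dfg2134 ∈ O2:Automobile} and the mapping axiom, so W ⊨Om12 Car(dfg2134).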

DLDB: A Semantic Web Query Answering Engine that Supports Perspectives

In this section we describe a system that implements our framework in a scalable way. First we describe the overall architecture; we then describe the detailed design of how we represent OWL information and support perspectives in a relational database system.

Architecture

DLDB (Pan & Heflin 2003) is a knowledge base system that extends a relational database management system with additional capabilities for partial OWL reasoning. As shown in Figure 2, the components of DLDB are loosely coupled. The DLDB core consists of a Load API and a Query API implemented in Java. Any DIG-compliant (Bechhofer, Moller, & Crowther 2003) DL reasoner and any SQL-compliant RDBMS with a JDBC driver can be plugged into DLDB. This flexible architecture maximizes customizability and allows the reasoner and the RDBMS to run as services, or even as clusters on multiple machines.

Figure 2: The Architecture of DLDB

Design

Note that we have extended DLDB to support perspectives; the following description is specific to this extended version of DLDB. The basic table design of DLDB adopts an approach similar to the "Specific Representation" in (Alexaki et al. 2001) or the "Schema-aware" representation in (Theoharis, Christophides, & Karvounarakis 2005). In this model, creating tables corresponds to the definition of classes or properties in an ontology, and the class or property IDs serve as the table names. Each class has a table that records information about its instances. Since each row in a class table corresponds to an instance of the class, we only need an 'ID' field to record the IDs of the individuals that have an rdf:type relation with this particular class. The rest of the data about an instance is recorded using a table-per-property (or decompositional) approach. Each instance of a property must have a subject and an object, which together identify the instance itself. Thus 'subject' and 'object' fields are defined in each property table. Normally, the 'subject' and 'object' fields are foreign keys referencing the 'ID' field of the class tables that are the domain or range of the property, respectively (see Figure 3). However, if the property's range is a declared datatype, then the object field is of the corresponding datatype. This approach makes it easy to answer queries with data constraints, such as finding individuals whose ages are under 21.

Figure 3: Tables in DLDB
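A minimal SQL sketch of this layout (the Student and TakesCourse names follow the paper's examples; the Age property and the column types are our own illustrative assumptions):

CREATE TABLE Student (
    ID VARCHAR(255)        -- URI of an individual explicitly typed as Student
);
CREATE TABLE TakesCourse (
    Subject VARCHAR(255),  -- references Student.ID (the property's domain)
    Object  VARCHAR(255)   -- references Course.ID (the property's range)
);
CREATE TABLE Age (
    Subject VARCHAR(255),
    Object  INTEGER        -- datatype range stored directly; supports e.g. Object < 21
);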

In DLDB, class hierarchy information is stored through views. A view is a form of stored query in a relational database; it creates a logical combination of tables and fields and, from the user's perspective, looks like a "virtual" table. In our design, the view of a class is defined recursively: it is the union of the class's own table and all of its direct subclasses' views. Hence, a class's view contains the instances that are explicitly typed, as well as those that can be inferred.

OWL includes many features from description logics, the most significant being the constraints used in class descriptions. Using these, a DL reasoner can compute class subsumption, i.e., the implicit rdfs:subClassOf relations. DLDB uses a DL reasoner (Horrocks, Sattler, & Tobies 1999) to precompute subsumption, and then uses the inferred class taxonomy to create the class views. Thus, DLDB only consults the DL reasoner once for each new ontology in the knowledge base. Whenever queries are issued concerning the instances of the ontology, the inferred hierarchy information is automatically utilized. As a query answering engine, DLDB's precomputation improves computational efficiency, saving both time and system resources.

Following the definitions of perspectives, every time DLDB loads a new ontology O, it actually sends O along with all the ancestors of O to the reasoner for precomputation. Since different perspectives may have different axioms associated with a class or property, and hence different subsumptions returned by the reasoner, DLDB also creates views for each ancestor's classes and properties, but does not create tables for them. The views are virtual queries, so the number of views has only a slight impact on the size of the database. A query based on a given perspective is then translated into the corresponding database views. However, without separate tables, the instances that commit to different ontologies would be indistinguishable. This is problematic because the perspective model does not include the resources that commit to a descendant ontology. To solve this problem, we add a new field to the class tables: the 'onto' field. Thus, each instance of a particular class is recorded with its ID, its source document's ID, and the ID of the ontology to which the resource commits.

The example tables in Figure 3 would become:

Student(ID, Source, Onto)
TakesCourse(Subject, Object, Source, Onto)

Consider that in ontology O1 the class Student is defined as all the people who take a Course, the class GradStudent is defined as all the people who take a GradCourse, and GradCourse is a subclass of Course. Then, by OWL entailment, we can conclude that GradStudent is a subclass of Student. Thus, if we call the class view creation algorithm after interaction with a DL reasoner, the view of Student will be defined as:

CREATE VIEW O1:Student:viewBy:O1 AS
  SELECT * FROM O1:Student WHERE O1:Student.Onto=O1
  UNION
  SELECT * FROM O1:UndergraduateStudent:viewBy:O1
  UNION
  SELECT * FROM O1:GradStudent:viewBy:O1;

Since the views are defined recursively, indirect subsumptions are automatically taken care of by the database. Note that A owl:equivalentClass B is implemented as A rdfs:subClassOf B plus B rdfs:subClassOf A. For properties, the rdfs:subPropertyOf and owl:equivalentProperty relations are handled in the same way. In addition, owl:inverseOf relations are supported in the property view creation algorithm. Supposing degreeFrom and hasAlumnus are a pair of inverse properties, the view definition of hasAlumnus will be:

CREATE VIEW O1:hasAlumnus:viewBy:O1 AS
  SELECT * FROM O1:hasAlumnus WHERE O1:hasAlumnus.Onto=O1
  UNION
  SELECT O1:degreeFrom:viewBy:O1.Object AS Subject,
         O1:degreeFrom:viewBy:O1.Subject AS Object
  FROM O1:degreeFrom:viewBy:O1;

Currently, DLDB is known to be incomplete with respect to transitive properties and class membership realization through restrictions, but it supports most other OWL DL features, depending on the capability of the underlying DL reasoner. Note that the queries supported by the current DLDB fall under extensional conjunctive queries.
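To illustrate how an extensional conjunctive query is evaluated against this design (a sketch; the actual translation is internal to DLDB), a query asking for all students and the courses they take, posed from the perspective of O1, could be translated into a join over the corresponding views:

SELECT s.ID, t.Object
FROM O1:Student:viewBy:O1 s, O1:TakesCourse:viewBy:O1 t
WHERE t.Subject = s.ID;

Because each view recursively unions in subclass (and sub-property) views, the join returns inferred instances as well as explicitly typed ones.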

Experiments

The broader goal of our experiments is to gather evidence regarding the feasibility of the Semantic Web. We wanted to see if our framework and the DLDB approach to scalable reasoning at least begin to address the last two of the three questions we posed at the beginning of the paper. The results are encouraging: DLDB has scaled well, and the mapping framework with perspectives provided successful integration of disparate data. The specific DLDB configuration we used in loading and querying the Semantic Web is as follows: the DLDB main program runs on a Sun workstation featuring dual 64-bit Opteron CPUs and 2GB of main memory. The backend DBMS is PostgreSQL 8.0 running on the same machine.

The DL reasoner is Racer Pro 1.9 running on another Windows-powered machine. Both machines are connected via 100 Base-T Ethernet.

We used Swoogle's 2005 index as our dataset. Of the 364,179 URLs that we got from Swoogle, we successfully loaded 343,977 Semantic Web documents in 15 days and 15 hours. In total, 41,741 of them were identified as ontologies by DLDB. The RDBMS takes approximately 8 gigabytes of disk space to store the 45M triples. To the best of our knowledge this is the largest load of diverse, real-world Semantic Web data, which once again validates that the DLDB approach scales fairly well.

Since our experiment is a pioneering attempt to treat the Semantic Web as a singular knowledge base and pose queries to it, we naturally did not have access to query patterns or query logs that would allow us to choose "typical" queries. For this experiment we selected queries which showcase the possibilities of an integrated Semantic Web using existing data. Note that limitations in the available data sometimes placed severe constraints on the kinds of queries we could ask.

From the statistics about identical local names in the second section, it is obvious that we need to map a large number of concepts to each other to have a fully integrated Semantic Web. However, the size of the Semantic Web dictates that we use some form of automated ontology mapping tool to address this problem. We note that there are several ongoing research projects in this area (Noy & Musen 2002), but we chose to apply a heuristic, described below, to select our maps. We do this because our goal in this paper is not to present a comprehensive set of maps, but rather to demonstrate the potential for integration in the Semantic Web.

In order to meet this objective, we adopted the following strategy. We perform a breadth analysis, where we try to come up with maps that loosely align a large number of ontologies, and a depth analysis, where we try to come up with maps that tightly align a few ontologies. In each case we examine the effect of integration on the results returned by various queries.

In the breadth analysis we select mapping candidates that have the highest "integratability". We loosely define the integratability of a concept as its potential to connect multiple ontologies. Specifically, in this experiment, we consider the integratability of a concept to be the frequency of its occurrence in different ontologies. We use a combination of manual inspection and semi-automatic filtering to determine concepts with high integratability. The main idea behind this heuristic is that these concepts will give us the most integration with the least amount of effort. We mapped the concepts with the highest frequency of occurrence, shown in the table below. Note that here we only considered concepts that have instance data:

Concept       Frequency of occurrence
Person        10
Country       10
Project       7
City          5
University    4

We note two aspects of the above findings.

First, a more careful analysis must be done to come up with a better integratability measure. For example, we need to identify synonyms or subclasses and superclasses of a concept and identify those in different ontologies; although they may not weigh as much as a direct match, they contribute to determining integratability. Second, we noticed several concepts with an integratability of 3 or below; we do not list them here due to space constraints.

We select Person from the above list in order to investigate the effect of maps from our breadth analysis. The experiment proceeds as follows. We first query the knowledge base from the perspective of each of the 10 ontologies that define the concept Person, asking for all instances of the concept Person. The results vary from 4 to 1,163,628. We then map the Person concept from all the ontologies to the Person concept defined in the FOAF ontology. Issuing the same query from the perspective of this map yields 1,213,246 results, now encompassing all the data sources that commit to these 10 ontologies. Note that a pairwise mapping would have taken 45 mapping axioms to establish this alignment (one per pair of the 10 ontologies, i.e., 10 × 9 / 2), instead of the 9 axioms we used by mapping every other ontology to FOAF. More importantly, due to this network effect of the maps, by contributing just a single map one "automatically" gets the benefit of all the data that is available in the network. Although in our experiment we created this "super map" manually, it is conceivable that this could be automated.

The response time for this query is about a minute. Although this is very high for an interactive system, we note that it is unlikely that a user will ask a query with such low selectivity. In addition, our investigation suggests that this number may be inflated by inefficiency in handling large result sets in the JDBC driver implementation we used.

In addition to developing the breadth analysis based maps, we also performed a depth analysis of some selected ontologies and mapped their concepts and roles in complete detail. The table below lists the prefixes, abbreviations and URLs of the ontologies we use in our in-depth map experiment:

Prefix    Abb.   URL
foaf      Of     http://xmlns.com/foaf/0.1/
aktors    Oa     http://www.aktors.org/ontology/portal
hack      Oh     http://www.hackcraft.net/bookrdf/vocab/0_1/
swrc      Os     http://swrc.ontoware.org/ontology
dc        Od     http://purl.org/dc/elements/1.1/

Our experiment considers maps of 1,500 concepts that span over 1,200 ontologies. As we have described throughout this section, some of these maps were created by us specifically for this experiment; the rest come from ontologies that define their concepts in terms of the ontologies they extend. We give the λ formulas (see Definition 11) of the queries below in FOL:

1. ∃X,Y Of:Person(X) ∧ Of:Document(Y) ∧ Of:maker(Y, X) ∧ Of:surname(X, "Williams")
2. ∃X Od:Document("rdf-sparql-query") ∧ Od:isReferencedBy("rdf-sparql-query", X)
3. ∃X Oa:Activity(X)

In the first query we show how maps can dramatically improve retrieval in the Semantic Web. The query asks for the publications of a certain individual, using the FOAF ontology. Several other ontologies define these concepts, and varying amounts of data commit to each of them. A query posed in terms of the FOAF ontology alone will therefore miss the data that commits to the others. We can, however, relate several ontologies with a mapping ontology and use that perspective to query the integrated view. We have mapped the relevant vocabularies of the following ontologies (expressed in description logic):

foaf:Person ≡ swrc:Person ≡ aktors:Person
swrc:Publication ⊑ foaf:Document
hack:Book ⊑ foaf:Document
aktors:Publication ⊑ foaf:Document
aktors:has-author ⊑ foaf:maker
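Under the DLDB design described earlier, query 1 posed from the perspective of this mapping ontology (call it Om1; the name and the translation are our illustration) would be answered against the viewBy:Om1 views, roughly:

SELECT p.ID AS person, d.ID AS document
FROM foaf:Person:viewBy:Om1 p, foaf:Document:viewBy:Om1 d,
     foaf:maker:viewBy:Om1 m, foaf:surname:viewBy:Om1 sn
WHERE m.Subject = d.ID AND m.Object = p.ID
  AND sn.Subject = p.ID AND sn.Object = 'Williams';

Since the views union in the mapped vocabularies, rows originating from swrc, hack and aktors data appear in the same result set as native FOAF data.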

In the second query, we show that our flexible mapping framework can be used to add semantics to an existing ontology. For example, Dublin Core (Od) has two RDF properties, isReferencedBy and references. It is fairly obvious that they have an inverse relationship; however, RDF does not allow us to express this. We can create a map: dc:references⁻ ≡ dc:isReferencedBy.
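Within DLDB, such a map would be realized by the inverse-property view machinery shown earlier. A sketch (Om2 is our name for the map ontology; following Definition 8 the view also admits data committing to the ancestor Od, though the actual view text DLDB generates may differ):

CREATE VIEW dc:isReferencedBy:viewBy:Om2 AS
  SELECT * FROM dc:isReferencedBy
    WHERE dc:isReferencedBy.Onto=Om2 OR dc:isReferencedBy.Onto=Od
  UNION
  SELECT dc:references:viewBy:Om2.Object AS Subject,
         dc:references:viewBy:Om2.Subject AS Object
  FROM dc:references:viewBy:Om2;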

The third query demonstrates that reasoning across ontologies can discover implicit maps. In addition, unlike the first two queries, this one has low selectivity, since it places no additional constraints on the individuals. In this query, we ask for a list of activities in terms of the Aktors ontology. Suppose our mapping ontology is as follows:

swrc:Project ≡ ∃ swrc:isAbout.swrc:ResearchArea
aktors:Activity ≡ ∃ aktors:addresses-generic-area-of-interest.aktors:Generic-Area-Of-Interest
swrc:ResearchArea ⊑ aktors:Generic-Area-Of-Interest
swrc:isAbout ⊑ aktors:addresses-generic-area-of-interest
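The entailment follows from a standard monotonicity property of existential restrictions:

$$\frac{r \sqsubseteq s \qquad C \sqsubseteq D}{\exists r.C \;\sqsubseteq\; \exists s.D}$$

Instantiating r = swrc:isAbout, s = aktors:addresses-generic-area-of-interest, C = swrc:ResearchArea and D = aktors:Generic-Area-Of-Interest gives ∃ swrc:isAbout.swrc:ResearchArea ⊑ ∃ aktors:addresses-generic-area-of-interest.aktors:Generic-Area-Of-Interest; combining this with the two equivalences yields swrc:Project ⊑ aktors:Activity.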

The reasoner is thus able to infer that swrc:Project is a subclass of aktors:Activity. As a result, our query returns all activities described using the aktors ontology, as well as projects described using the swrc ontology.

The table below summarizes the results of the queries. It shows that by adding the mapping ontologies and using them as query perspectives, we retrieve more results with only a slight increase in query time. Most of the queries took under 1 second, and the longest took about 5 seconds. Considering the size of the knowledge base and the reasoning involved, we believe our approach is fairly scalable given the number of maps used in the experiment.

Query formula   Perspective                # of Results   time (ms)
1               Os                         7              3964
1               map of Of, Oh, Os, Oa      365            5132
2               Od                         0              5
2               map on Od                  7              22
3               Oa                         196            10
3               map of Oa and Os           371            31

Related work

We claim in this paper that ours is the first attempt to reason with the Semantic Web as a whole. Here we use reasoning strictly as a knowledge representation term: by reasoning we denote making inferences according to the rules of the logic used to represent the knowledge in the knowledge base.

We should note that there are several projects that process the Semantic Web in various other ways. For example, Swoogle is the largest index of Semantic Web documents; its index and its statistics have helped us immensely in our work. However, Swoogle's query and retrieval mechanism is basically an IR-based system: it relies on string matching and relevance computation, which fails to exploit the reasoning that can be done over Semantic Web data. Essentially, this makes Swoogle a Web query system as opposed to a Semantic Web query system. Flink (Mika 2005) is a presentation of the scientific work and social connectivity of Semantic Web researchers. This interesting application uses FOAF (which is essentially an RDF vocabulary for social networks) to develop social maps. The Carnot system (Huhns et al. 1993) is one of the earliest works that uses articulation (i.e., mapping) axioms to address heterogeneity in enterprise integration. Carnot depends on commonsense knowledge and is a rather centralized approach, which is not a suitable model for the Web. There are some ongoing efforts to bootstrap the Semantic Web by providing ontologies and reusable knowledge bases; TAP (Guha 2002) is one such effort. TAP provides a fairly comprehensive source of basic information about popular entities, such as music, authors and autos. It does not replace the upper ontologies but complements them; this work can be seen as "grounding" some fragment of those upper ontologies with real-world data. The C-OWL work (Bouquet et al. 2003) proposed that ontologies could be contextualized to represent local models from the view of a local domain. Each ontology is treated as an independent local model; if another ontology's vocabulary needs to be shared, bridge rules are appended to the ontology that extends that vocabulary. Compared to C-OWL, our perspective approach also provides multiple models from different views, but without modifying the current Semantic Web languages.

In the past few years there has been a growing interest in the development of systems that store and process large amounts of Semantic Web data. The general design approach of these systems is similar to ours, in the sense that they all use database systems to gain scalability while supporting as much inference as possible. However, most of these systems are geared towards RDF and RDF(S) data and therefore focus less on OWL reasoning. Sesame (Broekstra, Kampman, & van Harmelen 2002) is currently the most well known Semantic Web query answering system. It is, however, an RDF Schema-based repository and therefore incomplete for many types of OWL queries (Guo, Pan, & Heflin 2005). The PostgreSQL implementation of Sesame has a table design very similar to DLDB's, but uses the object-relational subtable mechanism between tables, instead of views, to construct subsumptions. When new subsumptions are discovered, the overhead of changing subtable relations is very expensive; we therefore argue that our approach is more efficient. The ICS-FORTH RDFSuite (Alexaki et al. 2002) consists of the Validating RDF Parser (VRP), the RDF Schema-Specific Data Base (RSSDB) and a query interpreter. Its database schema is very similar to Sesame's and is also based on an object-relational database model. More recent work at Oracle (Chong et al. 2005) uses an SQL-based approach to develop a query answering system for RDF(S) data. Although their approach is highly scalable (80 million triples), it lacks any inference capability beyond RDF(S) reasoning.

A couple of other systems support OWL reasoning to various degrees. Instance Store (Bechhofer, Horrocks, & Turi 2004) has a basic architecture similar to DLDB's: it combines a description logic reasoner with a database and, like us, focuses on ABox reasoning. However, the current version implements a restricted form of ABox reasoning that can only deal with role-free ABoxes. This translates into Instance Store's inability to reason with the properties of an OWL individual, which in our opinion is a serious limitation. KAON2 is another emerging Semantic Web knowledge base system. It does not implement a tableaux-calculus-based reasoner; rather, reasoning in KAON2 is implemented by a novel algorithm which reduces a SHIQ(D) knowledge base to a disjunctive datalog program (Hustadt, Motik, & Sattler 2004). KAON2 has database backend support that allows one to use a relational database system. OWLIM (Kiryakov, Ognyanov, & Manov 2005) is a Storage and Inference Layer (SAIL) for the Sesame RDF database. OWLIM performs OWL DLP (Grosof et al. 2003) reasoning and exhibits good performance in loading and querying large knowledge bases. It should be noted that none of the systems mentioned above have reported any experiments in processing the open Semantic Web; all of them have been evaluated with restricted knowledge bases expressed in RDF/OWL.

Conclusion and future work

We present an integration framework for dealing with heterogeneity in the Semantic Web. The framework does not require any additional language primitives beyond OWL. We also present a mechanism to implement this framework; this ontology perspective mechanism gives the user the flexibility to choose the type of integration (s)he desires (or believes in). We have demonstrated that it is feasible to perform essential OWL reasoning over very large (45M triples), authentic Semantic Web data. Our system is optimized for working with OWL instances. We put forward that scalability is more critical in processing data sources than in processing ontologies, because data sources will substantially outnumber ontologies on the Semantic Web. Although we believe our work is a first step in the right direction, we have discovered many issues that remain unsolved. First, although our system scales well to the current size of the Semantic Web, it is still unknown whether such techniques will continue to scale as the Semantic Web grows. Second, although perspectives resolve some forms of inconsistency, inconsistency between data sources that commit to the same ontology still results in inconsistent perspectives. These are some of the interesting research questions that we plan to explore, and we invite others to do the same.

Acknowledgment

This material is based upon work supported by the National Science Foundation (NSF) under Grant No. IIS-0346963. We sincerely thank Tim Finin of UMBC for providing us access to Swoogle's index of URLs.

References

Alexaki, S.; Christophides, V.; Karvounarakis, G.; Plexousakis, D.; and Tolle, K. 2001. On storing voluminous RDF descriptions: The case of web portal catalogs. In Proc. of the 4th International Workshop on the Web and Databases.

Alexaki, S.; Athanasis, N.; Christophides, V.; Karvounarakis, G.; Magkanaraki, A.; Plexousakis, D.; and Tolle, K. 2002. The ICS-FORTH RDFSuite: High-level scalable tools for the Semantic Web. In Poster Session of the Eleventh International World Wide Web Conference.

Bechhofer, S.; Horrocks, I.; and Turi, D. 2004. Implementing the Instance Store. Technical Report CSPP-29, University of Manchester, Computer Science preprint.

Bechhofer, S.; Moller, R.; and Crowther, P. 2003. The DIG description logic interface. In Proc. of the International Workshop on Description Logics (DL2003).

Bouquet, P.; Giunchiglia, F.; van Harmelen, F.; Serafini, L.; and Stuckenschmidt, H. 2003. C-OWL: Contextualizing ontologies. In Proc. of the 2nd International Semantic Web Conference (ISWC 2003), LNCS 2870, 164–179. Springer.

Brickley, D., and Miller, L. 2006. FOAF Vocabulary Specification. At: http://xmlns.com/foaf/0.1/.

Broekstra, J.; Kampman, A.; and van Harmelen, F. 2002. Sesame: A generic architecture for storing and querying RDF and RDF Schema. In Proc. of the First International Semantic Web Conference.

Chong, E.; Das, S.; Eadon, G.; and Srinivasan, J. 2005. An efficient SQL-based RDF querying scheme. In Proc. of the 31st VLDB Conference.

Ding, L.; Finin, T.; Joshi, A.; Peng, Y.; Pan, R.; and Reddivari, P. 2005. Search on the Semantic Web. IEEE Computer 38(10):62–69.

Grosof, B.; Horrocks, I.; Volz, R.; and Decker, S. 2003. Description logic programs: Combining logic programs with description logic. In Proceedings of WWW2003. Budapest, Hungary: World Wide Web Consortium.

Guarino, N. 1998. Formal ontology and information systems. In Proceedings of Formal Ontology and Information Systems. Trento, Italy: IOS Press.

Guha, R. 2002. TAP: Towards the Semantic Web. Demo at the World Wide Web 2002 Conference. At: http://tap.stanford.edu/www2002.ppt.

Guo, Y.; Pan, Z.; and Heflin, J. 2005. LUBM: A benchmark for OWL knowledge base systems. Journal of Web Semantics 3(2):158–182.

Halaschek-Wiener, C.; Golbeck, J.; Schain, A.; Grove, M.; Parsia, B.; and Hendler, J. 2005. PhotoStuff: An image annotation tool for the Semantic Web. In Proc. of the 4th International Semantic Web Conference. Poster paper.

Hayes, P. 2003. RDF Semantics. Proposed Recommendation. http://www.w3.org/TR/2003/PR-rdf-schema20031215/.

Heflin, J., and Pan, Z. 2004. A model theoretic semantics for ontology versioning. In Proc. of the 3rd International Semantic Web Conference, 62–76.

Heflin, J. 2001. Towards the Semantic Web: Knowledge Representation in a Dynamic, Distributed Environment. Ph.D. Dissertation, University of Maryland.

Horrocks, I.; Sattler, U.; and Tobies, S. 1999. Practical reasoning for expressive description logics. In Proc. of LPAR'99, 161–180.

Huhns, M.; Jacobs, N.; Ksiezyk, T.; Shen, W. M.; Singh, M.; and Canata, P. 1993. Enterprise information modeling and model integration in Carnot. In Proc. of the International Conference on Intelligent and Cooperative Information Systems, 12–14.

Hustadt, U.; Motik, B.; and Sattler, U. 2004. Reducing SHIQ description logic to disjunctive datalog programs. In Proc. of the 9th International Conference on Knowledge Representation and Reasoning, 152–162.

Kiryakov, A.; Ognyanov, D.; and Manov, D. 2005. OWLIM: A pragmatic semantic repository for OWL. In Proc. of the Int. Workshop on Scalable Semantic Web Knowledge Base Systems, LNCS 3807, 182–192.

Maedche, A., and Staab, S. 2001. Learning ontologies for the Semantic Web. IEEE Intelligent Systems 16(2):72–79.

Mika, P. 2005. Flink: Semantic Web technology for the extraction and analysis of social networks. Journal of Web Semantics 3(2).

Noy, N. F., and Musen, M. A. 2002. Evaluating ontology-mapping tools: Requirements and experience. In Proceedings of the OntoWeb-SIG3 Workshop at the 13th International Conference on Knowledge Engineering and Knowledge Management.

Pan, Z., and Heflin, J. 2003. DLDB: Extending relational databases to support Semantic Web queries. In Proc. of the Workshop on Practical and Scalable Semantic Web Systems, ISWC, 109–113.

Theoharis, Y.; Christophides, V.; and Karvounarakis, G. 2005. Benchmarking database representations of RDF/S stores. In Proc. of the 4th International Semantic Web Conference, 685–701.