Efficient Resource Retrieval from Semantic Knowledge Bases Floriana Esposito Nicola Fanizzi Steffen Staab Claudia d’Amato University of Bari University of Bari University of Koblenz-Landau University of Bari Bari, Italy Bari, Italy Koblenz, Germany Bari, Italy E-mail:
[email protected] E-mail:
[email protected] E-mail:
[email protected] E-mail:
[email protected]
Abstract—Efficient resource retrieval is a crucial issue, particularly in the context of Semantic Web, since forms of reasoning are used for answering requests. Resources are retrieved by performing a match test between each resource description and the query. This approach becomes inefficient with the increase of available resources. We propose a method for improving the retrieval process by constructing a tree index through a new conceptual clustering method for resources expressed in Web ontology languages. In the index, the available resource descriptions are located at the leaf nodes, while inner nodes represent intensional descriptions (generalizations) of their child nodes. The match process is executed by following the tree branches whose nodes satisfy the query. Query answering time may be strongly improved as the steps may be O(log n).
I. M OTIVATION The goal of Semantic Web [5] (SW) is to make Web resources machine readable and interoperable besides of human readable. This is obtained by semantically enriching resources with metadata that refer to shared ontologies. Expressive ontology languages, such as OWL-DL1 grounded on Description Logics [4] (DLs), have been introduced to formally define domain conceptualizations and to describe the resources that are available in these domains. Some examples are, (i), the software domain where resources may be Web services [24], [15], (ii), the business transactions domain where resources are business products [16], or, (iii), the multimedia domain where resources may be multimedia objects. The adoption of such representations for knowledge bases (KBs) enabling a series of advanced tasks performed at a semantic level. Particularly, a semantic search rather than a keyword-based syntactic search can be performed thus ensuring a reduction of the information overload phenomenon. For instance, a user may exploit a reasoning engine to ask questions like, “which are the low cost companies that fly from Bari to Cologne?”. This means to ask for all resources (i.e. (Semantic) Web services) that satisfy the request. They can be semantically retrieved by performing, by a reasoner, concept retrieval [4] which consists in determining all individuals (representing the resources) [24] that are instance of the query concept description. Concept retrieval (independently from optimizations [4]) is 1 www.w3.org/2004/OWL/
performed by executing the instance checking2 [4] for each individual in the ontology. However, as proved in [13], for DL languages with qualified existential restriction (as the one supporting OWL-DL), instance checking suffers from an additional source of complexity which do not show up other inference services such as concept subsumption. To decrease the complexity of the semantic retrieval process, concept subsumption rather than instance checking could be used. For each resource, its most specific concept3 (MSC) [4] can be computed, hence semantic retrieval can be performed by checking, for each msc, if the subsuption relationship between the query concept and the MSC holds (matching process). The MSCs will be computed only once, hence this do not add complexity to the semantic retrieval process. Moreover, considering MSCs also allow to manage the (more seldom) case of resources modeled as concept descriptions [15] rather than instances. However, for a large number of resource specifications, the naive approach of matching the query w.r.t. each specification becomes highly inefficient. In databases, the match between a specified object (e.g. tuple) and a first-order query (e.g. SQL) is made efficient by appropriate tree-based indexing of database objects. Similarly to databases, the core idea of this paper is to exploit a tree-based index for resource specifications in DLs, to improve the efficiency of the resource retrieval task. We focus on resources whose MSCs are described as ALE concept descriptions and that refer to an ALC ontology acting as KB of reference (see Sect. II for details about these logics). Though the intuition of this idea is very straightforward, its realization has proved very difficult in the past for several reasons4 : (a) First, the need for description logics tree indices has not been recognized by the community, because subsumption reasoning allows for computing a subsumption hierarchy of known concepts. However, for large numbers of resources, 2 Instance checking is an inference procedure which assesses if an individual is instance of a concept or not. 3 The most specific concept (MSC) of an individual a, w.r.t. the KB in which it is contained, is the concept description M such that: 1) a is instance of M ; 2) if there exists another concept S of which a is instance of, it is more general (w.r.t. subsuption) than M . The definition of MSC ensures that if an individual is instance of a concept than the MSC will be subsumed by the concept as well. 4 This paper extends [12].
it is necessary to invent concepts such that an efficient tree can be build. Stollberg at al. [33] have illustrated this effect by manual invention of concepts for the purpose of efficient Web service discovery. In this paper we show an automatic construction of inner nodes of a DLs tree index by conceptual clustering. (b) Second, building DLs tree indices on the ground of a clustering algorithm requires the availability of suitable semantic similarity (or distance) measures. Only over the last three years similarity measures for DLs knowledge bases have been discussed in the literature, and, as showed in [11], most existing similarity measures are semantically unsound or are not suitable for measuring the similarity of disjoint, but highly similar concepts (e.g. Man and Woman). Both criteria, semantic soundness and similarities between disjoint concepts, however, are needed for the successful construction of a tree index. (c) Third, once a new node is constructed by clustering sub-nodes, it is necessary to give this node a generalizing specification of it w.r.t. its sub-nodes. Disjunction is too weak to fulfill this task. We have exploited the so called good common subsumer, proposed by Baader et al. [3] that let us generalize from several more specific concepts in a meaningful way. In the remainder of the paper, we explain and evaluate our proposal for the DL-tree index in detail. In Sect. II the basics of DLs will be summarized. In Sect. III the conceptual clustering algorithm is presented. In Sect. IV the retrieval procedure grounded on the DL-tree index is illustrated. In Sect. V, the experimental evaluation of the proposed method is reported. Related works are examined in Sect. VI while further developments are discussed in Sect. VII. II. R EFERENCE R EPRESENTATION Description Logics [4] are the theoretical foundation of OWL language. They comprise a family of languages of different expressive power. Two of these are ALC and ALE logics whose basics are summarized in the following. In DLs, descriptions are inductively defined starting with a set NC of primitive concept names and a set NR of primitive roles. The semantics of the descriptions is defined by an interpretation I = (∆I , ·I ), where ∆I is a non-empty set representing the domain of the interpretation, and ·I is the interpretation function that maps each A ∈ NC to a set AI ⊆ ∆I and each R ∈ NR to RI ⊆ ∆I × ∆I . The top concept > is interpreted as the whole domain ∆I , while the bottom concept ⊥ corresponds to ∅. In the sequel, the canonical interpretation [4] will be considered. It has the set of individuals mentioned in the KB as domain and the identity function as interpretation function. Complex descriptions can be built in ALC using primitive concepts and roles and the following constructors: 1) full negation, denoted ¬C (given any description C), it amounts to ∆I \C I ; 2) concepts conjunction, denoted C1 u C2 , yields an extension C1I ∩ C2I ; 3) concept disjunction, denoted C1 t C2 , yields the union C1I ∪ C2I ; 4) existential restriction, denoted ∃R.C, is interpreted as the set {x ∈ ∆I | ∃y ∈ ∆I ((x, y) ∈ RI ∧ y ∈ C I )}; 5)
value restriction ∀R.C, has the extension {x ∈ ∆I | ∀y ∈ ∆I ((x, y) ∈ RI → y ∈ C I )}. ALE is a sub-language of ALC. Concept disjunction is not allowed in ALE since only the atomic negation can be used. Def.[subsumption] Given two descriptions C and D, C subsumes D, denoted by C w D, iff for every interpretation I it holds that C I ⊇ DI . When C w D and D w C then they are equivalent, denoted with C ≡ D. A knowledge base K = hT , Ai contains a TBox T and an ABox A. T is the set of definitions C ≡ D, meaning C I = DI , where C is the concept name and D is its description. A contains assertions on the world state, e.g. C(a) and R(a, b), meaning that aI ∈ C I and (aI , bI ) ∈ RI . General inclusion axioms D v C are also allowed in the TBoxes as partial concept definitions. An important inference procedure is instance checking, that is deciding whether an individual is instance of a concept or not [4] while instance retrieval (also known as concept retrieval) returns all the individuals that are instances of a given a concept description. Conversely, the Most Specific Concept (MSC) is the most specific description (w.r.t. the subsumption relationship) of which an individual is instance of. For some DL languages, the MSC may only be approximated [4]. A possible approximation has been proposed in [25]. The Least Common Subsumer (LCS) is the reasoning service that returns the most specific concept (w.r.t. the subsumption relationship) subsuming a given set of concept descriptions. Depending on the DL, the LCS may not exist. When it does, it is unique up to equivalence. In ALC and ALE, the LCS always exists [2], [4]. In ALC, the LCS is given by the disjunction of the considered concepts. In ALE (where disjunction is disallowed), the LCS is syntactically computed [2], by taking the common concept names in the descriptions without referring to their definitions in the TBox [2]. As it often results to be very general (equivalent to >), the computation of the LCS w.r.t. the TBox5 [3] can be considered. A brute force algorithm for computing the ALE LCS w.r.t. an ALC TBox has been proposed [3], however, it resulted to be hardly usable in practice, due to its complexity. Hence, an algorithm for computing its approximation has been introduced. The generated approximation is called Good Common Subsumer (GCS) and it is computed by determining the smallest conjunction of (negated) concept names subsuming the conjunction of the top level concept names of each considered concept. The GCS is more specific than the LCS computed by ignoring the TBox, though it need not be the LCS w.r.t. the TBox [3]. To make clear the difference between the ALE LCS computed without considering the TBox and the GCS, the following example is considered. Example 2.1 (LCS and GCS): Let us consider the following ALC TBox T = { Transportation ≡ BookingWS u ∃from.City u u ∃to.City u ∃operatedBy.Vehicle; GroundVehicle ≡ Car t Bus t Truck t Van; GroundTransportation ≡ TransportationWS u 5 The
TBox can be described in a more expressive DL than ALE.
u∃operatedBy.GroundVehicle; BusRide ≡ Transportation u ∃operatedBy.Bus; Flight ≡ Transportation u ∃operatedBy.Plane; EnglishCity v UKCity } and the following ALE concept descriptions: UKFlight ≡ Flight u ∃to.UKCity, and EngBusRide ≡ BusRide u ∃to.EnglishCity. The ALE LCS for these concepts is: LCS(UKFlight, EngBusRide) = LCS(Flight u ∃to.UKCity, BusRide u ∃to.EnglishCity) = > since they do not share any concept names (also in the scope of the existential restrictions). Instead, the GCS (which considers the TBox) is given by: GCS(UKFlight, EngBusRide) = = GCS(Flight u ∃to.UKCity, BusRide u ∃to.EnglishCity) = = Transportation u ∃to.UKCity This is because the smallest conjunction of (negated) concept names that subsumes (w.r.t. the TBox T ) both Flight and BusRide is the concept Transportation, while the contribution of the existential restrictions is to.UKCity as the axiom EnglishCity v UKCity holds for the concepts in the scopes.
Fig. 1. DL-Tree returned by DL-L INK applied to: A ≡ Flight u ∃to.GermanDestination, B ≡ Flight u ∃to.ItalianDestination, C ≡ Train u ∃to.GermanDestination, D ≡ Train u ∃to.FrenchDestination.
III. C LUSTERING DL D ESCRIPTIONS In this section, the building of a DLs tree by means of a conceptual clustering method is illustrated. Clustering methods organize collections of objects into meaningful groups (clusters) [18] such that the intra-cluster similarity is high and inter-cluster similarity is low. Conceptual Clustering methods focus on techniques for supplying intensional descriptions of the clusters. In the following, a hierarchical agglomerative conceptual clustering algorithm for grouping DLs concept descriptions is presented. In our framework, resources may assumed to be described either as classes (DL concept descriptions) or as individuals (class instances). In this latter case, the MSC approximation (see Sect. II) of each individual is computed. Hence, in the following we will generically deal with resources represented as concept descriptions, regardless of their original actual description. A. Conceptual Clustering with
DL-L INK
Several hierarchical clustering algorithms have been proposed [18], mainly for grouping objects represented by feature vectors. The hierarchical approach starts by considering each object as the unique element of a distinct cluster and it iteratively merges the two most similar clusters until a single one remains. The output is a dendrogram6 representing the nested grouping of objects. In order to cluster DL concept descriptions we adapt the algorithm as follows: (a) concept similarities are computed by the use of the GCS-based similarity measure (see Sect. III-B) rather than the usual euclidean measure. The GCS-based measure is able to capture the expressiveness of the DL representation; (b) a conceptual clustering step is introduced: intensional cluster descriptions are computed as the GCS (see 6 A dendrogram is a nested grouping of objects and similarity levels at which grouping changes; it is mainly a tree and could be broken at different levels to yield different clustering of the data.
Fig. 2. Flattening the DL-Tree returned by DL-L INK where F, G and H are supposed to be equivalent concepts.
Sect. II) of the merged clusters. We name our algorithm DLL INK. It clusters ALE(T ) descriptions referring to an ALC TBox. Even if these DLs are less expressive than OWL, they are considered a good tradeoff between expressiveness and computational complexity. The algorithm is sketched in the following. DL-L INK(S) input S = {R1 , . . . , Rn } the set of available concept descriptions; output DL-Tree: dendrogram of the clustering process Let C = {C1 , . . . , Cn } be the set of initial clusters obtained by considering each Ri in a single cluster Ci ; DL-Tree = {C1 , . . . , Cn }; n := |C|; while n 6= 1 do for i, j := 1 to n Compute the similarity values sij (Ci , Cj ); Compute (Ch , Ck ) = argmaxi,j sij Create Cm = GCS(Ch , Ck ) the intensional descr. of the new cluster; Set Cm as parent node of Ch and Ck in DL-Tree; Insert Cm in C and remove Ch and Ck from C; n := |C|; return DL-Tree;
DL-L INK starts by considering each description in a single cluster (available clusters). The similarity between all couples of clusters is computed and the couple having the highest similarity value is selected, hence their GCS is computed and it is first set as their parent node, then it is inserted in the list of the available clusters while the selected ones are removed. Note that the GCS represents a cluster made by a single element, this means that, at each level of the clustering process, the clusters are always made by a single concept description. These steps are iteratively repeated, until a unique cluster (describing all resources) is obtained. The resulting dendrogram is named DL-Tree (Fig. 1).
Fig. 4. Concepts C ≡ CreditCardPayment, D ≡ DebitCardPayment are similar as the extension of their GCS ≡ CardPayment does not include many other instances besides of those of their extensions.
Fig. 3. Update of the DL-Tree of Fig. 1 when Z ∃to.FrenchDestination is added.
≡ Flight u
The dominant operation in DL-L INK is the computation of the similarity values. It is performed n2 times, n being the number of the available resources. This operation requires determining the GCS of two concept descriptions (see Sect. III-B), which is an NP-complete problem. Even though the worst case computational effort required by DL-L INK is considerable, this does not represent a real problem since the construction of a DL-Tree is performed only once (or rarely), in a batch process, preliminarily to a likely long series of retrieval tasks. DL-Tree is a binary tree since, at each step, two clusters are merged. However, it may happen that some sibling nodes as well as some parent and child nodes share the same intensional description. To remove this kind of redundancy, a post-processing step is performed which consists in merging sibling nodes and/or parent and child nodes represented by equivalent concepts, and adjusting the related children. The result of this flattening process is an n-ary DL-Tree as depicted in Fig. 2. When a new resource is made available after the construction of the DL-Tree, it has to be included in it. The DL- LINK needs not to be executed again, the DL-Tree can be updated by simply computing the similarity of the new resource with each leaf node of the DL-Tree and then adding the new concept as a sibling of the concept having the highest similarity and then recursively recomputing the description (namely the GCS) of the parent nodes until the root is reached. This means that only a branch is involved rather than the whole DL-Tree (see Fig. 3). B. The GCS-based Similarity Measure The choice of the similarity measure is a crucial point in the clustering process. It can affect the quality of the clusters obtained. Assessing the similarity in DL representations is being investigated [7], [11], [19]. For the DL-L INK algorithm, we adopt the GCS-based similarity measure [12]. It is grounded on the concept extensions restricted to the individuals occurring in the KB. Instead of counting the instances of the overlap concept [9], the similarity is determined by the variation of the number of instances in the extensions of the considered concepts w.r.t. the number of instances in
Fig. 5. Concepts C ≡ CarTransfer, D ≡ DebitCardPayment are different as the extension of their GCS ≡ Service includes many other instances besides those in the extension of C and D.
the extension of their common super-concept represented by the GCS (see Sect. II). The measure is formally defined as follows [12]: Def.[GCS-based similarity] Given an ALC TBox T and C, D ALE(T )-concept descriptions, the similarity measure is s : ALE(T ) × ALE(T ) → [0, 1] m g m s(C, D) = · 1− I · 1− g |∆ | g where m = min(|C I |, |DI |), g = |(GCS(C, D))I |, and (·)I represents the concept extension w.r.t. the interpretation I (canonical interpretation). The GCS has been chosen because it is neither too specific as the ALC LCS (that simply amounts to the disjunction of the considered concepts), nor too general as the ALE LCS computed without considering the KB (see Sect. II). Moreover, in the measure definition, the minimum extension of the concepts is considered to avoid the incorrect case of high similarity value when one of the two concepts is very similar to the super-concept while the other is very different. It can be easily proved [12] that the function of Def. III-B is a similarity measure, following the formal definition in [6]. The rationale of the measure is that if two concepts are semantically similar their GCS is close to them in the subsumption hierarchy, namely the GCS and the input concept extensions share many instances (Fig. 4). If the concepts are very different, their GCS will be high up in the hierarchy and will have many instances that do not belong to the extensions of the input concepts (Fig. 5). Opposite to other semantic similarity measures, this rationale does not require the overlap of the compared concepts [31], [7], [10], and does not take into account the structural path distance between concepts [30], [23], [26]. The measure combines extensional
size of concept expressions (to reflect their model semantics) and an intensional generalization (the GCS) of the considered concepts, so to exploit also the given KB. IV. T HE R ETRIEVAL P ROCEDURE Resource retrieval is the task of locating available resources that satisfy a given request. Usually, this is done by matching the request with all available resource descriptions. We will refer to this approach as to the linear matching approach, since the computational complexity depends linearly on the number of available resources. We present an algorithm that speeds up the retrieval process by exploiting the DL-Tree index (see Sect. III). Similarly to the logarithmic search methods, the rationale of the retrieval procedure is to cut off entire sections of the search space to reduce the number of comparisons (matches) that are required for finding the resources. The proposed procedure is presented in the following. retrievalProcedure(R, C) input R: request; C: DL-Tree root output retrievedConcepts: list of retrieved concepts
matchTest(RC, N C) input RC: request concept; N C: DL-Tree node concept output result: boolean
retrievedConcepts = []; if (RC v N C) then if matchTest(R, C) then return true if not (C.isLeafNode()) then else childNodes = C.getChildNodes() return false for (i = 0; i < childNodes.size(); i++) childNode = childNodes[i] if matchTest(R, childNode) then conceptsFromNode = retrievalProcedure(R,childNode) retrievedConcepts.add(conceptsFromNode) if retrievedConcepts.size() == 0 then retrievedConcepts.add([C]) else retrievedConcepts.add([C]) return retrievedConcepts
Both the request and resources are DL concept descriptions. Recall that in the DL-Tree the leaf nodes contain the descriptions of the available resources, while inner nodes (including the root) contain intensional descriptions of the groups of resources represented by their child nodes (see Sect. III). Given a request R, the retrievalProcedure checks immediately if the matchTest is satisfied by the concept at the root of the DL-Tree. As a match test, concept subsuption (specifically if the concept subsumes the request) is exploited [24]. The reasons are twofold: 1) computational complexity can be saved with respect to the usage of concept retrieval (see discussion in Sect. I); 2) only resources that are able to fully satisfy the request are selected. Since the root of DL-Tree contains the intensional description accounting for all the available resources, if the match test is not satisfied, this means that there are no available resources that can satisfy R and the procedure stops at the first step7 . Conversely, if the matchTest is satisfied, retrieval must search through each child node. When a child node does not satisfy the matchTest, the rest of the sub-tree that is rooted in that node is discarded. Otherwise, all the child nodes in the branch are recursively explored until 7 Note
that this early stopping behavior greatly differs from the mentioned linear matching approach that would require n tests.
Fig. 6. Retrieval of a service satisfying a request R. Red bold boxes represent nodes satisfying the MatchTest.
a leaf node is reached or until a match failure determines the end of the search in that sub-tree. The final list of concepts representing the retrieved resource descriptions is given by the leaf nodes that satisfy the match test or, if no leaf nodes are found, by the deeper inner nodes that satisfy the request. As it will be experimentally shown (Sect. V), resource retrieval with the DL-Tree index allows to decrease the number of match test w.r.t. the linear approach. Indeed, the exploitation of the DL-Tree drastically reduces the size of the search space to be explored, since search is restricted only to the branches whose nodes satisfy the match test w.r.t. to the request. A good clustering of n available resource descriptions may reduce the number of matches needed for finding the right resources from O(n) (as for the linear matching approach) to O(log n) depending on the specificity of the request, consequently shortening the response time for a request. An example of the retrieval process performed by retrievalProcedure is reported in Fig. 6, where available resources are descriptions of services performing flights and/or supplying several kinds of accommodation and a request for a flight from Cologne to Bari is specified. Once the final list of concepts representing the retrieved resource description is returned by retrievalProcedure, their instances, which eventually are the actual resources, are collected (by means of the concept retrieval (Sect. II) inference procedure) in order to assess which of them are also instances of the request8 . In order to do this, for each collected individual, the instance check (see Sect. II) w.r.t. to the request R is performed. As it will be shown in the experimental evaluation (Sect. V), resource retrieval with the DL-Tree index allows to decrease the number of instance checks, since the descriptions found are very specific. retrievalProcedure using the DL-Tree index is sound and complete. For further improving the efficiency of the resource retrieval process, its completeness could be weakened as follows. Performing retrievalProcedure, it may happen that, 8 Since the match test consists in checking if the resource description subsumes the request, there can be some individuals that are instances of the resource description but that are not instances of the request.
at some level, more than one node satisfies the matchTest. In this case all paths rooted in such nodes should be explored. As an alternative (not considered for the experiments in Sect. V), an heuristic could be adopted for suggesting the path to follow i.e. the path rooted in the node that is most similar to the request.
TABLE I AVERAGE NUMBER OF MATCH TESTS (AM) AND EXECUTION TIME (AMET) FOR FINDING ALL RESOURCES SATISFYING A REQUEST IN THE THREE SETTINGS . DATA S ET
retrieval criterion DL-Tree
U NIVERSITY
C. Hierarchy
V. E XPERIMENTAL E VALUATION
Linear
In this section we experimentally show that the proposed method improves the resource retrieval task w.r.t. to the traditional linear search. Moreover, we compare our method to a possible alternative approach based on the exploitation of the tree structure deriving from the subsumption hierarchy of the concepts in the KB.
DL-Tree
A. Data Sets The retrievalProcedure (see Sect. IV) has been applied to a number of data sets available on the Web: (a) SWSD, the S EMANTIC W EB S ERVICE D ISCOVERY9 data set, is a collection of Web services described for the DL-based framework presented in [15]. It consists of an ALC ontology representing the KB of reference and 100 services described in ALE(T ); (b) W INE, an ontology10 written in ALCIO(D) and made up of 112 concepts, 9 object properties and 188 distinct individual names; (c) U NIVERSITY, an ontology, built from the Lehigh University Benchmark11 , written in ALR+ HI(D) logic and made up of 43 concepts, 25 object properties and 555 distinct individual names; (d) F INANCIAL, an ontology12 that was employed as a testbed for Pellet, written in ALCIO(D) and made up of 112 concepts, 9 object properties and 1000 distinct individual names. B. Methodology The service descriptions (described as DL concept descriptions) in the SWSD data set have been clustered by means of DL-L INK algorithm, obtaining a DL-Tree with a maximum depth equal to 7 (even if most of the leaf nodes were at level 4) and an average branching factor equal to 5. As alternative tree, the subsuption concept hierarchy given by the SWSD data set has been considered. It has a maximum depth of 4 (even if most of the leaf nodes are at level 3) and an average branching factor equal to 7. 100 queries, corresponding to the actual resource descriptions, have been considered. As regards W INE ontology (where resources are the individuals in the ontology), the MSCs13 of all individuals have been computed and 93 distinct MSCs have been collected. They have been clustered by the DL-L INK algorithm, obtaining a DL-Tree with a maximum depth of 5 (even if most of the leaf nodes were at level 3 and 4) and with an average branching 9 https://www.uni-koblenz.de/FB4/Institutes/IFI/AGStaab/Projects/xmedia/ dl-tree 10 http://protege.stanford.edu/plugins/owl/ontologies.html 11 http://swat.cse.lehigh.edu/projects/lubm/ 12 http://www.cs.put.poznan.pl/alawrynowicz/financial.owl 13 For the purpose of the experiments, the MSCs approximations have been computed in ALE(T ) also for those ontologies described by languages more expressive.
SWSD
C. Hierarchy Linear
measure
results
AM AMET AM AMET AM AMET
16 199 ms 16 134 ms 21 397 ms
AM AMET AM AMET AM AMET
46 314 ms 41 224 ms 100 791 ms
DATA S ET
retrieval criterion DL-Tree
W INE
C. Hierarchy Linear DL-Tree
F INANCIAL
C. Hierarchy Linear
measure
results
AM AMET AM AMET AM AMET
36 244 ms 22 172 ms 93 868 ms
AM AMET AM AMET AM AMET
37 2793 ms 16 1077 ms 116 3657 ms
factor equal to 5. As alternative tree, the concept hierarchy given by the W INE ontology has been considered. It has a maximum depth equal to 7, even if most of the leaf nodes are at level 5 and an average branching factor equal to 5. The retrieval has been performed by the use of 93 queries, corresponding to the MSCs. Analogously, the MSCs of all individuals in the U NIVER SITY ontology have been computed and 21 distinct MSCs have been obtained. They have been clustered by the DL-L INK algorithm, obtaining a DL-Tree with a maximum depth of 4 (even if most of the leaf nodes were at level 3) and with an average branching factor equal to 5. As alternative tree, the concept hierarchy given by the U NIVERSITY ontology has been considered. It has a maximum depth equal to 5, even if most of the leaf nodes are at level 2 and 3, and an average branching factor equal to 5. The retrieval has been performed by the use of 21 queries, corresponding to the computed MSCs. As regards the F INANCIAL ontology, the MSCs of all individuals have been computed and 116 distinct MSCs have been collected. They have been clustered by the DL-L INK algorithm, obtaining a DL- Tree with a maximum depth of 5 (even if most of the leaf nodes were at level 3 and 4) and with an average branching factor equal to 8. As alternative tree, the concept hierarchy given by the F INANCIAL ontology has been considered. It has a maximum depth equal to 4, even if most of the leaf nodes are at level 2, and an average branching factor equal to 4. The retrieval has been performed by the use of 116 queries, corresponding to the MSCs. The efficiency of the method has been measured by counting 1) the average number of matches (subsumption tests) for finding all available resource descriptions satisfying a request, both in the trees (DL-Tree and concept hierarchy) and in the linear approach; 2) the average number of performed instance checks for assessing the individuals (instances of the retrieved resource descriptions) that are instances of the request; 3) the average execution time. The experiments have been performed on a PowerBook laptop, with a 1.67GHz G4 CPU and 1.5GB RAM. C. Results Results of the experiments are collected in Tab. I. Looking at the table it is straightforward to note that the retrieval
TABLE II AVERAGE NUMBER OF MATCH TESTS (AM) AND EXECUTION TIME (AMET) AND AVERAGE NUMBER OF INSTANCE CHECKS (AIC) AND EXECUTION TIME (AICET) AND THEIR TOTALS ( RESP. TM, TEC) FOR
the DL-Tree by merging more than two concepts at each steps could further improve the efficiency of the retrieval process.
FINDING ALL INSTANCES SATISFYING A REQUEST WITH THE PROCEDURES BASED ON THE DL-Tree AND ON THE C. H IERARCHY CRITERIA .
VI. R ELATED W ORK
DATA S ET
retrieval criterion
DL-Tree
U NIVERSITY
C. Hier.
DL-Tree
SWSD
C. Hier.
meas.
results
AM AMET AIC AICET TM TEC AM AMET AIC AMET TM TEC
16 199 ms 29 2063 ms 45 2262 ms 16 134 ms 135 6418 ms 151 6552 ms
AM AMET AIC AICET TM TEC AM AMET AIC AMET TM TEC
46 314 ms 2 10 ms 48 324 ms 41 224 ms 2 11 ms 43 235 ms
DATA S ET
retrieval criterion
DL-Tree
W INE
C. Hier.
DL-Tree
F INANCIAL
C. Hier.
meas.
results
AM AMET AIC AICET TM TEC AM AMET AIC AMET TM TEC
36 244 ms 2 24 ms 38 268 ms 22 172 ms 14 161 ms 36 333 ms
AM AMET AIC AICET TM TEC AM AMET AIC AMET TM TEC
37 2793 ms 28 4330 ms 65 6123 ms 16 1077 ms 139 20192 ms 155 21269 ms
performed by exploiting the DL-Tree index decreases the number of comparisons more than 50% (except for the U NI VERSITY data set where the the number of distinct resources is small) w.r.t. the linear approach. This consequently implies a reduction of the average execution time, as it is also evident from the table. However, the usage of DL-Tree requires a slightly higher number of comparisons w.r.t. the exploitation of the concept hierarchy. Nevertheless, since for finding the actual resources, the instance check of each individual in the extension of the found resource descriptions has to be performed w.r.t. the query concept, the average number of the instance checks and the corresponding mean execution time have been also considered (see Tab. II). Specifically, looking at Tab. II, it is possible to note that the DL-Tree based approach decreases the average number of instance checks of about 21% for U NIVERSITY data set, 20% for F INANCIAL data set and 14% for W INE data set. This is because, using the DL-Tree, more specific concepts w.r.t. those in the concept hierarchy can be found. The reduction of the average number of instance checks has a strong impact in the total execution time; indeed, looking at Tab. II, it is possible to see that, for U NIVERSITY, F INANCIAL and W INE data sets, the total execution time required by the DL-Tree based approach is lesser than that required by the concept hierarchy based approach. The considerations are different for the SWSD data set where, on average, the same number of instance checks are required while a fewer match comparisons are necessary for the concept hierarchy based approach. The main reason is that, in this case, queries correspond to concepts descriptions that are the leaf nodes of the concept hierarchy (this was not the case for the previous data sets), consequently we find the same number of instance checks for both approaches while the subsumption test are fewer for the case of the concept hierarchy which has a simpler structure w.r.t. DL-Tree. Even if this is a quite unusual situation, in cases like this, building
Various related works can be considered, depending on different point of views, namely: resource retrieval and service discovery, optimization techniques for TBox reasoning and ABox retrieval. Service discovery focuses on locating service descriptions that can satisfy a request. Generally, it is performed by matching a request against all available services, implying linear performance. Most of the works concerning service discovery focus on improving the effectiveness of the service matchmaking [24], [20], [21], [22], [1], only few works focus on improving the efficiency of the service discovery. In [33], a caching mechanism for outperforming the service discovery is proposed. This method is based on the exploitation of a graph obtained in two steps: 1) given a set of predefined template goals, a tree is built on the ground of the subsumption relationship; 2) available service descriptions satisfying the most general template goals are linked together. The main limitation of this approach is given by its dependence from the availability of template goals that are used to instantiate the actual request. Conversely, our approach does not require predefined template queries and it can be applied to heterogeneous services. In [32], an efficient discovery method, grounded on the adoption of an R-Tree index, is proposed. Differently from our work where resources are assumed to be described by DL languages, here services are firstly assumed to be described by OWL-S and then they are transformed into an interval representation which allows to perform range queries for searching services satisfying a certain request. If a new service is available, the R-Tree needs to be entirely recomputed, differently from our approach where only a branch of the DL-Tree needs to be updated. In [8], resource retrieval is improved by using a tree structure that is built exploiting the notion of interval constraints that are adopted for obtaining an upper bound and a lower bound descriptions of the available resources. Hence, they are linked by the use of logical inferences. This approach requires the creation of two descriptions for every available resource plus the creation of inner nodes for building the tree structure. In [29], efficient resource retrieval is performed by abstracting from a more expressive to a less expressive language, e.g. from OWL to DL-lite. This approach is semantically sound but, differently from our method, it looses completeness. Other efforts have been addressed for the optimization of reasoning and query answering. In [17], a set of optimization techniques for improving tableaux decision procedures for DLs are presented. They could be adopted for performing the matching task during the resource retrieval process. for performing service matchmaking. In [28], an algorithm for optimizing the query answering of SHIQ knowledge bases extended with DL-safe rules is proposed, by exploiting the reduction to disjunctive programs. M¨oller et. al. [27] propose optimization techniques for improving the scalability of the
instance retrieval task. This is orthogonal to our work as our method could be used also for performing instance retrieval by firstly clustering the MSCs of the considered KB and then querying for the concept of interest, by checking for nodes of the DL-Tree that are subsumed by the query concept. VII. C ONCLUSION AND F UTURE W ORK We have presented a sound and complete method for improving the efficiency of the resource retrieval task. Its validity has been experimentally shown. The method is based on the exploitation of a tree-index (the DL-Tree) that is built by applying a new conceptual clustering algorithm (DL-L INK) to available resource descriptions modeled by DL languages. For clustering resources, a semantic similarity measure has been exploited, while intensional cluster descriptions are generated by the use of the GCS of ALE(T ) concept descriptions referring to an ALC ontology, acting as KB of reference. As expected, we have experimentally showed that the method is especially effective when very specific resources are searched, as it is able to cut off the part of the search space where target resources are absent. The experiments also showed that, while linear search is clearly outperformed, the concept hierarchy deriving from the concept definitions may be a possible alternative, especially when resources are described as leaf nodes of an existing ontology or when the various concepts are evenly populated. Conversely unbalanced taxonomies (in terms of the concept extensions) may compromise the retrieval process also because of their structure. The DLTree index may be useful in case of flat or coarse-grained concept structures deriving from the TBox axioms, as they are built by a data-driven process. Further improvements of the presented method could be obtained along a number of directions: 1) merging more than two descriptions at each step during the construction of the DL-Tree to speed-up the clustering process; 2) building the DL-Tree using a top-down divisional clustering approach that ensures the disjointness of the nodes in the tree at the same level; 3) implementing an incremental clustering algorithm to cope with the availability of new resources; 4) testing the validity of the proposed method, modified for accepting different node-matching procedures; 5) assigning intensional cluster descriptions by exploiting supervised learning methods [14]. R EFERENCES [1] A. Averbakh, D. Krause, and D. Skoutas. Exploiting user feedback to improve semantic web service discovery. In Proc. of Int. Semantic Web Conference, pages 33–48, 2009. [2] F. Baader, R. K¨usters, and R. Molitor. Computing least common subsumers in description logics with existential restrictions. In T. Dean, editor, Proc. of IJCAI, pages 96–101. Morgan Kaufmann, 1999. [3] F. Baader, R. Sertkaya, and Y. Turhan. Computing least common subsumers w.r.t. a background terminology. In V. Haarslev and R. M¨oller, editors, Proc. of the Int. Workshop on DLs. CEUR-WS.org, 2004. [4] F. Baader et al., editor. The Description Logic Handbook. Cambridge University Press, 2003. [5] T. Berners-Lee, J. Hendler, and O. Lassila. The semantic web. Scientific American Magazine, 2001. [6] H. Bock and E. Diday. Analysis of symbolic data : exploratory methods for extracting statistical information from complex data. SpringerVerlag, 2000.
[7] A. Borgida, T. Walsh, and H. Hirsh. Towards measuring similarity in description logics. In I. H. et al., editor, Proc. of the Int. Workshop on DLs, volume 147. CEUR-WS.org, 2005. [8] I. Constantinescu, W. Binder, and B. Faltings. Flexible and efficient matchmaking and ranking in service directories. In Proc. of the IEEE Int. Conf. on Web Services., 2005. [9] C. d’Amato, N. Fanizzi, and F. Esposito. A semantic similarity measure for expressive description logics. In A. Pettorossi, editor, Proc. of CILC05, 2005. [10] C. d’Amato, N. Fanizzi, and F. Esposito. A dissimilarity measure for ALC concept descriptions. In Proc. of the 21st Annual ACM Symposium of Applied Computing, SAC2006, 2006. [11] C. d’Amato, S. Staab, and N. Fanizzi. On the influence of ontologies on conceptual similarity. In Proc. of the Int. Conf., EKAW2008, volume 5268 of LNAI, pages 48–63. Springer, 2008. [12] C. d’Amato, S. Staab, N. Fanizzi, and F. Esposito. Efficient discovery of services specified in description logics languages. In Proc. of the ISWC Workshop on Service Matchmaking and Resource Retrieval in the Semantic Web., 2007. [13] F. M. Donini, M. Lenzerini, D. Nardi, and A. Schaerf. Deduction in concept languages: From subsumption to instance checking. Journal of Logic and Computation, 4(4):423–452, 1994. [14] N. Fanizzi, C. d’Amato, and F. Esposito. DL-FOIL - Concept Learning in Description Logics. In Proc. of ILP2008, volume 5194 of LNAI, pages 107–121. Springer, 2008. [15] S. Grimm, B. Motik, and C. Preist. Variance in e-business service discovery. In Proc. ISWC Workshop on Semantic Web Services, 2004. [16] M. Hepp. A methodology for deriving owl ontologies from products and services categorization standards. In Proc. ECIS 2005, 2005. [17] I. Horrocks. Optimising Tableaux Decision Procedures for Description Logics. PhD thesis, University of Manchester, 1997. [18] A. Jain, M. Murty, and P. Flynn. Data clustering: A review. ACM Computing Surveys, 31(3):264–323, 1999. [19] J. Janowicz and M. Wilkes. sim − dla : A novel semantic similarity measure for description logics reducing inter-concept to inter-instance similarity. In Proc. of the Europ. Semantic Web Conf., pages 353–367. Springer, 2009. [20] U. Keller, R. Lara, H. Lausen, A. Polleres, and D. Fensel. Automatic location of services. In Proc. of ESWC, volume 3532 of LNCS, 2005. [21] M. Klusch, B. Fries, and K. Sycara. Automated semantic web service discovery with owls-mx. In Proc. of the int. conf. on Autonomous agents and multiagent systems, pages 915–922. ACM Press, 2006. [22] M. Klusch, P. Kapahnke, and I. Zinnikus. Hybrid adaptive web service selection with sawsdl-mx and wsdl-analyzer. In Proc. of Europ. Semantic Web Conference, pages 550–564, 2009. [23] J. Lee, M. Kim, and Y. Lee. Information retrieval based on conceptual distance in is-a hierarchies. Journal of Documentation, 2(49):188–207, 1993. [24] L. Li and I. Horrocks. A software framework for matchmaking based on semantic web technology. In Proc. of the int. conf. on World Wide Web, pages 331–339. ACM Press, 2003. [25] T. Mantay. Commonality-based ABox retrieval. Technical Report FBI-HH-M-291/2000, Department of Computer Science, University of Hamburg, Germany, 2000. [26] D. Maynard, W. Peters, and Y. Li. Metrics for evaluation of ontologybased information extraction. In Proc. of EON 2006 Workshop, 2006. [27] R. M¨oller, V. Haarslev, and M. Wessel. On the scalability of description logic instance retrieval. In I. B. P. et al. (Eds.), editor, Proc. of DL2006, volume 189. CEUR, 2006. [28] B. Motik, U. Sattler, and R. Studer. Query answering for owl-dl with rules. Journal of Web Semantics, 3(1):41–60, 2005. [29] J. Pan, E. Thomas, and D. Sleeman. Ontosearch2: Searching and querying web ontologies. In Proc. of IADIS Int. Conf., 2006. [30] R. Rada, H. Mili, E. Bicknell, and M. Blettner. Development and application of a metric on semantic nets. IEEE Transactions on System, Man, and Cybernetics, 19(1):17–30, 1989. [31] P. Resnik. Semantic similarity in a taxonomy: An information-based measure and its application to problems of ambiguity in natural language. Journal of Artificial Intelligence Research, 11:95–130, 1999. [32] D. Skoutas, D. Sacharidis, V. Kantere, and T. Sellis. Efficient semantic web service discovery in centralized and p2p environment. In Proc. of ISWC 2008, pages 583–598, 2008. [33] M. Stollberg, M. Hepp, and J. Hoffmann. A caching mechanism for semantic web service discovery. In Proc. of ISWC 2007, 2007.