Detecting Ontology Change from Application Data Flows

Paolo Ceravolo and Ernesto Damiani
Università di Milano - Dipartimento di Tecnologie dell'Informazione
Via Bramante, 65 - 26013 Crema - Italy
{ceravolo, damiani}@dti.unimi.it
Abstract. In this paper we describe a clustering process that selects a set of typical instances from a document flow. These representatives are viewed as semi-structured descriptions of domain categories expressed in a standard semantic web format, such as OWL [15]. The resulting bottom-up ontology may be used to check and/or update existing domain ontologies used by the e-business infrastructure.
Keywords: Knowledge Representation, Ontology Construction
1 Introduction
Many successful electronic commerce applications use Web-based mediators to augment or replace human middlemen. Mediators organize data interchange based on high-level metadata, mapping domain resources to an interchange format; these metadata enable a variety of sharing, profiling, and querying services on domain data. Inter-organization metadata rely on a shared domain vocabulary whose terms are identified by an authority imposing a normative standard, often called an ontology. The adoption of a standard ensures that the description of every resource will be written using terms provided by the authority via the shared vocabulary. However, this top-down process of metadata creation is often difficult to automate, and can be ineffective if the standard vocabulary does not cover all the semantic areas of interest for applications. An alternative way to produce a common vocabulary is the bottom-up extraction of knowledge from the business process data flow. A natural way of managing the business process data flow is via web services [7]. At this level, business processes can be described in terms of web services providing functionalities and data. In business platforms based on web services, the data to be exchanged are described via XML Schema definitions. Individual SOAP messages contain data in XML format [13]; also, according to the ebXML standard for e-business, entire transactions are composed of business messages whose payloads contain XML documents [14] conforming to application-oriented XML Schemata. One could therefore think of using these application-level XML Schemata to extract high-level metadata; unfortunately, experience has shown that this is
rarely possible. XML Schema definitions used for business message payloads need to cover a wide repertoire of possible interchanges; therefore, they largely use structural mark-up rather than describing data semantics. For this reason, we propose a technique for extracting metadata updates based on structural patterns detected in a semi-structured XML data flow. Our clustering technique has been described in other works, such as [4] and [2]. In those papers, we focused on the clustering process that selects a set of typical instances from a document flow. In the present work, these representatives are viewed as semi-structured descriptions of domain categories, and organized as an ontology expressed in a standard semantic web format, such as OWL [15]. The resulting ontology is used to check and/or update existing domain ontologies used by data mediators. The paper is organized as follows: in Section 2 we address the problem of detecting instances from a data flow; in Section 2.1 we provide a brief discussion of the motivations triggering changes to a domain ontology, and we specify the type of changes that are detected in our application; in Section 3 we describe the ontology construction process.
1.1 Related work
Ontology building methodologies that restrict their attention to semi-structured data integration usually deal with XML schemata. Schemata are exploited as semantic definitions of portions of information and are integrated into a broader representation. To be precise, schema integration methodologies are not always semantic-aware. For example, the MIX project, as well as the Grammar Based Model, formalizes integration rules on canonical tree-based models used to represent local DTDs and integrated schemata. However, in this work we are not interested in merely structural approaches. The general strategy applied in semantic-aware approaches is to define an intermediate conceptual schema for mapping the data to be integrated. The MOMIS project proposes an approach that merges, in a bottom-up way, structured and semi-structured data sources. To achieve this, several rules and heuristics are applied in order to capture as much as possible of the semantics of the elements of the DTDs. In the Clio project [9], data source schemata are first transformed into an internal representation; then, after the mappings between the source and the target schemas have been semi-automatically derived, the system materializes the target schema with the data of the source, using a set of internal rules based on the mappings. DIXSE [10] follows a similar approach, transforming the DTD specifications of the source documents into an inner conceptual representation, with some heuristics to capture semantics. Most work, though, is done semi-automatically by the domain experts, who augment the conceptual schema with semantics. The approach in [11] relies on an abstract global DTD, expressed as a tree, which is very similar to a global ontology. The connection between this DTD and the DTDs of the data sources is established through path mappings: each path between two nodes in a source DTD is mapped to a path in the abstract DTD. Then, query rewriting is employed to query the sources.
2 Detecting new types of instances
In the remainder of the paper, we start from the assumption that the individual data items are XML fragments complying with an application-level XML Schema. We also assume that this XML Schema is not intended to provide a partition of the domain instances into categories, and does not provide any hint on what "typical" data instances will look like. By mining the data flow, we shall detect recurrent types of domain instances, based on how individual XML elements are detailed, both in the number and type of their sub-nodes and in the content of the nodes. Typical instances are extracted by clustering the data of the transaction flow; we then use them to add new classes to our domain representation.
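To make the idea concrete, the following minimal sketch groups XML fragments by a simple structural signature (root tag plus the multiset of direct child tags) and keeps one representative per group. The signature function and the sample fragments are our own simplifying assumptions; they stand in for, and do not reproduce, the fuzzy clustering technique of [4] and [2].

```python
from collections import Counter, defaultdict
import xml.etree.ElementTree as ET

def structural_signature(fragment: str) -> tuple:
    """Summarize an XML fragment by its root tag and the multiset of direct child tags.

    A deliberately simple stand-in for the richer structural features (sub-node
    types and node content) used by the clustering process described in the paper.
    """
    root = ET.fromstring(fragment)
    child_tags = Counter(child.tag for child in root)
    return (root.tag, tuple(sorted(child_tags.items())))

def cluster_by_structure(fragments):
    """Group fragments sharing the same signature and pick one representative each."""
    clusters = defaultdict(list)
    for frag in fragments:
        clusters[structural_signature(frag)].append(frag)
    # The first member of each group stands in for the cluster's "typical instance".
    return {sig: members[0] for sig, members in clusters.items()}

if __name__ == "__main__":
    flow = [
        "<Message><Sender><Name>ACME</Name></Sender>"
        "<Order><Product/><Product/><Product/></Order></Message>",
        "<Message><Sender><Name>Foo</Name></Sender>"
        "<Order><Product/><Product/><Product/></Order></Message>",
        "<Message><Sender><Name>Bar</Name></Sender>"
        "<Order><Product/><Discount/></Order></Message>",
    ]
    for signature in cluster_by_structure(flow):
        print(signature)  # two distinct structural clusters emerge
```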
2.1 Ontology changes
Evolving and updating a shared conceptualization such as an ontology is by no means an easy task, and usually requires the knowledge of a qualified domain expert. Motivations triggering changes to an ontology can be classified as follows:

– Conceptualization tuning: the new concept roughly represents the same concept as the old one, but does not have exactly the same instances. This can depend on concept scope, extension, or granularity.
– Expression tuning: depending on the relations among concepts, the same conceptualization can be expressed in different ways. For example, a distinction between two classes can be modeled using a qualifying attribute or by introducing a separate class (see the illustration below).
– Terminology tuning: the same concept is described by means of synonyms or in terms of different encoding values (for instance, distances can be expressed in miles or kilometers).

An extended discussion on ontology mismatches can be found in [12] and [3]. As we shall explain in detail in Section 3.2, in our approach new instance types will always comply with the application-level XML Schema. For this reason, we shall focus on conceptualization tuning, i.e. on extension or granularity changes.
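As a purely illustrative example of expression tuning (the class and property names are ours, not part of the paper's running example), the notion of a domestic order could be modeled either by a qualifying attribute on the existing Order class or by a dedicated subclass:

DomesticOrder ≡ Order ⊓ ∃ destinationCountry.{Italy}    (attribute-based modeling)
DomesticOrder ⊑ Order    (modeling via a separate class)

Both choices capture the same conceptualization; only its expression differs.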
3 The ontology construction process
In order to describe the individual steps of our ontology construction process, we shall ground our discussion in an example. Fig. 1 shows an XML Schema of a generic purchase order service. This schema defines the message level of a Web-service based business transaction. In the first step, an initial set of basic classes is defined, providing a coarse-grained representation of the domain. This representation is not satisfactory by itself: its classes are intended as candidate classes, to be detailed during the second step and possibly removed from the representation. This initial representation, although not very informative, can be created manually or easily
extracted from the XML Schema used for message payloads. We shall briefly describe this procedure in Section 3.1. In the second step, new classes are induced by analyzing the data flow. Our clustering process dynamically partitions the flow into clusters of data items such as those in Fig. 3, each featuring a typical representative of the cluster. Then, we use such representatives to define new classes that can differ from the initial representation because of the cardinality of properties or because of the properties' values. For example, Message1 is a message without any order destination address, with an order composed of three products, and without an associated discount. These new classes can be expressed in terms of basic classes, by specifying restrictions on the class properties. In particular, the restrictions we are interested in expressing concern property cardinality and property range type or values. For these reasons, the minimum language expressivity we need is that of ALQ (a first-order language restricted to formulas with two variables and allowing qualified restrictions on roles), and OWL, which is based on the SHIQ(D) language, covers these expressivity requirements [6]. In a naive approach, we could translate the newly induced classes directly into OWL, creating a new Message class constrained in its property cardinalities. But such a translation policy would not be effective, as it would produce one class per representative. For instance, Message2 is a class very similar to the previous one, the only difference being the Discount property. For this reason, we adopt a lazy approach. Rather than transforming the extracted information into change operations on our initial domain representation, we store it using an intermediate representation. At this intermediate level, candidate classes are maintained in an XML format: each candidate class is described via a complex XML element composed of sub-nodes, as shown in Fig. 2. This intermediate representation greatly simplifies the manipulation of the class hierarchy. More importantly, thanks to the intermediate layer, the definition of the change operations updating the ontology can be delayed until the hierarchy becomes stable.
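A minimal sketch of how such candidate classes might be recorded in the intermediate XML representation is given below. The element and attribute names (CandidateClass, PropertyRestriction, and so on) are hypothetical; the actual layout used by the system is the one shown in Fig. 2.

```python
import xml.etree.ElementTree as ET

def candidate_class(name: str, base: str, restrictions: dict) -> ET.Element:
    """Build an intermediate-representation entry for a candidate class.

    `restrictions` maps a property name to the (min, max) cardinality observed
    in the cluster representative; the XML layout here is a hypothetical
    stand-in for the format of Fig. 2.
    """
    entry = ET.Element("CandidateClass", name=name, base=base)
    for prop, (lo, hi) in restrictions.items():
        ET.SubElement(entry, "PropertyRestriction", property=prop,
                      minCardinality=str(lo), maxCardinality=str(hi))
    return entry

# Message1: no destination address, exactly one order; its order has three
# products and no discount (cf. the clusters of Fig. 3).
message1 = candidate_class("Message-I", base="Message",
                           restrictions={"destination": (0, 0), "order": (1, 1)})
order1 = candidate_class("Order-I", base="Order",
                         restrictions={"product": (3, 3), "discount": (0, 0)})
print(ET.tostring(message1, encoding="unicode"))
```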
3.1 Setting the initial ontology
Setting up an initial outline of a domain representation from scratch is a task at which human experts excel; it is also known to be the fastest and least expensive phase of traditional domain modeling, and the most difficult to automate effectively [8]. As an alternative, if domain metadata such as database or XML schemata are available, a quick-and-dirty automatic construction of a preliminary domain model made of basic classes can be done by reverse engineering [1]. Specifically, when a reasonably well-behaved application-level XML Schema is available, it is just a matter of enumerating its elements belonging to complex types¹. Then, XML elements can be easily translated into OWL by transforming each XML element into a class having as name the element name and as properties its sub-elements. If a sub-element contains its own sub-elements, the corresponding property is declared an owl:ObjectProperty and takes as range a class having the sub-element name. If a sub-element is a leaf (i.e., it does not contain sub-elements), the property is declared an owl:DatatypeProperty. Classes produced by means of this process are inserted into the intermediate representation and are intended as candidate classes, to be confirmed only after the optimization of the hierarchy maintained in the intermediate representation. Fig. 2 shows a sample initial ontology created from a partition of the XML Schema. Its classes provide an initial representation of the domain, to be detailed and updated by means of knowledge extracted from the data flow.

¹ For the sake of conciseness, we shall not explain in detail here the naming conventions to be observed in XML element and ComplexType declarations for an XML Schema to be considered well-behaved w.r.t. knowledge representation. Unfortunately, these conventions are not always followed in real-world applications.

Fig. 1. An XML Schema of a generic purchase order service.
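The element-to-class transformation just described could be sketched roughly as follows. The sketch assumes locally declared (anonymous) complex types wrapped in xs:sequence, and merely records which OWL construct each sub-element would map to; it is an illustration of the rule above, not the actual implementation.

```python
import xml.etree.ElementTree as ET

XSD = "{http://www.w3.org/2001/XMLSchema}"

def schema_to_classes(xsd_source: str) -> dict:
    """Derive candidate OWL classes from the complex-typed elements of an XML Schema.

    Each complex-typed element becomes a class named after the element; its direct
    sub-elements become properties: owl:ObjectProperty when the sub-element is
    itself complex, owl:DatatypeProperty when it is a leaf.
    """
    root = ET.fromstring(xsd_source)
    classes = {}
    for element in root.iter(XSD + "element"):
        complex_type = element.find(XSD + "complexType")
        if complex_type is None:
            continue  # simple-typed elements do not yield classes
        properties = {}
        for child in complex_type.findall(XSD + "sequence/" + XSD + "element"):
            kind = ("owl:ObjectProperty"
                    if child.find(XSD + "complexType") is not None
                    else "owl:DatatypeProperty")
            properties[child.get("name")] = kind
        classes[element.get("name")] = properties
    return classes

if __name__ == "__main__":
    sample = """
    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
      <xs:element name="Message">
        <xs:complexType><xs:sequence>
          <xs:element name="Sender">
            <xs:complexType><xs:sequence>
              <xs:element name="Name" type="xs:string"/>
            </xs:sequence></xs:complexType>
          </xs:element>
          <xs:element name="Subject" type="xs:string"/>
        </xs:sequence></xs:complexType>
      </xs:element>
    </xs:schema>"""
    print(schema_to_classes(sample))
    # {'Message': {'Sender': 'owl:ObjectProperty', 'Subject': 'owl:DatatypeProperty'},
    #  'Sender': {'Name': 'owl:DatatypeProperty'}}
```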
3.2 Detecting detailed class descriptions
We are now ready to use the extracted knowledge to improve the quality of the initial model. Fig. 3 shows some examples of clusters extracted from the e-business data flow. These clusters are specific instantiations of the schema of the service messages. Note that clusters can be composed only of valid schema elements: a complex XML element may or may not occur in a cluster, but if it does, it necessarily complies with the schema. Obviously, XML elements belonging to SimpleTypes can contain different values. In order to extract knowledge from our clusters, we start by partitioning them into basic classes. Then we compare the basic classes with the existing candidate classes and decide whether the cluster provides new candidate classes; if so, they are added to the set of existing candidates. For example, by partitioning Message1 we obtain two new basic classes. We call the first class Message-I: it differs from the original Message because of the destination element. We call the second Order-I: it is an Order having three product elements and no discount. The Sender class does not differ from the initial Sender class extracted from the schema, and therefore provides no new candidate. Fig. 4 shows graphically the new basic classes extracted from the flow in Fig. 3. Using the new basic classes, Message1 can be described as a Message-I class having Order-I and Sender as sub-nodes. Note that when evaluating complex elements we only take into account direct sub-elements. This prevents an overproduction of class definitions.

Fig. 2. Initial class induction.
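The partition-and-compare step could be sketched as follows; representing a basic class by the multiset of its direct sub-element tags is our own simplification of the richer descriptions used in the paper.

```python
from collections import Counter
import xml.etree.ElementTree as ET

def basic_classes(representative_xml: str) -> dict:
    """Partition a cluster representative into basic class descriptions.

    Every complex element (one with children) is described by the multiset of its
    *direct* sub-element tags; nested elements get their own entry, which keeps
    the number of generated class definitions small.
    """
    root = ET.fromstring(representative_xml)
    descriptions = {}
    for elem in root.iter():
        if len(elem):  # complex element: has direct sub-elements
            descriptions.setdefault(elem.tag, Counter(child.tag for child in elem))
    return descriptions

def new_candidates(representative_xml: str, known: dict) -> dict:
    """Keep only the descriptions that differ from the already known candidate classes."""
    return {tag: desc for tag, desc in basic_classes(representative_xml).items()
            if known.get(tag) != desc}

# A hypothetical Message1 representative: no destination, three products, no discount.
message1 = ("<Message><Sender><Name>ACME</Name></Sender>"
            "<Order><Product/><Product/><Product/></Order></Message>")
known = {"Sender": Counter({"Name": 1})}   # Sender matches the schema-derived class
print(new_candidates(message1, known))     # Message and Order yield new candidates
```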
Fig. 3. Examples of clusters from the business transaction.
This way, new classes are expressed in terms of restrictions on the initial domain representation. Assertion (1) shows how we can express this information according to a Description Logic formalism:

Message-I ⊑ Message
Message-I ⊑ ∀ order.Order-I     (1)
The new class is described by means of a set of complex class definitions. Currently we use two types of definitions:

– specialization, specifying that the extracted class is a subtype of a class already present in the domain;
– restriction, specifying that the extracted class is obtained from a class already present in the domain by restricting the cardinality or range of its properties.

Note that we interpret specialization as a necessary (although not sufficient) relation between the newly defined class and the class used in its definition. In fact, we declared Message-I as a subclass of Message (instead of defining it as an equivalent class) because we have a more detailed view of the instance descriptions. Also, we used a universal operator to express the restriction on the property order, because the cluster is intended to be representative of a set of transactions, and its description must be valid for each instance.
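For instance, the Order-I class of the running example (an Order with exactly three products and no discount) could be written with both definition types; the exact encoding below, including the Product class name, is our illustration rather than a formula taken from the system:

Order-I ⊑ Order
Order-I ⊑ (= 3 product.Product) ⊓ (≤ 0 discount.⊤)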
Fig. 4. The new basic classes extracted from the flow.
The above example shows how the new class increases the knowledge embedded in the initial domain representation, and tunes it to the specific application setting.

3.3 Setting up the ontology hierarchy
Periodically, our system uses the available candidate classes to set up or update a hierarchy. This can be done simply by connecting each class to all the other classes it subsumes. These operations are performed on the intermediate representation, which simplifies our manipulation task. Classes are connected by means of is-a relations, and the basic task we have to execute is to decide when a class has to be connected to another. A class A is said to be is-a a class B when it is more general. For our purposes, we cannot use a standard DL reasoning engine: DL reasoners are based on algorithms for subsumption resolution, where the generality of an assertion is established with respect to a set of axioms, whereas in our application an assertion is said to be more general than another on the basis of its definition. We define is-a as follows:

Definition 1. A class A is said to be is-a a class B if A contains fewer elements than B and all the elements contained in A are also contained in B.

It is easy to see that our definition generates a number of redundant closure arcs: if a class A is-a B, it is also is-a all the other classes in an is-a relation with B. But this information is maintained only in the intermediate representation; when the hierarchy is translated into OWL, this redundant information is not exported. Fig. 5 shows a hierarchy obtained with Definition 1; for the sake of clarity, redundant relations have been removed.
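A minimal sketch of Definition 1 and of the pruning of redundant closure arcs is given below, with each class definition modeled as a set of assertion strings (the class names and assertions are hypothetical):

```python
from itertools import permutations

def is_a(a, b) -> bool:
    """Definition 1: A is-a B when A contains fewer elements than B
    and every element of A is also contained in B."""
    return len(a) < len(b) and set(a).issubset(b)

def build_hierarchy(defs: dict) -> set:
    """Connect every ordered pair of candidate classes related by Definition 1."""
    return {(x, y) for x, y in permutations(defs, 2) if is_a(defs[x], defs[y])}

def drop_redundant(arcs: set) -> set:
    """Drop transitive closure arcs (kept only in the intermediate representation,
    never exported to OWL): x->y is redundant if some z gives x->z and z->y."""
    sources = {x for x, _ in arcs}
    return {(x, y) for (x, y) in arcs
            if not any((x, z) in arcs and (z, y) in arcs for z in sources)}

# Hypothetical assertion sets for three candidate classes.
defs = {
    "C1": {"subClassOf Message"},
    "C2": {"subClassOf Message", "order only Order-I"},
    "C3": {"subClassOf Message", "order only Order-I", "max 0 discount"},
}
arcs = build_hierarchy(defs)           # includes the redundant arc ('C1', 'C3')
print(sorted(drop_redundant(arcs)))    # [('C1', 'C2'), ('C2', 'C3')]
```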
Fig. 5. The first hierarchy applied to the definitions of clusters.
The last phase of the process is the optimization of the hierarchy. If multiple classes share the same set of sub-classes, the hierarchy can be simplified. As stated in Section 3.2 (and shown in assertion (1)), a cluster is defined as a complex class, i.e. a class defined by means of other classes or property restrictions. We treat the single complex assertions as class properties and use Definition 1 for managing subsumption. The optimization process starts from the more specific class definitions, and is managed in three steps, sketched in the example after the list below:
– find, if it exists, an equal class definition;
– compare with the other definitions and create an is-a relation if the definition is more general;
– insert the class definition into a list so as to avoid further checks.

Note that once the process has been performed on a class definition, more general definitions do not have to take this class into account during the second step, i.e. the setting of is-a relations. This way, the computational complexity of the process is strictly related to the number of class definitions to be evaluated. Fig. 6 shows the application of the optimization steps to the hierarchy in Fig. 5. The overall effect is to reduce the number of assertion lines, because classes sharing the same assertions are factored out, i.e. defined as super-classes of a sub-class composed of the shared assertions.
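The three steps could be sketched roughly as follows, reusing the set-of-assertions encoding from the previous sketch; this is our own simplification, not the system's actual code:

```python
def is_a(a, b) -> bool:
    """Definition 1 (repeated from the previous sketch)."""
    return len(a) < len(b) and set(a).issubset(b)

def optimize(defs: dict):
    """Process class definitions from the most specific (largest) to the most general,
    folding equal definitions together (step 1), linking is-a arcs (step 2), and
    recording processed classes so later definitions skip them (step 3)."""
    canonical = {}    # frozenset of assertions -> representative class name
    arcs = set()      # is-a relations created during step 2
    processed = []    # classes that more general definitions no longer re-examine
    for name in sorted(defs, key=lambda n: len(defs[n]), reverse=True):
        definition = frozenset(defs[name])
        if definition in canonical:            # step 1: an equal definition exists,
            continue                           # so this class is folded into it
        for other in processed:                # step 2: link if we are more general
            if is_a(definition, defs[other]):
                arcs.add((name, other))
        canonical[definition] = name
        processed.append(name)                 # step 3: no further checks needed
    return canonical, arcs

defs = {
    "C3": {"subClassOf Message", "order only Order-I", "max 0 discount"},
    "C3bis": {"max 0 discount", "order only Order-I", "subClassOf Message"},
    "C1": {"subClassOf Message"},
}
canonical, arcs = optimize(defs)
print(sorted(arcs))          # [('C1', 'C3')]  -- C3bis was folded into C3
```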
Fig. 6. An example of an optimized hierarchy.
Note that the simplification of the hierarchy does not involve changes in the conceptualization, but only in the model expression. The expressive power of a simpler hierarchy is exactly the same as that of a more structured one. The differences concern only metadata production and querying: a more structured ontology allows a complex metadata assertion to be produced in fewer steps than a simpler ontology, and may require less complex queries for accessing the data. For these reasons, in our approach the user can choose whether or not to execute the hierarchy optimization, according to the application requirements.
Conclusions

In this paper, we have presented some techniques enabling the extraction of knowledge from XML data. Our approach is aimed at extracting hierarchies of typical XML messages (i.e., typical data items) from a flow of business transactions. First, we cluster data items, building an intermediate knowledge representation; then, our intermediate representation is lazily compared to an initial normative schema, obtaining a more detailed specification of the domain, expressed as OWL complex class definitions. We intend to develop this approach toward a complete bottom-up approach for building ontology schemata on the basis of e-business data interchange, capable of checking and/or updating existing domain ontologies used by the e-business infrastructure.
Acknowledgments

This work was partly funded by the Italian Ministry of Research Fund for Basic Research (FIRB) under projects RBAU01CLNB_001 "Knowledge Management for the Web Infrastructure" (KIWI) and RBNE01JRK8_003 "Metodologie Agili per la Produzione del Software" (MAPS).
References

1. Andersson M.: Extracting an Entity Relationship Schema from a Relational Database through Reverse Engineering. LNCS, vol. 881, Proceedings of the 13th International Conference on the Entity-Relationship Approach, (1994).
2. Ceravolo P., Nocerino M. C., Viviani M.: Knowledge extraction from semi-structured data based on fuzzy techniques. Eighth International Conference on Knowledge-Based Intelligent Information & Engineering Systems (KES 2004), Wellington, New Zealand, (2004) 328–334.
3. Chalupsky H.: OntoMorph: A translation system for symbolic logic. In Cohn A. G., Giunchiglia F., Selman B., editors, KR2000: Principles of Knowledge Representation and Reasoning, San Francisco, CA, Morgan Kaufmann, (2000) 471–482.
4. Damiani E., Nocerino M. C., Viviani M.: Knowledge Extraction from an XML Data Flow: Building a Taxonomy based on Clustering Technique. EUROFUSE Workshop on Data and Knowledge Engineering (EUROFUSE 2004), Warszawa, Poland, (2004) 22–25.
5. Heflin J., Hendler J.: Dynamic ontologies on the web. In Proceedings of the Seventeenth National Conference on Artificial Intelligence, AAAI/MIT Press, Menlo Park, (2000) 443–449.
6. Horrocks I., Sattler U., Tobies S.: Practical reasoning for very expressive description logics. J. of the Interest Group in Pure and Applied Logic, 8(3), (2000) 239–264.
7. Koushik S., Joodi P.: E-Business Architecture Design Issues. IT Professional, vol. 2, num. 3, IEEE Educational Activities Department, Piscataway, NJ, USA, (2000) 38–43.
8. Maedche A., Staab S.: Ontology Learning for the Semantic Web. IEEE Intelligent Systems, (2001).
9. Popa L., Velegrakis Y., Miller R. J.: Translating Web Data. In Proceedings of VLDB 2002, (2002) 598–609.
10. Rodriguez-Gianolli P., Mylopoulos J.: A Semantic Approach to XML-based Data Integration. In ER, volume 2224 of Lecture Notes in Computer Science, (2001) 117–132.
11. Reynaud C., Sirot J. P., Vodislav D.: Semantic Integration of XML Heterogeneous Data Sources. In IDEAS, IEEE Computer Society, (2001) 199–208.
12. Visser P. R. S., Jones D. M., Bench-Capon T. J. M., Shave M. J. R.: An analysis of ontological mismatches: Heterogeneity versus interoperability. In AAAI 1997 Spring Symposium on Ontological Engineering, Stanford, USA, (1997).
13. SOAP Version 1.2, W3C Recommendation, 24 June 2003, http://www.w3.org/TR/soap/
14. ebXML Specifications, http://www.ebxml.org/specs/
15. OWL Web Ontology Language Overview, W3C Recommendation, 10 February 2004, http://www.w3.org/TR/2004/REC-owl-features-20040210/