An XML Framework Proposal for Knowledge Discovery in Databases

Petr Kotásek, Jaroslav Zendulka
{kotasekp, zendulka}@dcse.fee.vutbr.cz
Brno University of Technology, Department of Computer Science and Engineering, Božetěchova 2, 612 66 Brno, Czech Republic
Abstract. In recent years, the XML language has been receiving much interest in the IT community. It has many properties that make it a strong candidate for the representation of different kinds of data. In this paper we propose an XML framework for the domain of knowledge discovery in databases. This is not a specification document; rather, this article tries to be a compendium of ideas and remarks concerning the broad area of knowledge discovery in databases (KDD). It tries to identify some common high-level problems of this area from a higher perspective and then to outline a possible solution, showing an example with the definition of data interfaces for the respective KDD steps using XML.
1 Introduction
There are huge amounts of data stored in various repositories (databases), and it is beyond human capabilities to process them reasonably. It is no longer possible for us to look at a database, see useful patterns in the data, and consequently derive potentially useful knowledge from our observation. Knowledge discovery in databases addresses this problem by developing intelligent techniques for the automated discovery of useful and interesting patterns (called knowledge) in databases. The main effort in the knowledge discovery community has so far been devoted to the development of efficient mining algorithms, but there have not been many contributions to a complete solution of the problem. There is no set of widely accepted techniques and methodologies to support the entire process. Many knowledge discovery systems exist, and each of them uses its own methodology. This is quite understandable, as most of the systems were designed for a rather narrow application area (e.g., healthcare, business or image data analysis). Knowledge discovery in databases is a complex, interdisciplinary, data-centered and human-centered task. So, on one hand, it is naturally desirable to have a unifying platform
(preferably built on formal basics) for the process. On the other hand, this inherent complexity makes the development of such a framework very difficult, if not impossible. However, the need for a systematic description of the knowledge discovery process has been recognized in the KDD community. In this paper, we try to summarize some of the problems that could be addressed by the availability of such a unifying view, and outline a proposal of an implementation-level solution exploiting XML. The remainder of this paper is organized as follows: Section 2 summarizes some high-level problems present in the KDD domain today and suggests a solution through the use of ontologies. Section 3 describes the XML language; readers familiar with XML can skip this section. Section 4 presents ideas on how XML could be used in the KDD process. We conclude and outline future work in Section 5.
2 Some common high-level problems in KDD
We believe that it is worth trying to propose a framework for a systematic approach to the KDD process. As knowledge discovery is a wide, open and evolving topic, the solution must reflect its needs; it has to be open and extensible, too. Some problems that might be addressed by the unifying framework are outlined below:
1. It may seem surprising, but we still do not have a precise definition of the basic terms and concepts appearing in the area. A verbal definition of basic terms (like knowledge, pattern, interestingness, etc.) can be found in [1]. However, it would be better to have a definition of these concepts and their relations based on formal approaches.
2. The KDD process requires the user to deal intensively with huge amounts of data. This happens especially during the preprocessing step, when target data have to be identified, integrated and cleaned. A conceptual view of the raw data is needed to navigate through the data. Different techniques are also used to process the data; typically, mathematical statistics is used extensively.
3. Next, the data mining task has to be identified and the proper data mining method chosen. These two steps (2 and 3) are highly iterative: a formal description of the various data mining tasks and methods, together with a formal conceptual view of the data, would help in identifying the task and matching it to the data.
4. The results have to be presented in a human-readable form. This is usually accomplished by a combination of graphical and textual primitives. A formal definition of the different kinds of
results would provide for their easier management and transformation (e.g., a classification tree to rules).
5. When some of the results are identified as knowledge, it should be possible to manipulate it the way we are used to manipulating knowledge: share it, consolidate it, report it to interested parties, etc.
6. It is natural for domain knowledge to be used by an expert to guide the process, especially during the initial data preprocessing and during the final knowledge identification. These days, the domain knowledge usually resides in an expert's brain. It would be convenient to be able to use domain knowledge stored in a knowledge base, integrate it with the newly discovered knowledge coming out of the KDD process, and possibly refine it.
We believe that the above problems might be addressed by creating a system of formal ontologies for the knowledge discovery domain. For a detailed discussion of ontologies see [2] or [3]. For our purposes, we can say that an ontology is a system that defines categories of things in the domain of interest and their mutual relationships, possibly in an axiomatic way. For example, in [2] the KIF (Knowledge Interchange Format) language was proposed to interchange knowledge among disparate programs and to create ontologies (see their Ontolingua Server [4]). Ontologies play an important role when we want to describe a domain and to process, communicate or share knowledge about it. For example, it is obvious that ontologies describing the data being analyzed are essential during the preprocessing phase. The conceptual description of the data (mentioned under problem 2 above) is a must: it enables navigation through and management of the data during the KDD process, to say the least. Similarly, if we had an ontology describing the characteristics of the KDD process itself, we could try to address the above problems.
We can either start building this ontology and all the supporting mechanisms from scratch, or we can use some existing technology. One such technology is being developed by the Knowledge Sharing Effort (KSE) [2], which is creating an ontology library that includes ontological descriptions of various domains. Moreover, the fact that the KDD ontology, the ontology of the domain being explored, and all other supporting ontologies are built on the same platform will bring several primary benefits. Firstly, it should be easier to integrate the conceptual description of the data and the domain knowledge of the domain under exploration into the KDD process. Secondly, it should enable domain knowledge to be used in the KDD process and
discovered knowledge to be incorporated back into the domain knowledge. Thirdly, if any supporting ontologies are present, they too can be used in the process (typically, for graphical representation or statistical evaluation). This whole idea is briefly depicted in Fig. 1.
Fig. 1. The idea of the ontological library with KDD ontologies
So now we can suppose that we have a huge library containing ontologies for the domain of interest, as well as ontologies of areas that support the knowledge discovery process in various ways. We are still left with the task of defining the ontologies that describe the domain of knowledge discovery itself (with the idea of easy integration and use of other ontologies within the knowledge discovery process in mind). Let us show how these ontologies, together with those existing in the ontology library, might help with the problems listed at the beginning of this section. Firstly, to solve problem 1 above, we have to create an ontology covering the intrinsic concepts like knowledge, interestingness, etc. It will be used by the subsequent ontologies, especially those describing different knowledge types (classification, association rules, etc.). As for problem 2, the conceptual description of the data is a part of the ontology for the domain under exploration, and mathematical statistics might be covered by one of the supporting ontologies. To address problem 3, we will need ontologies describing the characteristics of the different data mining tasks, the data mining methods, and the architecture of the desired results (which is tightly coupled with the data mining task).
Regarding problems 4 and 5, if we represent the newly discovered knowledge in compliance with a previously defined ontology belonging to a family based on a widely accepted technological platform (like that proposed in [2]), it will be possible to integrate it with domain knowledge (provided the latter is represented by an ontology built using the same platform). The whole architecture should be open to changes and extensions. Formal ontologies will be defined using some formal language like KIF. However, these languages are not meant to be used on the implementation level. Rather, we should use a more suitable format for physical data, for example XML. Figure 2 shows a possible architecture of ontologies defining different knowledge types (and thus addressing problems 4 and 5). The overall structure is hierarchical, with the basic terms and basic knowledge types ontologies on top; these are general-purpose ontologies. It is obvious that at least the two bottom ontologies will play a physical role in the KDD process; they represent the discovered knowledge, the final product. Therefore, they will have to be implemented using XML. In the remainder of this paper, we will propose the possible use of XML in the context of the whole KDD process.
Fig. 2. An example of ontologies for description of knowledge types
3 The XML language
3.1 Brief History of XML
The Extensible Markup Language (XML) is a simplified subset of the Standard Generalized Markup Language (SGML). The main goal of SGML is to provide a mechanism for platform-independent representation of structured data. Unfortunately, SGML is very complex and
therefore the cost of its implementation is high. The Hypertext Markup Language (HTML), on the other hand, has very poor representational capability. It is oriented purely towards presentation and, as such, is becoming insufficient for the growing demands of the World Wide Web. Therefore, an initiative was started by the World Wide Web Consortium (W3C) [5] to build XML. It is much simpler than SGML (and therefore easy to implement), yet still powerful enough to represent structured data. It has been recognized by many researchers as a promising solution to their problems.
3.2 Brief Description of the XML Concept
XML is a method for putting structured data in a text file. Actually, it has evolved into a whole family of technologies. Although XML files are not primarily meant to be read by humans, they are easy to read. So an expert, or a programmer debugging an application, can use a simple text editor to inspect XML files or even repair them. XML looks very much like HTML but has nothing to do with it. HTML only marks up the data for a browser to visualize it, whereas XML purely represents the logical structure of the document, regardless of its possible future presentation form. Tags in HTML are predefined and users can do nothing about it; in other words, HTML is an application of SGML. In XML, users can define their own tags and attributes (the grammar of an XML document); in other words, XML is a subset of SGML.
3.3 Advantages of XML
XML users will find many advantages depending on their field of interest, but some general advantages become obvious in every application domain.

• Platform Independence. XML is a text format that can be displayed or processed on any device. The device only needs to know the Document Type Definition (DTD) of the given XML document. A DTD defines the grammar with which the XML document has to comply. If the DTD is publicly available (e.g., on the WWW), the device can retrieve it, parse the document and transform it in any desired way.
• Robustness. XML documents have to be well-formed: each starting tag must have a corresponding closing tag, there must be only one root element, and so on. As the format is textual, it is also more resistant to transport errors.
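The well-formedness constraint is easy to check mechanically; any conforming XML parser must reject a document that violates it. A minimal sketch using Python's standard xml.etree module (the helper function and the two sample documents are ours, purely for illustration):

```python
import xml.etree.ElementTree as ET

def is_well_formed(document: str) -> bool:
    """Return True if the document parses as well-formed XML."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False

# Matching start/end tags, single root element: well-formed.
print(is_well_formed("<root><item>value</item></root>"))  # True
# The closing </item> tag is missing: the parser must reject this.
print(is_well_formed("<root><item>value</root>"))         # False
```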
• Extensibility. Together with platform independence, this is another important feature. Via DTDs, XML serves as a metalanguage for the definition of other languages. Moreover, one of the XML technologies, the Extensible Stylesheet Language (XSL), provides means for transforming XML documents; a part of XSL called XSL Transformations (XSLT) is an XML vocabulary for expressing such transformations. It is then not a problem to take an XML tree and convert it to a completely different one.
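Since XSLT syntax is beyond the scope of this paper, here is a minimal procedural sketch of the same idea, taking one XML tree and producing a completely different one, written with Python's standard xml.etree module; in practice an XSLT stylesheet would express this declaratively. The element names are invented for illustration:

```python
import xml.etree.ElementTree as ET

# Source document in one (hypothetical) vocabulary.
src = ET.fromstring("<items><item>beer</item><item>nappies</item></items>")

# Build a completely different target tree (an HTML fragment) from it,
# mirroring what an XSLT template rule would do declaratively.
html = ET.Element("ul")
for item in src.findall("item"):
    li = ET.SubElement(html, "li")
    li.text = item.text

print(ET.tostring(html, encoding="unicode"))
# <ul><li>beer</li><li>nappies</li></ul>
```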
4 Using XML in the Knowledge Discovery Process
Let us describe the potential of XML in the domain of knowledge discovery and data mining. The process of knowledge discovery in databases consists of several steps, shown in Figure 3.
Fig. 3. The Knowledge Discovery Process
Its goal is to retrieve potentially useful and usable information (knowledge) from large amounts of data, and to use this knowledge for decision support, marketing, etc. It would be very nice if we could run the data mining algorithm directly against the raw data in the database. Unfortunately, many steps have to be taken before the actual mining can take place:
1. Relevant data have to be selected; this is joint work for the data mining expert and the domain expert. At this point we assume that we have already decided which type of knowledge we want to discover and which particular data mining algorithm we will use. Different knowledge types need different algorithms, and different algorithms require different data. If we ran a data mining algorithm against irrelevant data, we could
receive useless results (which is the better case) or even results that are confusing and therefore potentially harmful if applied in the real world (which is the worse case).
2. Once the relevant data are known, they have to be preprocessed and transformed into a shape that the data mining algorithm will understand. A typical preprocessing activity is the elimination of erroneous data (data cleaning). In the vast majority of experiments, transformation involves extracting data from the database source and saving it in plain text files. This is a task for the data mining expert and, without a doubt, it is the most time-consuming part of the whole process, as the formats of these files are proprietary to the given mining algorithms.
3. Now we can run the data mining algorithm against the data. It produces results (again in some proprietary format) that have to be visualized somehow and interpreted by domain experts.
Actually, the three activities mentioned above under 1 and 2 (selection, preprocessing and transformation) overlap in the real process. Selection can be understood as the identification of relevant data without actually touching it; preprocessing and transformation are activities that deal with the physical data. Most researchers in this field concentrate on developing new data mining techniques, which is only one step along the long path. To our knowledge, little attention has been paid to the formats of the data being processed.
4.1 An XML Framework Proposal for Data Interfaces
During the discovery process, data travels through many stages, each with well-defined functionality. It would be convenient to define the input data formats (we will refer to them as data interfaces) for these stages using some platform-independent, robust, extensible and human-readable technology. What a task for XML! If we decide to use XML, we can define XML data interfaces for the respective knowledge discovery steps. Then the overall architecture might look like this (please refer to Figure 3):
Data Selection. The target data have to be extracted from the database system and stored in XML format. We will call this data interface XML-TargetData. There are basically two ways to do this. In the first scenario, an application has to be written that accesses the database, retrieves data through the standard interface provided by the DBMS (typically, SQL) and converts the data into XML. This approach requires additional coding. Fortunately, major database
system vendors are beginning to realize the importance of XML and are starting to incorporate XML interfaces into their database engines, so the extraction of data in an XML format should become a straightforward process in the near future.
Data Preprocessing. The data preprocessing phase receives data in the XML-TargetData format. The data is checked, cleaned and otherwise processed where possible, and the output is created in the XML-PreprocessedData format. This interface contains data that is semantically ready to be used by the data mining algorithm, but whose syntactic structure might differ from the syntax understood by the algorithm.
Transformation. In the original architecture from Figure 3, the transformation phase was a single, highly specialized procedure. Imagine this scenario: the preprocessed data are in a text format. This is quite understandable, because
preprocessing deals with data checking and
cleaning, so it is desirable that the data mining expert can read the format easily. Unfortunately, the data mining algorithm expects data in its own proprietary binary format. So someone has to sit down and code a program that converts the human-readable text format into the algorithm-friendly binary format. With XML, the transformation step is just a simple conversion from XML-PreprocessedData to XML-TransformedData, the input interface for the data mining algorithm. XML technology provides instruments for the simple transformation of XML documents into one another (recall XSLT from Section 3.3). Actually, with XML there is no single transformation step in the discovery process any more. Rather, many local transformations can be performed easily and effectively on each data interface using simple and straightforward XSLT transformations.
Data Mining. The data mining algorithm takes the data in the XML-TransformedData format and mines it for knowledge. The output of this step is in the XML-Patterns data format.
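As a rough illustration of such a local transformation, the sketch below restructures a hypothetical XML-PreprocessedData fragment into an equally hypothetical XML-TransformedData layout using Python's standard xml.etree module. All element names are our own invention, since neither interface is formally specified here; a real pipeline would typically express this as an XSLT stylesheet:

```python
import xml.etree.ElementTree as ET

# A hypothetical XML-PreprocessedData fragment (illustrative names only).
preprocessed = ET.fromstring("""
<PreprocessedData>
  <Record tid="1"><Item>beer</Item><Item>nappies</Item></Record>
  <Record tid="2"><Item>crackers</Item></Record>
</PreprocessedData>
""")

# Restructure into a hypothetical XML-TransformedData layout expected
# by the mining algorithm: one Transaction element per Record, with the
# items flattened into the element text.
transformed = ET.Element("TransformedData")
for record in preprocessed.findall("Record"):
    t = ET.SubElement(transformed, "Transaction", id=record.get("tid"))
    t.text = " ".join(item.text for item in record.findall("Item"))

print(ET.tostring(transformed, encoding="unicode"))
```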
Interpretation/Evaluation. This step demonstrates the full strength of XML. We receive data through the XML-Patterns interface and want to visualize it in a human-friendly fashion. Using XSLT (or any other transformation tool or program), the data can be transformed into any data format and displayed by different programs. Nowadays, all the end-user programs use their own proprietary formats (.doc, .xls, .jpg, …). If we manage to create a widely accepted and respected XML vocabulary for a specific domain, we will only be left with the task of defining transformations from/to these proprietary formats. Some steps are
being taken by the W3C in this field (e.g., the Precision Graphics Markup Language (PGML), the Vector Markup Language (VML), the Document Definition Markup Language (DDML), and the Mathematical Markup Language (MathML)). Consequently, it is desirable to create markup languages for data representation on all the data interfaces of the knowledge discovery process. Another thing worth mentioning here is knowledge transformation. It is often necessary to transform one knowledge representation into another, e.g. a classification tree into association rules. Again, this poses no problem for XML and its transformation capabilities. In the previous paragraphs, we have assumed that the data would be extracted from the database and stored in XML format between successive steps. We have accepted this assumption to emphasize the individualistic character of the respective KDD steps. In real-world applications, this approach would result in an unacceptable waste of space. Why should we store the same (or almost the same) data twice? Rather, the data will be transformed into XML on the fly while being read from the database. Actually, only the final product of the whole process, the XML-Patterns data, is expected to be stored permanently for future use. Figure 4 shows the KDD process again, but now with the corresponding XML data interfaces.
Fig. 4. The Knowledge Discovery Process with XML Data Interfaces
Here is a summary of the benefits resulting from the usage of XML in the knowledge discovery process:
1. Each step in the process has its input and output XML data interfaces. If the input and output interfaces of two consecutive steps do not match, the data can be transformed easily using XSLT transformations.
2. Consequently, it is possible to combine different discovery components to perform the whole process. This feature is most appreciated in the data mining step for testing purposes: it is now easy to compare different mining algorithms against each other. All that has to be done is a transformation from the XML-TransformedData interface into the
proprietary format (which can but does not necessarily have to be XML) of the data mining algorithm.
4.2 An Example: XML-Patterns Interface
There are different types of knowledge patterns that can be discovered in data: data generalization, summarization, characterization, association rules, classification trees, clustering analysis, regression analysis, time series, web mining, path traversal patterns (mining for user access patterns in interactive systems), etc. We will give a demonstration for association rules. An association rule is an expression of the form X ⇒ Y, where X and Y are sets of items. The intuitive meaning of such a rule is that itemsets that contain X as their subset tend to also contain Y. Association rules are typically mined in relational databases. The typical application is the analysis of sales data; the databases contain huge amounts of transactions, typically consisting of the transaction date and the items bought in the transaction. One of the well-known examples is the association rule nappies ⇒ beer. The idea behind this strange rule is that fathers who were sent out by their wives to buy nappies decided to reward themselves for their heroic performance by buying beer. When a good market specialist sees this rule, he or she immediately moves the beer, together with crackers, closer to the nappies to satisfy the thirsty husbands' temptation. A fraction of an XML document representing this situation, and the DTD to which this document conforms, might look like the one in Table 1. It is a very idealized and unrealistic example; it only says that the association rule exists, nothing more. In real-world applications, there are other data associated with the rules, typically some metrics that measure the value of a rule. Moreover, association rules can have different forms: quantitative, generalized, fuzzy, etc. The XML document would have to be able to accommodate all these variants.
Table 1. An XML and DTD Example for the XML-Patterns Interface

XML Document:

<Knowledge>
  <AssociationRule ruleid="1">
    <Antecedent>
      <AntecedentItem><Name>nappies</Name></AntecedentItem>
    </Antecedent>
    <Consequent>
      <ConsequentItem><Name>beer</Name></ConsequentItem>
    </Consequent>
  </AssociationRule>
</Knowledge>

DTD:

<!ELEMENT Knowledge (AssociationRule+)>
<!ELEMENT AssociationRule (Antecedent,Consequent)>
<!ATTLIST AssociationRule ruleid CDATA #REQUIRED>
<!ELEMENT Antecedent (AntecedentItem+)>
<!ELEMENT AntecedentItem (Name)>
<!ELEMENT Consequent (ConsequentItem+)>
<!ELEMENT ConsequentItem (Name)>
<!ELEMENT Name (#PCDATA)>
Knowledge is the root element. It can include one or more AssociationRule elements. Each AssociationRule element has a unique identifier, ruleid, and consists of one Antecedent and one Consequent. Each Antecedent and Consequent must have one or more AntecedentItem and ConsequentItem elements, respectively. Each AntecedentItem and ConsequentItem has one Name, which is a string.
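To illustrate how easily such a document can be consumed by tools, the sketch below parses an XML fragment following the structure just described and extracts the rule, using Python's standard xml.etree module:

```python
import xml.etree.ElementTree as ET

# An XML-Patterns fragment following the structure described above.
doc = ET.fromstring("""
<Knowledge>
  <AssociationRule ruleid="1">
    <Antecedent>
      <AntecedentItem><Name>nappies</Name></AntecedentItem>
    </Antecedent>
    <Consequent>
      <ConsequentItem><Name>beer</Name></ConsequentItem>
    </Consequent>
  </AssociationRule>
</Knowledge>
""")

# Walk the rules and print each one as "antecedent => consequent".
for rule in doc.findall("AssociationRule"):
    lhs = [n.text for n in rule.findall("Antecedent/AntecedentItem/Name")]
    rhs = [n.text for n in rule.findall("Consequent/ConsequentItem/Name")]
    print(f"rule {rule.get('ruleid')}: {lhs} => {rhs}")
# rule 1: ['nappies'] => ['beer']
```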
4.3 An XML Framework Proposal for Communication Interfaces
So far, we have used XML to define data interfaces, i.e. the formats of the input/output data for the given discovery steps. The usage of XML in this context is natural, as XML is designed for data representation. However, XML can also be put to good use in defining communication interfaces. We can assume that the respective KDD steps are performed by agents. By an agent, we mean a software application which is able to communicate with other applications (agents).
A computing environment is becoming more and more parallel and distributed, and this holds true for knowledge discovery as well. Therefore, it makes sense to think of the various KDD steps as tasks performed by specialized agents that expose their functionality to the rest of the world through an XML-defined communication interface. Typically, the data mining algorithm will reside on a computer and will define its communication interface through an XML document like the one shown in Figure 5, which describes AprioriItemset, "a data mining algorithm used for mining association rules among sets of items", together with the DTD for one of the input data interfaces accepted by AprioriItemset and the DTD for one of the output data interfaces produced by it.
Fig. 5. An Example of Communication Interface of the Data Mining Agent
The interface in Figure 5 says that there is a data mining agent called AprioriItemset located at BaseUrl. It can accept documents conforming to the DTD stored in the file named AprioriItemsetInput1.dtd, and its output conforms to the DTD stored in the file named AprioriItemsetOutput1.dtd. The AprioriItemset agent accepts several input parameters: the URL of the DTD to which the input XML data conforms, the URL of the XML input data, and the URL of the DTD for the output data. This example assumes that the communication between agents is built on top of the HTTP transport protocol. Again, this is a very simplified view of the problem, meant only to show how XML could be utilized.
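To make the idea concrete, the sketch below parses a hypothetical communication-interface document for the AprioriItemset agent using Python's standard xml.etree module. The element names and the example URL are our own invention, as the exact vocabulary of Fig. 5 is not reproduced here:

```python
import xml.etree.ElementTree as ET

# A hypothetical communication-interface document; element names and
# the URL are illustrative placeholders, not the actual Fig. 5 vocabulary.
interface = ET.fromstring("""
<Agent name="AprioriItemset">
  <BaseUrl>http://example.org/agents/apriori</BaseUrl>
  <Description>Mines association rules among sets of items.</Description>
  <InputInterface dtd="AprioriItemsetInput1.dtd"/>
  <OutputInterface dtd="AprioriItemsetOutput1.dtd"/>
</Agent>
""")

# A client agent can discover the peer's name, location, and the DTDs
# governing the documents it exchanges.
print(interface.get("name"))                        # AprioriItemset
print(interface.findtext("BaseUrl"))                # http://example.org/agents/apriori
print(interface.find("InputInterface").get("dtd"))  # AprioriItemsetInput1.dtd
print(interface.find("OutputInterface").get("dtd")) # AprioriItemsetOutput1.dtd
```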
5 Conclusion and Future Work
We have tried to show the potential of XML and related technologies in the domain of knowledge discovery in databases, in the context of a wider formal approach to the KDD process. There is no 'one and only' solution to this problem. In our approach, software applications (agents) are used to perform the successive knowledge discovery steps. These agents have to define their communication and data interfaces. As these interfaces are defined using XML, the environment is open and easily extensible. It is easy to build new components, and it should also be easy to accommodate existing ones. The XML solution resides on the implementation level. Above it, a general formal architecture is built by means of formal ontologies. These ontologies describe the data being analyzed (which comes naturally) and, newly, the domain of the KDD process itself. Moreover, with a formal platform, the following problems should become solvable (in addition to those mentioned in Section 2):
• integration of different KDD systems
• comparison of different KDD systems
• integration of KDD systems into existing environments
Thus, the future work will lie in the definition of a unifying approach that embraces the KDD process as much as possible. It will require a deep investigation of the whole area, the identification of its key concepts and their relationships, and their description via ontological structures. The Knowledge Interchange Format (KIF) is an example of a convenient formalism for describing these ontologies, and XML can serve as an implementation technology in a way similar to that briefly outlined in this paper.
References
1. U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, R. Uthurusamy, eds., Advances in Knowledge Discovery and Data Mining, AAAI Press/The MIT Press, 1996.
2. The Knowledge-Sharing Effort Consortium, http://www.cs.umbc.edu/kse
3. LADSEB-CNR, http://www.ladseb.pd.cnr.it/infor/ontology/ontology.html
4. Ontolingua Server, http://www-ksl-svc.stanford.edu:5915/
5. World Wide Web Consortium XML page, http://www.w3.org/XML