DMPML Data Mining Preparation Markup Language
Paulo M. Gonçalves Jr.
[email protected]
Adrian L. Arnaud
[email protected]
Roberto S. M. Barros
[email protected]
Center of Informatics, Federal University of Pernambuco
Cidade Universitária, 50.732-970 Recife, Brazil
Abstract
In this paper we propose the DMPML language as an alternative for standardizing the data preparation phase of a KDD process. DMPML is based on XML and uses XSL transformations to map raw data into processed data. DMPML features such as extensibility, robustness and platform independence support the efficient exchange of data preparation projects among DMPML producers. This promotes work reusability and the interchange of experience among similar projects.
1. Introduction
For years, institutions have been storing enormous amounts of data on magnetic devices. However, in spite of the collective awareness that these great volumes of data hold a huge amount of information on the business of the institutions, intelligent technologies capable of extracting useful knowledge from these masses of data were not available until the beginning of the 1980s. In response to this need, the KDD (Knowledge Discovery in Databases) process was proposed in 1989. This process has been defined as “The nontrivial extraction of implicit, previously unknown, and potentially useful information from data” [6].
The KDD process includes the following phases: domain exploration, data preparation, data mining and interpretation of the results. In the first phase, the problem and the solution space are explored. In the second phase, the data are selected, cleaned and transformed to serve as input data to the data mining phase. In the data mining phase, intelligent methods are applied to extract patterns that carry useful knowledge about the investigated problem. In the last phase, the extracted patterns are manipulated to generate knowledge that humans can interpret.
The development of standards to promote the reusability of the inputs and outputs of the different phases of the KDD process [13, 8] is a field of study that receives a lot of attention from researchers. Nowadays it is possible to find a significant number of emerging technologies based on XML (Extensible Markup Language) [4] specially created to provide a unified standard for the KDD process.
This paper proposes an alternative language, also based on XML, to standardize specifically the data preparation phase. It uses declarative transformation rules stored in XSL (Extensible Stylesheet Language) [5] files. This choice appears to be the most appropriate since XSL technology grows in parallel with XML technology. Besides, XSL transformations allow a direct mapping from raw data to processed data in XML files without additional software.
To apply this approach to real data preparation projects, it is necessary to elaborate a set of XML files especially tailored to allow direct use of XSL transformation rules. This is the main reason why DMPML (Data Mining Preparation Markup Language) was created. This XML-based language is described in greater detail in the following sections of this document.
The rest of this paper is organized as follows: Section 2 surveys some technologies applied to the KDD process. Section 3 makes a brief introduction to XML and its associated technologies, and justifies the use of XML in this work. Section 4 gives more details on the concepts and foundations of DMPML. Section 5 shows a practical example using the DMPML language. Finally, Section 6 presents our conclusions and proposes future work.
978-1-4244-1968-5/08/$25.00 ©2008 IEEE
Authorized licensed use limited to: IEEE Xplore. Downloaded on March 2, 2009 at 07:51 from IEEE Xplore. Restrictions apply.
2. Related work
This section surveys some proposed standards that can be applied to the KDD process. Some of them are based on the relational model and use query languages; others are based on XML.
2.1. Languages Based on the Relational Model
The Data Mining Query Language (DMQL), described in [10], is a query language used in a data mining system called DBMiner [9]. It adopts a syntax very similar to SQL, and its objective is the extraction of different types of knowledge, such as association rules, discriminant rules, classification rules, and characteristic rules, from relational databases and data warehouses at multiple levels of abstraction. DMQL provides the means to specify:
• relevant data,
• types of knowledge to be mined,
• previous knowledge to be used in the data mining process,
• limits and metrics to evaluate the discovered patterns,
• visual representation of the discovered patterns.
Two positive characteristics of DMQL are: (1) based on the results of queries, new queries can be executed interactively; and (2) it was developed to work with traditional databases. The main weakness of DMQL is that it is essentially centered on the extraction phase. So, to perform pre and post-processing operations, the language relies on SQL or additional tools, because it lacks the resources to perform these operations itself.
MSQL [11] is a query language for association rules created as an extension to SQL. Its main characteristics are:
• ability to nest SQL expressions, such as ordering and grouping, so that a query can be divided into many parts to ease the creation of the query commands.
• support for the closure property and availability of operands to manipulate the results of previously executed queries.
• creation of association rules based on data in response to a query. Based on a data set, MSQL returns a set of rules that satisfy the set chosen in the query. This is performed by the use of the GETRULES operand, which creates rules and saves them in a rule base.
• manipulation of the results of performed queries:
– search in an existing rule base (called rule mining). By the use of the SELECTRULES operand it is possible to search association rules in an existing rule base.
– cross-over between data and association rules, to make possible the identification of data subsets that satisfy or violate a given set of rules. This is performed by the use of the VIOLATE and SATISFY operators.
• basic support for pre and post-processing.
MSQL could be extended to better support the pre and post-processing steps. Although it supports the operators CREATE ENCODING (which provides discretization of continuous attributes) and SELECTRULES (which selects rules in a rule base), it does not support complex pre and post-processing operations such as sampling.
The MINE RULE [14, 15] operator was created as an extension to SQL. It supports the extraction of association rules from relational databases and their storage in a separate relation (it supports closure). The main characteristics of this operator are:
• selection of the relevant data set in the database (as in DMQL and MSQL), so the association rules will be based only on the chosen data.
• definition of the structure of the rules to be mined and the restrictions that will be applied to them. This way, only rules with specific characteristics will be selected.
• definition of which data can be part of an association rule.
• basic support for post-processing, with search and selection of found association rules.
As in MSQL, data pre-processing in MINE RULE is limited to operations that can be performed in SQL. It is not possible to obtain samples of data before extraction, and discretization is left to the user. A positive point is the well-defined semantics of its operations, which few other languages offer. For a detailed comparison of these three data mining languages, see [2] or [3].
2.2. Languages Based on XML
One of these standards is a query language called KDDML (KDD Markup Language) [19]. It is “a middleware language and system designed to support the development of final applications or higher level systems which deploy a mixture of data access, data preprocessing, extraction and
deployment of data mining models” [17]. Here, a model means the output of the data mining process, that is, the knowledge. KDDML was introduced [1] as an environment where the knowledge extraction problems (the input to the data mining phase) and their results (the output) were represented as XML documents. Later on, it was updated to support some operations of the preprocessing phase. The KDDML language has been designed by considering the KDD process as a query process, where the operations within a query can be nested. This means that with KDDML it is possible to interactively create queries, execute them, obtain results and process new queries on the results obtained. The queries can be executed by external programs or by supported operators implemented in an interpreter of the KDDML system. As stated in [17], “a KDDML query is an XML-document where XML tags correspond to operations on data/models, XML attributes correspond to parameters of those operations and XML subelements define arguments passed to the operators”.
Another XML-based initiative to standardize the KDD process is PMML (Predictive Model Markup Language) [18]. It started as a language to represent predictive models produced by data mining systems [7]. Later on, it was updated to support other data mining models, such as decision trees, neural nets, polynomial regression and others [20]. It is based on XML, like KDDML, and focuses on the output of the data mining phase of the KDD process, describing the inputs to data mining models, the transformations used to prepare data for data mining, and the parameters which define the models themselves. It is being developed by the Data Mining Group (DMG), which is formed by developers of data mining systems such as IBM, Microsoft, SPSS and Oracle. The KDDML and PMML approaches are both relevant initiatives that contribute to the standardization and formalization of the KDD process.
However, both have the same limitation: the standardization effort usually concentrates on data mining techniques and algorithms, data mining query languages, knowledge semantics, optimization techniques, post-processing, etc. Little attention is given to the data preparation phase, especially to the cleaning step. We believe this is a serious mistake, since the data preparation phase is responsible for approximately 80% of all effort in a KDD process applied to real problems [16]. It is in this phase that the information is consolidated and the complexity of the data mining tasks is reduced. If the data are not prepared properly at this stage, the whole subsequent process may become a huge waste of time, energy and money.
The first relevant initiative especially created to standardize the overall KDD process¹ using XML-based technologies is the DMSL (Data Mining Specification Language) approach [12]. DMSL identifies five main primitives that play the major roles in the KDD process, represented by five sections in a DMSL document:
• The data model: represents a data schema that defines the shape of the initial input data to be mined, together with other data mining specific information like data type, data form, granularity, and data scale.
• The data mining model: defines transformations of the initial input data into whatever shape is needed for data mining. This is where everything about the data preparation and transformation is stored.
• The domain knowledge: knowledge that can be used by a data mining task.
• The description of the data mining task: specifies a data mining task over a data mining model.
• The knowledge: contains the result of the data mining task.
DMSL is based on a theoretical framework that can be used to represent the whole KDD process. The framework is built on three mathematical pillars:
• relations are used to represent data and data mining matrices (and their fields);
• graphs are used to capture the structure of matrix and field dependencies;
• functions are used to realize executive functionality and to express all the existing internal relationships within the framework.
Although DMSL is based on a sound theoretical framework, to the best of our knowledge there is no software implementation of DMSL available yet.
We can see that many languages have been proposed to standardize different phases of the KDD process. Yet, more than 15 years after the definition of the knowledge discovery process, none of these technologies is in wide use. The benefits of having a standard language to represent the whole process are obvious. A good approach to achieve a standard is to analyze the proposed languages (like the ones presented above), compare their advantages and disadvantages, and combine them, trying to keep their strengths and minimize the effects of their weaknesses.
¹ With the emphasis on the data preparation and transformation steps.
3. The XML Language
3.1. XML Concepts
XML is a language which has become very popular as a means to store, in text files, both data and the meaning of these data. With XML it is possible to represent the
structure of the document, with no regard to the way it is to be presented. The data are organized in elements, which are similar to HTML tags. However, in HTML the tags are predefined and immutable. In XML it is possible to redefine and extend the definition of elements and their attributes. These characteristics make XML an extensible metalanguage: a language that is used to define other languages and is not static. One of the most popular applications of XML documents is to use them as a means to transfer data among different applications. Also, because XML documents are text files, their contents can be read using any text editor.
3.2. XML and Data Preparation in KDD
XML files seem a good alternative for the standardization of data preparation in KDD because they offer flexibility and an adequate structural organization, giving persistence and good documentation, in an efficient and cheap form, to all the sub phases of data preparation. Besides this, it is possible to create schemas to validate the content of the files, using DTDs (Document Type Definitions) and/or XML Schemas, and to transform the content of an XML file using declarative rules defined in XSL files. DTDs and XML Schemas are useful to reach the desired level of standardization and formalization, while the use of XSL files is an efficient way to eliminate the need to develop specific code to implement data transformation.
Even with these benefits, some concerns arise. The most frequent is the size of the XML documents. XML documents always use more space than the corresponding raw data because we attach to each value its semantics, typically using elements and attributes. With the capacity of magnetic disks increasing every day and disk access times decreasing, this should not be a big problem. Data compression, which achieves very good ratios on text files, is another solution.
4. DMPML
The data preparation phase is subdivided into three sub phases: data selection, cleaning and data enrichment, and data codification. In the data selection sub phase the primary objective is to choose only the relevant attributes from the complete set of attributes available in the raw data sources. The selected subset is sent to the mining algorithm. The main motivation for this selection is to optimize the processing time of the mining algorithm by shortening the search base.
The cleaning sub phase includes a verification of information consistency, the correction of possible mistakes, and the insertion or deletion of null or redundant values. In this phase, duplicate and/or corrupted data are identified and removed. The execution of this phase corrects data, eliminating unnecessary queries that would be executed by the mining algorithm and would affect its processing and efficiency. Data enrichment consists of aggregating more information into the actual data so that it can contribute to the process of knowledge discovery. In other words, data enrichment is any process capable of increasing the quality of the available information, where the main goal is to improve the efficiency of the mining algorithm.
In the codification sub phase, the goal is to transform data so that they can be processed by the mining algorithm, which can be a decision tree, an artificial neural network, a clustering algorithm, etc. A conventional neural network, for example, normally accepts as input only numerical values (or scalars) between 0 and 1, or between -1 and 1, depending on the activation function of the artificial neurons inside the layers of the artificial neural network (ANN). Because of this restriction, the numerical attributes are usually mapped to normalized numerical values inside the required interval. The categorical attributes are coded into binary representations using, for example, a binary representation of length M with N bits turned on (equal to 1). In decision trees, the numerical attributes usually need to be “categorized” using percentiles or nominal interval representations. The number 10, for example, could correspond to the interval named “from 0 to 20”.
DMPML aims to cover all these data preparation phases. To reach this goal, DMPML proposes the creation of four different types of XML files. These are:
• IDDP (Input Data for Data Preparation). This file consists of the original data selected and extracted from any type of data source. This data source may be a relational database, a data warehouse, a data mart or even a text file with some kind of formatting for its fields and records. The IDDP files hold the original data that will be processed and later transformed into an appropriate input to a data mining algorithm.
• DPDM (Data Processing for Data Mining). To process and transform the information contained in IDDP files, it is necessary to have a data preparation project adequate to the relevant data. Data preparation projects encapsulate directives and processed data and are stored in DPDM files. A single DPDM file may hold many data preparation projects for different IDDP files. An IDDP file can only be processed and transformed if a data preparation project was specifically built to process it.
• XSL Transformation file. The transformation or mapping of values from IDDP files is performed with
the application of an XSL file. This XSL file is expected to contain declarative rules that map the original values contained in the IDDP files into data ready to be used in data mining. According to the structure anticipated by DMPML, each XSL file contains the transformation rules necessary to generate input data for one type of data mining. This way, there should be one specific XSL file to generate transformed data for neural networks, another for induced association rules, another for induction trees, and so on.
• IDDM (Input Data for Data Mining). IDDM files have an internal structure that is very similar to that of the IDDP files. Nevertheless, IDDM files store transformed data ready to be used as input data by any mining algorithm. IDDM files are generated from IDDP files that are processed and transformed using the information and transformation directives contained in DPDM and XSL files.
Figure 1 shows the sequence in which the phases of data preparation are executed to generate an IDDM file based on an IDDP file. In the following subsections, the contents of the IDDP, DPDM and IDDM files are described in more detail.
4.1. IDDP Documents
The IDDP files of DMPML store the original (not processed) data, usually extracted from a relational database (RDB), a data warehouse (DW) or even a text file. IDDP files provide the input data for the generation of data preparation projects contained in DPDM files. Every IDDP file contains a root element named . Inside this root element, there are two other elements: and . The element contains information about the IDDP file and about the original data source used to create it (RDB, DW or text file). The element contains the data as well as information about the variables (or attributes) that exist based on the extracted view. Each variable is identified by the element. Each element encapsulates a list of sub elements, where each of these sub elements encapsulates the values extracted from the data source of the IDDP file. This way, if the data source is a text file with 200 columns and 1000 records then the element will have 200 elements, and each will contain 1000 sub elements, one for each value of the data source.
4.2. DPDM Documents
DPDM documents store data preparation projects which, in turn, encapsulate guidelines and processed data used for the transformation of values of IDDP files into values of IDDM files. All the data preparation projects contained in a DPDM file are generated based on values of an IDDP file. The guidelines of the data preparation project and the prepared data are delimited by elements named that are inside the root element named . The element contains two subelements: one called and the other called . The first one encapsulates information about the data preparation project itself, whereas the second one contains all the guidelines and necessary data to map IDDP files into IDDM files. Inside a DPDM file, the element is composed of a list of sub element specifications named . Each element is related to a variable contained in the associated IDDP file and contains at least three sub elements: , and . The first one contains a list of specifications of the methods adopted to generate the transformed values, the second one contains some statistics about the data and the third one contains a list of distinct values of the related variable. This list is formed by the values of the IDDP file and by their transformed values, calculated through the use of the transformation methods specified in . Typically, categorical variables will have a restricted number of distinct values, whereas numerical and date variables will often form a huge list of distinct values.
4.3. IDDM Documents
The IDDM files of DMPML store the transformed data, ready to be used as input data to some specific data mining algorithm. IDDM files are generated based on three distinct files: one IDDP file with the original (non processed) data, one DPDM file with the related data preparation project and one XSL transformation file for the selected mining algorithm. Every IDDM file contains a root element named . Inside this root element, there are two other elements: and . The former encapsulates information about the IDDM file and the latter contains the transformed values of each variable contained in the associated IDDP file. The element contains a list of sub elements. Each element encapsulates a list of sub elements which encapsulate the transformed data, ready to be used by some specific data mining algorithm. The number of elements and of an IDDM file are always the same as those observed in the IDDP file. This happens because the variables and the stored values in the IDDM files are simply the IDDP values
after the transformation is applied.
Figure 1. Transformation of IDDP files (original data) into IDDM files (processed data) during the sub phases of the data preparation phase.
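The codification sub phase described at the start of this section can be illustrated with a short Python sketch. It is not part of DMPML: the function names, the choice of min-max scaling, the use of a single bit turned on (N = 1), and the fixed interval width of 20 are assumptions made only for illustration.

```python
def normalize(value, lo, hi):
    """Map a numeric value linearly from [lo, hi] into [0, 1] (assumed min-max scaling)."""
    if hi == lo:
        return 0.0
    return (value - lo) / (hi - lo)


def binary_code(category, categories):
    """Return a code of length M = len(categories) with N = 1 bit turned on."""
    index = categories.index(category)
    return "".join("1" if i == index else "0" for i in range(len(categories)))


def discretize(value, width=20):
    """Name the fixed-width interval that contains a non-negative number."""
    start = int(value // width) * width
    return f"from {start} to {start + width}"


if __name__ == "__main__":
    print(normalize(40, 0, 100))
    print(binary_code("WIDOW", ["MARRIED", "WIDOW", "SINGLE"]))
    print(discretize(10))
```

With the category order assumed here, the marital status codes come out as 100, 010 and 001, and the number 10 falls in the interval named “from 0 to 20”, as in the text. A real project would take the bounds, the category list and the intervals from the data preparation project rather than from hard-coded arguments.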
5. Example
In this section, we present a working example of our framework to make the preprocessing phase using XSLT easier to understand. We shorten the files and emphasize the main components of the framework. We present: an IDDP file, containing the raw data, and how it is created; a DPDM file, containing the pre-processing directives created based on the IDDP file; an XSL transformation file, which obtains the codifications to be used by the neural network algorithm; and finally an IDDM file, resulting from the data transformation, which contains the codifications ready to be used by a data mining algorithm.
5.1. IDDP File
Initially the raw data are stored in a database, a data warehouse, etc. The relevant data should be selected and exported to the IDDP file format. Nowadays many databases export data in some XML format. An XSL transformation file can then be used to convert the data from the format exported by the database to the IDDP format. If an XML database is used, we provide the XML Schema of the IDDP file, which the database uses to check the validity of the file automatically. Below is an example of an IDDP file that contains the original data of some bank clients.
[IDDP listing; element markup not recoverable. Surviving values, by variable:]
Age: 30, 35XX, 40, 50, 33
Marital status: SINGLLE, MARRIED, SINGLE, WIDOW, SINGLE
Account opening date: 10/05/2003, 09/04/1999, 20/05/2002, 04/08/2000, 12/07/2001
Good payer: 0, 1, 1, 0, 1
In the example above, there are four variables to represent a client. The first one is a numeric variable that represents the age of the bank's client. The next variable is categorical and indicates the marital status of the client. The third variable is a date and represents the date the client opened an account in the bank. The last variable is categorical and indicates whether the client is a good or bad payer based on the usage of his/her bank credit card, where 0 (zero) is considered a bad payer and 1 (one) a good one.
5.2. DPDM File
A DPDM file is generated based on the IDDP file presented in the previous subsection. This conversion requires a program that asks the specialist for the outlier and missing-value intervals, the values that should replace outliers and missing values (typically average and extreme values), which values are inconsistent, and how they should be treated. The program then uses the values given by the specialist to generate the file below, performing the codifications needed by the data mining algorithm through an option in the program. The program currently processes data for the neural network data mining algorithm. It is being improved in the number of codifications it can generate and in its graphical user interface, to provide better usability.
If we look carefully at the example above, we can see that some values are inconsistent. For example, there is one age with value 35XX and a marital status with value SINGLLE. The specialist ought to tell the program to consider SINGLLE values as SINGLE. Below, we show an example of a DPDM file generated from the IDDP file presented before.
[DPDM listing; element markup not recoverable. Surviving values: ... average extreme ...]
In the element above we present the information regarding the AGE variable. It contains the that indicates the intervals of the outliers and missing values informed by the specialist in the DPDM program generator. The element contains the original and coded values to be passed to the data mining program. It may contain a list of coded values. In addition, the same DPDM file can be used as an input to many data mining algorithms. We can see that the subelements with value 35XX can be substituted by the value 35, which was informed by the specialist. The list of coded values can contain any values that will be used by data mining algorithms. In the next subsection we present the code that extracts the correct codification for the neural networks algorithm.
5.3. XSLT File
The data transformation process works as follows: each value of the IDDP file is coded to a value that will be used by a specific data mining algorithm. The XSLT file chooses which specific value from the coded list is to be used. It also checks whether the value is an outlier or missing and, if so, it queries the element and retrieves the proper value that must substitute the original value, such as the average or an extreme. The intervals and their codification are specified in the element. Below we present a template used to get the codification of outliers and missing values for artificial neural networks.
The next template extracts the correct codification based on the type of the variable. For each value in the IDDP file it finds a element which contains a subelement whose name attribute is equal to “original” and whose value attribute is equal to the value presented in the IDDP file. If it is a categorical variable, it obtains the binary code through the element whose name attribute value is equal to “binaryCode”. If it is a date variable, it obtains the normalized value of the number of days since a specific date, in this case, 01/01/1970. If the variable is numeric, it uses the template presented before to check whether the value is an outlier or missing and, if so, to discover the proper value to substitute it.
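The logic of the two templates described above can be mimicked outside XSLT. The Python sketch below is only an illustration and not the authors' stylesheet: the dictionary standing in for a DPDM entry, its key names, the valid range of 18 to 90 and the substitute values are all hypothetical. The date rule (days since 01/01/1970 divided by the largest such count among the values) is likewise an assumption, although it is consistent with the date codes listed in Section 5.4.

```python
from datetime import date

EPOCH = date(1970, 1, 1)

# Hypothetical stand-in for the DPDM entry of the AGE variable.
# Key names and numbers are assumptions, not the DMPML schema.
AGE_RULES = {
    "corrections": {"35XX": "35"},   # inconsistent value -> value given by the specialist
    "valid_range": (18, 90),         # values outside this range count as outliers
    "substitutes": {"missing": 40.0, "outlier": 90.0},  # e.g. average and extreme
}


def clean_numeric(raw, rules):
    """Return the number to use for a raw value, replacing missing values and outliers."""
    raw = rules["corrections"].get(raw, raw)
    if raw is None or raw.strip() == "":
        return rules["substitutes"]["missing"]
    value = float(raw)
    low, high = rules["valid_range"]
    if value < low or value > high:
        return rules["substitutes"]["outlier"]
    return value


def days_since_epoch(text):
    """Parse a DD/MM/YYYY date and count the days elapsed since 01/01/1970."""
    day, month, year = (int(part) for part in text.split("/"))
    return (date(year, month, day) - EPOCH).days


def normalize_dates(texts):
    """Divide each day count by the largest one, so the latest date maps to 1.0."""
    counts = [days_since_epoch(t) for t in texts]
    largest = max(counts)
    return [count / largest for count in counts]


if __name__ == "__main__":
    print(clean_numeric("35XX", AGE_RULES))
    print(normalize_dates(["10/05/2003", "09/04/1999", "20/05/2002"]))
```

Keeping the corrections, range and substitutes in a data structure rather than in the code mirrors the separation the paper argues for: the rules live in the DPDM file, and the transformation only reads them.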
[XSLT template listing; markup not recoverable. Surviving values: missing outlier]
Here is the big advantage of the proposed framework: based on a single DPDM document it is possible to associate many different XSLT files to convert raw data into transformed data ready to be used by data mining algorithms. Each XSLT file needs to be defined only once and it only retrieves the information that is necessary for its specific data mining algorithm. There can be one XSLT file for artificial neural networks, another for decision trees, and so on, all without any change in the DPDM document. Once an XSLT file is created for a specific data mining algorithm, it can be used to get the information for that algorithm from any DPDM file.
5.4. IDDM File
The document that follows is the output file from the transformation process using the XSLT file over the IDDP and DPDM files. It is easy to see that it is very similar to the IDDP file. All its variable contents are coded values ready to be used by an artificial neural network.
[IDDM listing; element markup not recoverable. Surviving values, by variable:]
Age: 0.2, 0.5, 0.4, 0.8, 0.2
Marital status: 001, 100, 001, 010, 001
Account opening date: 1, 0.877524216, 0.970858644, 0.917172878, 0.945247086
Good payer: 1, 0, 0, 1, 0
6. Conclusion
In this paper we proposed a unified format for the standardization and documentation of the preprocessing phase in the KDD process. This format, based on the DMPML specification, does not depend on a specific data mining software system and can be applied to data preparation projects regardless of their level of difficulty. Besides the standardization and good documentation, other relevant benefits offered by DMPML are:
• No need to use a relational database to store information generated by the preprocessing sub phases (selection, cleaning, enrichment and codification), because data will be stored in XML files. So, we are not locked into a database that may not offer tools to perform the tasks needed by the preprocessing sub phases, or may offer tools that do not perform the tasks satisfactorily;
• No need to implement special software to transform raw data into data ready to be applied in a specific data mining algorithm. This eliminates the necessity
to develop proprietary code, resulting in less development time and fewer bugs to correct in the implementation of such code. The data transformation tasks will be done by XML parsers, making it possible to test many parsers to verify which one satisfies the needs of the project, such as efficiency, portability, etc.;
• To test the efficiency of this program in situations where the quantity of variables and values of raw data is very high. In this case, the usage of DOM or SAX to create the file of pre-processed data and data transformation would be compared to identify the most efficient option;
• To change this program and make it show a list of possible values that can substitute an inconsistent value and offer the possibility to the user indicate explicitly the value to be substituted, if none of the values presented by the program is correct;
• It is possible to create specific transformation rules (XSL) to specific data mining algorithms without redefining the data processing project (DPDM) related to the raw data (IDDP). The XSL transformation rules are similiar. Their only difference is on the values that they will obtain based on the data mining algorithm.
• Inclusion of new elements whose interpretation is defined by the user, so that he/she can add data not supported by DMPML;
• It is not necessary to create a DPDM file to each data mining algorithm. One DPDM file can contain as many codifications as needed to match different data mining algorithms;
• Integrate the DMPML language with other XML languages developed to represent the KDD process. We could use DMPML to perform the data preparation phase, send the data to the KDDML engine where the data mining process would occur and generate a PMML document with the knowledge discovered.
• Great potential of reusing XML files with preprocessed data (DPDM) and XSL rules of a mining project in another one that uses data with similar attributes. It is not necessarily needed to rebuild the DPDM file if the IDDP file changes. If the new data are similar to the ones of the IDDP file, their codification and information to perform the transformation are present in the DPDM file, because it is separate from the raw data. Only if the new raw data are very different from the original ones it is needed to rebuild the DPDM file.
• Create a plugin for a XML database to integrate the DMPML into it. By doing this, the data preparation phase would happen inside the database, without the need to create a separate environment to pre-process data.
At this point, the DMPML is composed of:
References
• A program that receives an IDDP file as input and outputs a DPDM file with the necessary information to the data transformation. It creates the necessary codifications to the neural networks algorithm, coding numeric variables (through the normalized similar value), categoric variables (by the usage of a binary code) and a date type (normalizing the quantity of days counted based on a specific date), among other codifications;
[1] P. Alcamo, F. Domenichini, and F. Turini. An XML Based Environment in Support of the Overall KDD Process. In Proceedings of the Fourth International Conference on Flexible Query Answering Systems (FQAS2000), pages 413– 424, Warsaw, Poland, 2000. [2] M. Botta, J.-F. Boulicaut, C. Masson, and R. Meo. A Comparison between Query Languages for the Extraction of Association Rules. In Proceedings of the 4th International Conference on Data Warehousing and Knowledge Discovery, pages 1–10. Springer-Verlag, 2002. [3] M. Botta, J.-F. Boulicaut, C. Masson, and R. Meo. Query Languages Supporting Descriptive Rule Mining: A Comparative Study. In Database Support for Data Mining Applications, pages 27–54. Springer-Verlag, 2004. [4] T. Bray, J. Paoli, C. M. Sperberg-McQueen, E. Maler, and F. Yergeau. Extensible Markup Language (XML). 1.0. (Third Edition). W3C Recommendation, February 2004. http://www.w3.org/TR/REC-xml/. Accessed on September, 16th, 2007. [5] J. Clark. XSL Transformations (XSLT). Version 1.0. W3C Recommendation, November 1999. http://www.w3.org/TR/xslt. Accessed on September, 16th, 2007.
• DTD’s and XML Schemas to the three kinds of files that compose the DMPML structure so it is possible to guarantee that these files are well formed and valid; • A XSL transformation file to the neural networks data mining algorithm. As future work to be added to the DMPML language are the following topics: • To extend the program that generates DPDM files to create more codifications to other mining algorithms, in addition to neural networks, such as decision trees, regression, genetic algorithms, etc.;
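As an illustration of the three neural network codifications mentioned above (normalized numeric values, binary codes for categorical values, and normalized day counts for dates), the following Python sketch shows one way such codifications could be computed. It is not the DMPML program itself: the function names, the min-max scaling to [0, 1], and the reading of "binary code" as the fixed-width binary digits of the category index are all assumptions made for this example.

```python
from datetime import date


def normalize_numeric(value, minimum, maximum):
    """Min-max scale a numeric value into [0, 1] (assumed normalization)."""
    if maximum == minimum:
        return 0.0  # degenerate range: every value maps to 0
    return (value - minimum) / (maximum - minimum)


def binary_code(category, categories):
    """Encode a category as the fixed-width binary digits of its index.

    Assumption: "binary code" is read here as a positional binary
    encoding; a one-hot code would be another valid reading.
    """
    width = max(1, (len(categories) - 1).bit_length())
    index = categories.index(category)
    return [int(bit) for bit in format(index, "0{}b".format(width))]


def normalize_date(day, reference, latest):
    """Scale the days elapsed since a reference date into [0, 1]."""
    span = (latest - reference).days
    if span == 0:
        return 0.0  # reference and latest date coincide
    return (day - reference).days / span


# Hypothetical sample values, chosen only for illustration.
print(normalize_numeric(30, 20, 60))                      # 0.25
print(binary_code("blue", ["red", "green", "blue"]))      # [1, 0]
print(normalize_date(date(2007, 1, 11),
                     date(2007, 1, 1),
                     date(2007, 1, 21)))                  # 0.5
```

In a DMPML setting, values of this kind would be stored in the DPDM file, so that an XSLT file only has to select the codification that suits its data mining algorithm.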