Automating Data Preprocessing with DMPML and ... - Semantic Scholar

2011 10th IEEE/ACIS International Conference on Computer and Information Science

Automating Data Preprocessing with DMPML and KDDML

Paulo M. Gonçalves Jr.∗† and Roberto S. M. Barros∗

∗ Centro de Informática, Universidade Federal de Pernambuco, Cidade Universitária, 50.740-560, Recife, Brasil. Email: {pmgj,roberto}@cin.ufpe.br
† Instituto Federal de Pernambuco, Cidade Universitária, 50.740-540, Recife, Brasil

Abstract—This paper presents a graphical application for the Data Mining Preparation Markup Language (DMPML), which is an XML application designed to represent the data preparation phase of the KDD process. DMPML supports the reuse of data preprocessing directives, using XSLT to map raw data into data ready to be used by many data mining algorithms. The application presented here, DMPML-TS, automates the data preparation phase, speeding up the codification and transformation of data, and makes it easier to apply different data mining algorithms to the same or similar data, based on their codification stored in separate XML documents. This paper also presents improvements made to DMPML, such as the adoption of XRFF for input and output data and the use of only one XSLT file for data transformation. We also present the integration of DMPML-TS and KDDML, an XML language used to represent data, mining models, and queries.

I. INTRODUCTION

Data mining is one of the phases of a process known as Knowledge Discovery in Databases (KDD) [1]. The process is divided into four main phases: domain exploration, data preparation, data mining, and interpretation of results. The first phase is responsible for understanding the problem and what data will be used in the knowledge discovery process. The next phase selects, cleans, and transforms the data to a format that is suitable for a specific data mining algorithm. In the third phase, the chosen data mining algorithm applies intelligent techniques to discover patterns that can be of potential use. The last phase is responsible for manipulating the extracted patterns to generate knowledge that humans can interpret.

Most of the research carried out in this area focuses on the data mining phase, which uses artificial intelligence algorithms like decision trees, artificial neural networks, and evolutionary computation, among others [2], to discover knowledge. On the other hand, the data preparation phase, responsible for integration, cleaning, and transformation of data, has not been the subject of much research. In fact, Pyle [3] argues that “data preparation consumes 60 to 90% of the time needed to mine data – and contributes 75 to 90% to the mining project’s success”.

There are some approaches to perform the KDD process. One of them takes into consideration the relational data model. As this is the database model prevalent nowadays, several research projects have been carried out aimed at extending SQL, the language used to extract data from databases, to perform some data mining tasks, like finding association and discrimination rules [4], [5], [6].

Another approach aims to represent each step of the KDD process using XML [7] documents. By using XML it is possible to store the chain of operations executed and the data generated at each moment, automating the process and eliminating the need to execute all the operations manually.

DMPML [8] is an XML-based language used to represent the data preparation phase of the KDD process. It allows for sharing preprocessing tasks, like codification, cleaning, and discretization, that can be applied to data to transform them automatically, using XSL transformations (XSLT) for this task. It thus reduces the time needed in the data preparation phase.

In this paper we present DMPML-TS, an application that provides tool support for DMPML. We introduce DMPML-TS through an example that performs the data preparation phase of a data mining project, from the raw data to processed data ready to be used by a number of data mining algorithms.

The rest of this paper is organized as follows: Section II surveys some languages and applications used in the KDD process and specifically in the data preparation phase; Section III gives more details on the concepts and foundations of DMPML; Section IV presents DMPML-TS together with a practical example; Section V presents the integration between DMPML and KDDML; and, finally, Section VI presents our conclusions and proposes future work.

978-0-7695-4401-4/11 $26.00 © 2011 IEEE DOI 10.1109/ICIS.2011.23

II. LANGUAGES APPLIED TO THE KDD PROCESS

There are two main approaches that can be used to perform the KDD process. One of them is extending SQL with some new primitives that allow the database to perform tasks that are related to the KDD process, like finding discriminant rules, association rules, classification rules, and characteristic rules. Three query languages act as described above: MSQL [4], MINE RULE [5], and DMQL [6]. Of the three, DMQL supports the most data mining algorithms. The three languages work very similarly and have primitives to specify:

1) Relevant data: so it is possible to specify which parts of the database have the important data to be mined, so that it is not necessary to search the whole database;
2) Types of knowledge to be mined: specifies the data mining algorithms to be used;
3) Previous knowledge to be used in the data mining process: this is usually represented by concept hierarchies;
4) Limits and metrics to evaluate the discovered patterns: these reflect thresholds like support and confidence; and
5) Visual representation of the discovered patterns.

Another approach to perform the KDD process tries to standardize how to represent each of its phases. With a standard, it becomes easy to use different tools for different phases of the process, exploiting the strengths of each tool. The language most used for this purpose is XML. Being a textual format, it allows data to be exchanged easily between different platforms and operating systems.

One application language based on XML is PMML (Predictive Model Markup Language) [9]. Its main objective is to represent data mining models (the knowledge) resulting from the data mining phase of the KDD process.

KDDML (KDD Markup Language) “is a middle-ware language and system designed to support the development of final applications or higher level systems which deploy a mixture of data access, data preprocessing, extraction, and deployment of data mining models” [10]. KDDML is heavily based on XML as a representation language for data, models, and queries. KDDML has been designed by considering the KDD process as a query process, where the operations within a query can be nested. This means that it is possible to interactively create queries, execute them, obtain results, and process new queries on the results obtained, i.e., it supports the closure principle. When queries are nested, internal queries are executed first and their results are passed to their parent queries, until the query root is reached, in a scheme similar to a tree.
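The nested execution order described above (inner queries first, results handed to the parent) amounts to a bottom-up traversal of the query tree. The following is a minimal sketch of that idea only; the operator names `LOAD`, `FILTER_POSITIVE`, and `UNION` are hypothetical placeholders and are not taken from the KDDML language.

```python
import xml.etree.ElementTree as ET

# Hypothetical operators: names and semantics are illustrative assumptions,
# not part of the KDDML specification.
OPERATORS = {
    "LOAD": lambda node, inputs: [int(v) for v in node.get("values").split(",")],
    "FILTER_POSITIVE": lambda node, inputs: [v for v in inputs[0] if v > 0],
    "UNION": lambda node, inputs: sorted(set().union(*inputs)),
}


def evaluate(node):
    """Evaluate a query tree bottom-up: children first, then the parent."""
    inputs = [evaluate(child) for child in node]
    return OPERATORS[node.tag](node, inputs)


QUERY = """
<UNION>
  <FILTER_POSITIVE>
    <LOAD values="-2,3,5"/>
  </FILTER_POSITIVE>
  <LOAD values="5,8"/>
</UNION>
"""

# The result of every inner query is an ordinary list, so it can feed any
# other operator; this mirrors the closure principle mentioned in the text.
result = evaluate(ET.fromstring(QUERY))
print(result)
```

Because every operator returns the same kind of value it accepts, queries can be nested to any depth without special cases.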
To perform the data preparation phase, the most modern approach (as presented by the next two tools) is to select the preprocessing tasks and connect them into a directed graph where the nodes correspond to the operators and the edges correspond to the order in which the operators are executed. The graph starts with the raw data set, and the tasks change the original data until it is ready to be given to a specific data mining algorithm.

RapidMiner [11] is a free open-source environment (hosted at SourceForge) for KDD and machine learning. One of its characteristics is that it models knowledge discovery processes as operator trees (as in KDDML), while offering some concepts not present in KDDML, such as loops. RapidMiner graphically represents these models as a directed graph.

Another tool used to perform the data preparation phase of KDD is Weka [12]. Weka is a collection of machine learning algorithms for data mining tasks and data preprocessing. It is open-source software, and many other tools, like KDDML and RapidMiner, use its algorithms internally. The data preparation phase in Weka can be performed in two different ways. If the KnowledgeFlow perspective is selected, users can select operators and connect them into a directed graph (as in RapidMiner) that processes and analyzes data. This perspective is better suited to automating the KDD process. If the Explorer perspective is selected, it is possible to apply each operator separately, to check its results, to compare it with other operators, and to analyze statistics about the results. This perspective is better suited to testing each operator separately and analyzing its results before using the KnowledgeFlow perspective.

The default format used to represent data sets in Weka is ARFF [13]. It is very similar to a CSV [14] file, with added information about each attribute, typically its name and type. There is now also an XML file format based on ARFF, called XRFF [15]. It supports more features than ARFF, such as class attribute specification (which defines the attribute that acts as the class attribute), attribute weights, and instance weights.

The directed graph approach has some disadvantages too. If the data set changes, some (or even all) of the preprocessing tasks must be executed again. If the number of tasks or the data set is small, there is no big loss in time and effort. However, if the number of tasks or the data set is large, there will be a big waste of (1) effort, to recreate the data preprocessing tasks and tune their parameters, and (2) time, to re-execute all the steps. It would be very useful to have a way to minimize this effort. This was one of the first motivations to develop DMPML [8].

III. DMPML

DMPML [8] is a language that can be used to deal with the data preparation phase. The original version of DMPML uses four types of documents: IDDP (Input Data for Data Preparation) files that represent the original data selected and extracted from any type of data source; DPDM (Data Processing for Data Mining) files that store the codifications of the input data for possibly many data mining algorithms; an XSLT file that contains declarative rules that map the original values contained in the IDDP files into data ready to be used in data mining; and IDDM (Input Data for Data Mining) files that store transformed data ready to be used as input data by a specific mining algorithm. IDDM files are generated from IDDP files that are processed and transformed using the information and transformation directives contained in the DPDM and XSLT files.

One of the main advantages of DMPML over the other languages is that it stores the preprocessing directives in a separate XML document (the DPDM file). This scheme allows the reuse of the same directives when a new data extraction is made and the data has the same characteristics. Because the source data and the preprocessing directives are both XML documents, it is possible to choose the correct codification for a specific data mining algorithm by just applying appropriate XSLT rules. This simplifies the data transformation subphase because there is no need for specific software to perform it; an XML parser can execute the XSLT file. More information about DMPML, such as the language structure, elements and attributes, and XSLT format, is given in [8].

Fig. 1. Transformation of XRFF input files (original data) into XRFF output files (processed data).

A. DMPML Improvements

Figure 1 presents the sequence in which the phases of data preparation are executed, beginning with an XRFF input file, creating a DPDM file, and finally applying an XSLT file to generate an XRFF output file for a specific mining algorithm. Generating another output file for another mining algorithm is quick and easy.

The first improvement made to the original DMPML language was the substitution of the IDDP and IDDM files with the XRFF file format. IDDP and IDDM files were used only in the DMPML language. Thus, to use DMPML as it was originally proposed, it was always necessary to make two conversions: first, from the original source file to IDDP, and later, from the IDDM result file to the file format needed by the data mining algorithm. Usually these file formats are CSV or ARFF. So, the objective of the substitution of IDDP and IDDM files with XRFF files was to improve its interoperability, making the integration of DMPML with other tools much easier. Weka and RapidMiner, for example, support XRFF natively. Another benefit of using XRFF is that it stores some characteristics not present in either IDDP or IDDM files like class attribute specification, attribute weights, and instance weights. Another improvement made to DMPML was concerning the XSLT usage. The original paper proposed that, for each new data mining algorithm to be used, there would be an XSLT file that contained the type of codifications to be applied. Based on this information, the specified codifications in the DPDM file would be used to create the IDDM file. So, to add support to new data mining algorithms, the user needed to directly provide the appropriate XSLT files. We simplified the original proposal by defining a single XSLT file that does not have the type of codification of the attributes hard-coded inside it and, thus, can be used by any data mining algorithm. The user can now choose the type of codifications to be applied to each type of variable of a data mining algorithm using a graphical interface (presented in the next section). After associating the codifications to the data mining algorithms, every time the user chooses to create an XRFF file, the application passes the types of codifications as parameters to the XSLT file so it can obtain the correct coded values in the DPDM file. 
Using this approach, it is not necessary to create new XSLT files to represent data mining algorithms. This simplifies the process and frees the user from manipulating XML files directly, which was one of the main drawbacks of the original DMPML proposal. To work, the codification must be already stored in the DPDM file, so the XSLT can get the appropriate value for a specific XRFF file.
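The parameterised transformation described above can be imitated in a few lines to make the data flow concrete. This is only a sketch under stated assumptions: the actual DMPML-TS transformation is an XSLT file, whereas the code below emulates the same lookup in Python, and the DPDM layout shown (`dpdm`, `value original=...`, `codification type=...`) as well as the codification names `binary` and `one_of_n` are hypothetical, not the schema defined in [8]. The XRFF input is a simplified instance of the Weka format.

```python
import xml.etree.ElementTree as ET

# Hypothetical DPDM-like layout: one <attribute> per variable, one <value>
# per original value, one <codification> per supported coding scheme.
DPDM = (
    '<dpdm>'
    '<attribute name="sex" type="categoric">'
    '<value original="MALE">'
    '<codification type="binary">0</codification>'
    '<codification type="one_of_n">1,0</codification>'
    '</value>'
    '<value original="FEMALE">'
    '<codification type="binary">1</codification>'
    '<codification type="one_of_n">0,1</codification>'
    '</value>'
    '</attribute>'
    '</dpdm>'
)

# Simplified XRFF input with one nominal and one numeric attribute.
XRFF_INPUT = (
    '<dataset name="credit"><header><attributes>'
    '<attribute name="sex" type="nominal"/>'
    '<attribute name="income" type="numeric"/>'
    '</attributes></header><body><instances>'
    '<instance><value>MALE</value><value>1200</value></instance>'
    '<instance><value>FEMALE</value><value>800</value></instance>'
    '</instances></body></dataset>'
)


def build_lookup(dpdm_root, params):
    """Map (attribute name, original value) to the coded value selected by params."""
    lookup = {}
    for attr in dpdm_root.iter("attribute"):
        wanted = params.get(attr.get("type"))
        for value in attr.iter("value"):
            for cod in value.iter("codification"):
                if cod.get("type") == wanted:
                    lookup[(attr.get("name"), value.get("original"))] = cod.text
    return lookup


def transform(xrff_text, dpdm_text, params):
    """Replace each original value by its stored codification, if one exists.

    The header is left unchanged for brevity; values without an entry in the
    DPDM document pass through untouched.
    """
    root = ET.fromstring(xrff_text)
    names = [a.get("name") for a in root.iter("attribute")]
    lookup = build_lookup(ET.fromstring(dpdm_text), params)
    for instance in root.iter("instance"):
        for name, value in zip(names, instance.findall("value")):
            value.text = lookup.get((name, value.text), value.text)
    return root


def coded_values(params):
    return [v.text for v in transform(XRFF_INPUT, DPDM, params).iter("value")]


# The same DPDM document serves two algorithms; only the parameters differ.
binary_values = coded_values({"categoric": "binary"})
one_of_n_values = coded_values({"categoric": "one_of_n"})
print(binary_values)
print(one_of_n_values)
```

The point of the design is visible in the last lines: switching algorithms changes only the parameter dictionary, never the transformation code or the stored codifications.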

IV. DMPML-TS

This section presents the graphical application developed to provide tool support for the DMPML language and presents an example to show how to use it. The data used in this example is available in Weka and represents a credit problem. The data contains 600 events (rows) with 12 attributes (columns): three numerical variables and nine categorical variables.

The application tool presented here (DMPML-TS), which was used to perform the codification and transformation phase, starts with two tabs, corresponding to the input file and the DPDM file. Every time the tool applies the transformation, a new tab containing the result is created. The application has a menu with options to deal with these files.

The first step in the process of data codification is to choose the data that needs to be encoded and transformed by the application. The current version of DMPML-TS supports three types of files: XRFF, ARFF, and CSV. XRFF files are supported natively. When an ARFF or CSV file is used, the application automatically converts it to XRFF and presents it in the Input tab. It is recommended to save the converted file, so that the application does not need to convert it every time the file is used. Even if the XRFF representation is not saved to a separate file, the application still works with the XRFF representation shown in the Input tab. Figure 2 presents the XRFF representation of an ARFF file opened by the program.

Now that the XRFF input file is available, the next step is to create the corresponding DPDM file. This step implements the codification phase. The DPDM file stores the encoded values generated by the application. They represent how the values of the input file should be converted in the transformation phase for all data mining algorithms supported by DMPML-TS. Each attribute is represented in an element and, inside this element, there is a collection of elements that represent the values of this attribute.
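The automatic conversion of a CSV file to XRFF mentioned above can be sketched as follows. This is an illustration, not the DMPML-TS implementation: the type suggestion rule (numeric if every present value parses as a number, nominal otherwise) is an assumption, date detection is omitted, and the output is a simplified XRFF document using the element names of the Weka format.

```python
import csv
import io
import xml.etree.ElementTree as ET

MISSING = ("", "?")  # assumed markers for missing cells


def suggest_type(values):
    """Suggest 'numeric' if every present value parses as a float, else 'nominal'."""
    try:
        for v in values:
            if v not in MISSING:
                float(v)
        return "numeric"
    except ValueError:
        return "nominal"


def csv_to_xrff(csv_text, relation="data"):
    """Build a simplified XRFF tree from CSV text whose first row holds the names."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, records = rows[0], rows[1:]
    dataset = ET.Element("dataset", name=relation)
    attributes = ET.SubElement(ET.SubElement(dataset, "header"), "attributes")
    for i, name in enumerate(header):
        column = [r[i] for r in records]
        kind = suggest_type(column)
        attr = ET.SubElement(attributes, "attribute", name=name, type=kind)
        if kind == "nominal":
            # Nominal attributes list their admissible labels in the header.
            labels = ET.SubElement(attr, "labels")
            for label in sorted({v for v in column if v not in MISSING}):
                ET.SubElement(labels, "label").text = label
    instances = ET.SubElement(ET.SubElement(dataset, "body"), "instances")
    for record in records:
        instance = ET.SubElement(instances, "instance")
        for cell in record:
            value = ET.SubElement(instance, "value")
            if cell in MISSING:
                value.set("missing", "yes")
            else:
                value.text = cell
    return dataset


SAMPLE = "age,married\n48,YES\n,NO\n"
xrff = csv_to_xrff(SAMPLE, relation="credit")
print(ET.tostring(xrff, encoding="unicode"))
```

Since CSV carries no type information, the suggested types are only a starting point that the user would confirm or correct.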

To automatically convert these values for a specific data mining algorithm, it is necessary to inform the application what the type of each attribute is. Based on the type of the variable, the program asks a number of questions to determine what kinds of values the data should be converted to. If the variable has a declared type (for example, XRFF and ARFF files can describe the type of each variable in their header section), the application suggests this type to the user for confirmation. Otherwise (for example, CSV files do not explicitly state the types of the variables), the application suggests one type and asks the user to confirm it or provide the correct one, as presented in Figure 3. The current version of DMPML-TS can deal with three kinds of variables: numeric, categorical, and date.

Fig. 3. Program suggests some variable types for the user to choose.

If a variable is numeric, the program asks the user to provide the boundaries of outliers and missing values, as well as how to treat these values, for example, whether it should use average or extreme values to represent them. This information is used later to automatically choose the correct representation of a data value. Notice that it is also possible to provide similar values to substitute malformed or incorrect data, for all types of data. Figure 4 shows the user entering the lower limit for missing values. In this case, the variable is of numeric type. In addition to this value, the user needs to provide other values that the program uses to build the DPDM document.

Fig. 4. Program asks the user to enter a series of values.

After the appropriate values are provided, the program generates the DPDM file, which can be viewed on the DPDM tab. It is also recommended to save this file so it will not be necessary to regenerate it every time a transformation is required. If a DPDM file was previously created and saved, it is possible to load it and to apply it to the respective XRFF file without any new effort.

To allow the manipulation of large data sets, DMPML-TS uses SAX [16] instead of loading the whole data set into main memory before performing the codification. SAX parses the XRFF input file and raises events when it finds starting elements, closing elements, etc. It therefore does not need to store a representation of the input file in memory, which allows the use of large data sets.

It is important to notice that this step does not have to be done every time a new data mining algorithm needs to be applied. Because a single DPDM file can store the codification for several data mining algorithms, if the value needed by that algorithm is present in the DPDM file, the transformation can be done automatically. Nor does this step need to be repeated if another extraction is made and the same variables (with the same characteristics) are obtained. This greatly simplifies the codification and transformation phases of the KDD process and is one of the main strengths of DMPML and DMPML-TS. The application also provides support for joining multiple sources of raw data into one single DPDM file.

After the creation of the DPDM file, the user can select how to transform the input files, based on the data mining algorithm that will be used, as shown in Figure 5. A list containing the supported algorithms is presented to the user, so he/she can choose one of them.
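The event-driven SAX parsing described above can be combined with the outlier and missing-value boundaries the user provides for a numeric variable. The sketch below streams a simplified XRFF document and profiles one numeric column without building a tree in memory. The class name `NumericProfile`, its parameters, and the choice to report the mean of the in-range values (one possible replacement value) are assumptions made for illustration.

```python
import xml.sax


class NumericProfile(xml.sax.ContentHandler):
    """Count missing and out-of-bounds values of one numeric column while streaming."""

    def __init__(self, column, lower, upper):
        super().__init__()
        self.column, self.lower, self.upper = column, lower, upper
        self.index = -1        # position of the current <value> in its <instance>
        self.in_value = False  # True only while inside the selected column
        self.buffer = []
        self.missing = self.outliers = self.in_range = 0
        self.running_sum = 0.0

    def startElement(self, name, attrs):
        if name == "instance":
            self.index = -1
        elif name == "value":
            self.index += 1
            self.in_value = self.index == self.column
            self.buffer = []
            if self.in_value and attrs.get("missing") == "yes":
                self.missing += 1
                self.in_value = False

    def characters(self, content):
        # SAX may deliver the text of one element in several chunks.
        if self.in_value:
            self.buffer.append(content)

    def endElement(self, name):
        if name == "value" and self.in_value:
            self.in_value = False
            text = "".join(self.buffer).strip()
            if not text:
                self.missing += 1
                return
            number = float(text)
            if number < self.lower or number > self.upper:
                self.outliers += 1
            else:
                self.in_range += 1
                self.running_sum += number

    @property
    def mean(self):
        """Mean of the in-range values, usable as a replacement value."""
        return self.running_sum / self.in_range if self.in_range else None


XRFF = (
    '<dataset name="credit"><header><attributes>'
    '<attribute name="income" type="numeric"/>'
    '</attributes></header><body><instances>'
    '<instance><value>1200</value></instance>'
    '<instance><value missing="yes"/></instance>'
    '<instance><value>99999</value></instance>'
    '<instance><value>800</value></instance>'
    '</instances></body></dataset>'
)

profile = NumericProfile(column=0, lower=0, upper=10000)
xml.sax.parseString(XRFF.encode("utf-8"), profile)
print(profile.missing, profile.outliers, profile.mean)
```

Only a few counters are kept per column, so memory use stays constant regardless of how many instances the input file contains.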

Fig. 5. Data mining algorithms presented by the application.

The user can configure how the options above appear using the window presented in Figure 6, which consists of an editable list of algorithms to be viewed by the user and six buttons. Each button performs the following action:

• Save: Saves the names of the algorithms and the order in which they appear to the configuration file. Thus, when the user restarts the application, this information will still exist.
• Edit: Permits editing the properties of an algorithm.
• Up: Moves the selected algorithm one line up.
• Down: Moves the selected algorithm one line down.
• Add: Adds one line at the bottom of the list.
• Del: Deletes the selected algorithm.

Fig. 6. Edit data mining algorithm properties.

When the user chooses to edit the properties of a data mining algorithm, the window in Figure 7 is presented. Each tab corresponds to a type of attribute and presents a list of codifications for that type. The codifications selected by the user are passed to the XSLT file when he/she chooses to generate the XRFF output file.

Fig. 7. Codifications to be selected.

Each time the user asks the tool to create an output XRFF file, a new thread is created. This thread creates a tab in the user interface and performs the transformation. Therefore, it is possible to create the output XRFF files for multiple data mining algorithms in parallel, saving time and taking advantage of current multi-core processors.

After the application of a DPDM file over the XRFF input file using XSLT, the data is ready to be submitted to a data mining algorithm. It is possible to convert the output file to ARFF or CSV in DMPML-TS if the data mining application does not support XRFF.

Finally, DMPML-TS also has a command line version, with basically the same functionality as the graphical application. It can open an XRFF file, create a DPDM file based on an XRFF file, and create the XRFF output file specific to a data mining algorithm. It might be a good choice for experienced users if performance is an important requirement, as no graphical code is loaded into main memory.

V. KDDML INTEGRATION

To facilitate the usage of DMPML, DMPML-TS was integrated into the KDDML application. KDDML was chosen because it can perform both the data preparation and data mining phases of the KDD process, and because it is a simple yet powerful tool that uses XML to represent data, models, and queries. Another reason to choose KDDML is that it has good support for the PMML format, which simplifies the integration with DMPML-TS.

The integration between the DMPML-TS and KDDML tools was performed by adding the DMPML operations to the KDDML menu bar. We added options to create an XRFF input file, to create a DPDM file, to select the type of data mining algorithm to be used, and to create an XRFF output file. All these operations work as described in Section IV. The order of usage is the same as in the DMPML-TS tool. After each of these operations is executed, instead of showing the XML files in the KDDML window, these files are automatically saved in the KDDML folder and can be applied without having to choose them explicitly. This allows the user to graphically manipulate the DMPML files without even seeing the XML files generated by the application. To allow the integration with the KDDML application, which uses a proprietary data format, we added to DMPML-TS the ability to read KDDML data files.

This integration makes it possible to benefit from the advantages of both applications. By using KDDML, it is possible to store the sequence of operations made to prepare data. This sequence of operations is stored in XML files compatible with the PMML format. These XML files can be changed graphically using the KDDML application. The result of these operations can be exported to DPDM files that can serve as a data source for many different data mining algorithms.

The current version of KDDML does not support the storage of the codification for many data mining algorithms in a single file. So, if the user wants to use another data mining algorithm on the same data, he/she must change the sequence of operations, which usually takes much time because it demands a lot of user interaction, and execute it to generate the codification for that specific algorithm. Thus, DMPML is an appropriate tool to overcome this weakness of the KDDML application, making the application of another data mining algorithm very simple, without the need to create new data.

Another point of integration was the creation of a DMPML operator in the KDDML language. This operator, named PP_DMPML_XSLT, allows the XSLT execution to be performed in the KDDML tool, receiving as input the data set resulting from the other operators and generating as output the codified data for a specific mining algorithm. The PP_DMPML_XSLT operator has four attributes. The dpdm_file attribute indicates the path of the DPDM file that contains the codified data to be used in the transformation. The algorithm attribute specifies the data mining algorithm to be used. The xslt_file attribute indicates the path of the XSLT file that will perform the transformation. The last attribute, xml_dest, is a default attribute of all operators in KDDML and allows exporting the data set as an XML file.

VI. CONCLUSION

This paper presented DMPML-TS, a graphical application that provides tool support for the XML-based language DMPML, used to represent the data codification/transformation subphases of the data preparation phase of the KDD process. It also presented an example of the use of DMPML-TS. The current version of DMPML-TS supports the following functionality:

• Importing ARFF, CSV, and KDDML files, converting and saving them as XRFF files;
• Reading, saving, and using XRFF files as input;
• Creating DPDM files based on information provided by a specialist;
• Using SAX to parse XRFF input data, allowing the manipulation of large data sets;
• Saving and loading existing DPDM files;
• Graphical selection of codifications that represent data mining algorithms and applying them to the relevant XRFF and DPDM files to create an XRFF output file;
• Saving and exporting XRFF files as ARFF and CSV files;
• Opening files and executing XSLT files in threads;
• Integration with KDDML.

The example presented here started with the conversion of an ARFF file to the XRFF format, followed by the construction of the DPDM file through options presented to the user, and the generation of processed data based on an XSLT transformation file that is transparently used by the application when a data mining algorithm is selected by the user.

The main advantages of DMPML are (1) storing the encoded values in separate documents, so it is possible to use them many times without needing to regenerate them, and (2) an automatic selection of processed data based on the data mining algorithms used to discover knowledge.

Nowadays, the data preparation phase has to be completely or partially redone every time a new data mining algorithm needs to be applied to data or if the raw data has changed. When using an application like Weka, KDDML, or RapidMiner, it is possible to store the sequence of data preparation operations applied to the data. Using this functionality, the sequence of operations can be re-executed easily to generate new data for an algorithm if the raw data has changed. However, this only works for one specific data mining algorithm. If another algorithm is to be used, the sequence of operations is probably different. This requires changing the sequence of operations, executing it again, and then passing the results to the new data mining algorithm.

DMPML overcomes this problem, which represents a major advantage over the alternatives. The DPDM file is capable of containing the codifications of the data for many data mining algorithms. Moreover, creating a DPDM file is a simple task, performed by just providing some values about the data, like outliers and missing values, as well as how to convert them. After that, a data mining algorithm may be selected (simply by choosing the corresponding codifications of the attribute types). When the user chooses to create data for a specific algorithm, the application passes to the XSLT file (using parameters) the codifications to obtain from the DPDM file for that algorithm. These values are all stored in a configuration file, so when the user leaves and returns to the application, his/her preferences are still set. The user does not have to manipulate the XML files directly, so he/she does not need to know XML rules to use DMPML-TS, which is one of the main concerns regarding the XML language. An XRFF file is automatically created with the appropriate data for that specific algorithm.

The application tool presented here shows how this can be accomplished. All the steps are executed in a graphical environment. For example, the data mining algorithms can be chosen from a drop-down list. Using XML to represent data is another positive point of DMPML. Because the majority of data mining applications (Weka, KDDML, and RapidMiner, among others) nowadays use XML, it is reasonably simple to convert between these formats, and this is supported by DMPML-TS too.

As future work that we consider adding to the DMPML application, we include the following:

• Creating a drag and drop application for an XML database to perform the data preparation tasks (like removing attributes, creating new attributes based on operations on existing attributes, filtering rows, etc.) inside the database, without the need to create a separate environment to preprocess the data. This application would be based on the directed graph approach, as in Weka and RapidMiner. Another advantage is that there would be no file format conversions between applications and databases, increasing the performance of the preparation phase. Another minor advantage would be that performing the KDD process inside a database would mean the user would not need to learn another application, performing the KDD tasks in a known environment;
• Creating, for other tools like Weka and RapidMiner, the same operator developed for the KDDML tool, integrating DMPML into these tools;
• Preparing a web site with information about the DMPML language and graphical tool, containing documentation, framework, publications, source code, tutorials, screenshots, etc.

REFERENCES

[1] J. Han and M. Kamber, Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, Inc., 2006.
[2] T. M. Mitchell, Machine Learning. New York: McGraw-Hill, 1997.
[3] D. Pyle, “Data collection, preparation, quality, and visualization,” in The Handbook of Data Mining, N. Ye, Ed. Lawrence Erlbaum Associates, Inc, Publishers, 2003, p. 366.
[4] T. Imielinski and A. Virmani, “MSQL: A query language for database mining,” Data Mining and Knowledge Discovery, vol. 3, no. 4, pp. 373–408, 1999. [Online]. Available: http://portal.acm.org/citation.cfm?id=593489
[5] R. Meo, G. Psaila, and S. Ceri, “An extension to SQL for mining association rules,” Data Mining and Knowledge Discovery, vol. 2, no. 2, pp. 195–224, 1998. [Online]. Available: http://portal.acm.org/citation.cfm?id=593462
[6] J. Han, Y. Fu, W. Wang, K. Koperski, and O. Zaiane, “DMQL: A data mining query language for relational databases,” in SIGMOD’96 Workshop on Research Issues in Data Mining and Knowledge Discovery (DMKD’96), Montreal, Canada, 1996. [Online]. Available: http://citeseer.nj.nec.com/han96dmql.html
[7] T. Bray, J. Paoli, C. M. Sperberg-McQueen, E. Maler, and F. Yergeau, “Extensible markup language (XML) 1.0 (fifth edition). W3C recommendation,” November 2008. [Online]. Available: http://www.w3.org/TR/REC-xml/
[8] P. M. Gonçalves, A. L. Arnaud, and R. S. M. Barros, “DMPML - data mining preparation markup language,” IEEE/ACS International Conference on Computer Systems and Applications (AICCSA), pp. 116–125, April 2008. [Online]. Available: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4493525
[9] D. M. Group, “Predictive model markup language (PMML).” [Online]. Available: http://www.dmg.org/
[10] A. Romei, S. Ruggieri, and F. Turini, “KDDML: A middleware language and system for knowledge discovery in databases,” Data and Knowledge Engineering, vol. 57, no. 2, pp. 179–220, May 2006. [Online]. Available: http://dx.doi.org/10.1016/j.datak.2005.04.007
[11] I. Mierswa, M. Scholz, R. Klinkenberg, M. Wurst, and T. Euler, “YALE: Rapid prototyping for complex data mining tasks,” in KDD ’06: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, L. Ungar, M. Craven, D. Gunopulos, and T. Eliassi-Rad, Eds. New York, NY, USA: ACM Press, August 2006, pp. 935–940. [Online]. Available: http://doi.acm.org/10.1145/1150402.1150531
[12] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, “The WEKA data mining software: An update,” SIGKDD Explorations Newsletter, vol. 11, no. 1, pp. 10–18, November 2009.
[13] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed. San Francisco: Morgan Kaufmann, 2005.
[14] T. I. E. T. Force, “Common format and MIME type for comma-separated values (CSV) files,” October 2005. [Online]. Available: http://tools.ietf.org/html/rfc4180
[15] I. H. Witten and E. Frank, “XRFF: Extensible attribute-relation file format.” [Online]. Available: http://weka.wikispaces.com/XRFF
[16] D. Brownell, SAX2. California, CA, USA: O’Reilly & Associates, Inc., 2002. [Online]. Available: http://docstore.mik.ua/orelly/xml/sax2/
