Describing the Data Mining Process with DMSL Petr Kotásek and Jaroslav Zendulka Faculty of Information Technology, Brno University of Technology, Božetěchova 2, 612 66 Brno, Czech Republic {kotasekp,zendulka}@fit.vutbr.cz
Abstract. The state of the art in the domain of knowledge discovery in databases (KDD) and data mining (DM) has reached the point where the existence of various languages is becoming highly desirable. This paper presents an XML-based language called DMSL (Data Mining Specification Language). Its purpose is to provide the framework for platform-independent definition of the whole data mining process, and exchange and sharing of DM projects among different applications, possibly operating in heterogeneous environments. We assume that the reader is familiar with the notions of XML, knowledge discovery in databases, and data mining.
1 Introduction
To date, several data mining languages have been developed. Most of them can be characterized as data mining query languages, because their main objective is to support ad hoc and interactive data mining. The Data Mining Query Language (DMQL), used in the DBMiner system [1, 2], is one of them. It is a language with an SQL-like syntax that specifies the set of data relevant to a data mining task, the kind of knowledge to be discovered, background knowledge, the type of visualization and presentation, and various thresholds that depend on the kind of knowledge to be discovered. There is an analogy with the field of relational databases: just as SQL resulted from a standardization effort in database query languages, there is a trend to develop standards for some parts of the data mining process. Probably the two best-known initiatives in this field are the Predictive Model Markup Language (PMML) [3], developed by the Data Mining Group (DMG), and Microsoft's OLE DB for Data Mining [4]. The two initiatives pursue different goals. PMML gives applications a vendor-independent method of defining data mining models, which allows the exchange of models between compliant applications. OLE DB for Data Mining, on the other hand, aims to be an industry standard for data mining so that data mining algorithms from various vendors can be easily plugged into user applications. Since OLE DB for Data Mining makes it possible to create a mining model from PMML, the two approaches can be combined. In the field of languages for mining association rules, MSQL [5] and the MINE RULE [6] operator are interesting approaches.
Y. Manolopoulos and P. Návrat (Eds): ADBIS 2002, pp. 131-140, 2002.
This paper presents DMSL. Unlike PMML, it describes not only the output of the data mining process (the knowledge) but also provides the vocabulary to describe the overall data mining process. The goal of the paper is not to provide an exhaustive definition of DMSL, but rather to show its philosophy and compare it with PMML. Throughout this paper, we use the term data mining as a synonym for knowledge discovery in databases; under another interpretation, data mining is seen as one step in the knowledge discovery process. The paper is structured as follows. Section 2 presents the major characteristics of PMML that are used later for the comparison with DMSL. The DMSL language (its motivation, structure, and features) is presented in Section 3, where the comparison with PMML is made throughout. Section 4 summarizes the current state and future work, and Section 5 summarizes the main contributions and concludes.
2 PMML
PMML is an XML-based language that provides a way for companies to define mining models and share them between compliant applications. Figure 1 shows the major parts of a PMML document and their most important relationships. DataDictionary contains definitions of the fields used in mining models. It specifies their types and value ranges. These definitions are assumed to be independent of specific data mining models. TransformationDictionary contains definitions of transformations of input values, such as normalization of numbers to the range [0..1] or discretization of continuous fields. The fields defined in DataDictionary and TransformationDictionary can be used in mining models. Each PMML document can include several mining models sharing DataDictionary and TransformationDictionary. The PMML language defines vocabulary for several kinds of mining models (or kinds of knowledge): association rules, neural networks, tree classification and prediction structures, clustering, naive Bayesian classification, regression, and sequence representation. In general, a mining model consists of MiningSchema, which lists the fields used in that model. This is a subset of the fields defined in DataDictionary and TransformationDictionary. In addition, the model contains ModelStatistics characterizing the training set used for building the model. Moreover, each model contains other model-specific elements, for example items, itemsets, and rules for the association rules model. PMML is supported by some data mining tools, mainly those provided by members of DMG, e.g. Intelligent Miner for Data by IBM.
[Figure 1: a PMML document contains DataDictionary, TransformationDictionary, and one or more mining models. Each mining model contains ModelStatistics, MiningSchema, and model-specific elements (TreeModel, NeuralNetwork, ClusteringModel, RegressionModel, GeneralRegressionModel, NaiveBayesModel, AssociationModel, SequenceMiningModel).]
Fig. 1. The major elements of a PMML document
3 DMSL
3.1 Motivation
Data mining is a process that takes place in heterogeneous, modular environments. Generally speaking, platform-independent languages and formats are needed to exchange various kinds of information in such environments. In the field of data mining, there is no language capable of describing the whole process. In particular, there is no language for capturing the data preparation and transformation step. Existing studies and approaches concentrate on data mining techniques and algorithms, knowledge semantics, optimization techniques, post-processing, pruning strategies, etc. As for data preparation and transformation, they limit themselves to the notion of 'getting the task relevant data by an SQL-like query', 'defining a mineable view', etc. Data is assumed to be prepared, and all that has to be done is to take the task-relevant portion. We believe that data preparation and transformation is the most important step in the whole process: valuable knowledge can be obtained only from data that exposes its semantic content in the right way. This can very rarely be said about the initial data, so preparation and transformation must take place to expose the semantics of the data to the miner and to the data mining algorithm. DMSL represents an attempt to introduce a language for the description of the overall data mining process, with the emphasis on the data preparation and transformation step. Data mining has been around for quite a long time, and the state of the art has reached the point where an effort towards such a language has a good chance of succeeding: we know enough about the domain to be able to identify its general concepts, common aspects, basic features, essential primitives, etc., and to describe them by a language.
3.2 The Structure of a DMSL Document
Since DMSL is XML-based, the language specification has the form of an XML DTD, and a data mining project described in DMSL has the form of a DMSL document.
The basic structure of an illustrative DMSL document carrying information about a single data mining project is depicted in Figure 2. The picture shows how different parts of a DMSL document refer to (depend on) each other. These references are represented by arrows. DMSL identifies five main primitives that play the major roles in the data mining process; these primitives are represented by five sections of a DMSL document:
− The data model
− The data mining model
− The domain knowledge
− The description of the data mining task
− The knowledge
Thus, each DMSL document can carry the complete information about a single data mining project. Primarily, DMSL is intended to capture all the information relevant to the whole data mining project, and to enable applications to exchange and share projects. The notion of 'data mining project' refers to a particular instance of the data mining process. On the other hand, a DMSL document can also serve as a mere transportation container and carry different parts of different DM projects at the same time. The DataModel element represents a data schema that defines the shape of the initial input data to be mined. The DataMiningModel element represents a data mining model that defines transformations of the initial input data defined by DataModel. (PMML uses the term 'mining model' to refer to the result of the data mining process, i.e. the knowledge.) The DataMiningTask element specifies a DM task over a data mining model. The task can employ domain knowledge that is described by the DomainKnowledge element. The result of the data mining is held by the Knowledge element. All the elements are explained in more detail later. If we compare DMSL to PMML, we can see some structural similarities, but there are important differences resulting from the objectives of the two languages, which are wider in the case of DMSL: PMML describes the result of the data mining process, the knowledge (called 'mining model' by PMML), while DMSL aims to capture the description of the whole data mining process, including the steps that led to the resulting knowledge. Nevertheless, both languages share a common goal, namely to enable exchange and sharing of what they describe: knowledge in the case of PMML, and whole data mining projects in the case of DMSL.
3.3 Data Model
The data model is the starting point. It provides the platform-independent schema of the physical data source to be mined, together with other data mining specific information.
The DataModel element specifies the data model as a collection of data matrices consisting of data fields.
<!ELEMENT DataModel (DataMatrix+)>
<!ATTLIST DataModel name CDATA #REQUIRED>
<!ELEMENT DataMatrix (DataField+, MatrixTreatment*)>
<!ATTLIST DataMatrix name CDATA #REQUIRED>
<!ELEMENT DataField (FieldProperties)>
<!ATTLIST DataField name CDATA #REQUIRED>
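As a rough illustration of the content model quoted above, the following Python sketch parses a small hand-written DataModel fragment and checks it against those rules: at least one DataMatrix, at least one DataField per matrix, exactly one FieldProperties per field, and a required name attribute on each. The fragment itself, the attribute names on FieldProperties (dataType, dataForm), and the helper check_data_model are assumptions made for this example only; the paper does not give the FieldProperties declaration, and this is not a DTD validator.

```python
import xml.etree.ElementTree as ET

# Hypothetical DMSL fragment. The FieldProperties attributes are assumed
# names; the paper does not show how these properties are encoded.
DOC = """
<DataModel name="shop">
  <DataMatrix name="transactions">
    <DataField name="customer_id">
      <FieldProperties dataType="integer" dataForm="scalar"/>
    </DataField>
    <DataField name="items">
      <FieldProperties dataType="string" dataForm="set"/>
    </DataField>
  </DataMatrix>
</DataModel>
"""


def check_data_model(root):
    """Return a list of violations of the content model quoted in the text."""
    if root.tag != "DataModel":
        return [f"root element is {root.tag}, expected DataModel"]
    errors = []
    if "name" not in root.attrib:
        errors.append("DataModel lacks required attribute 'name'")
    matrices = root.findall("DataMatrix")
    if not matrices:
        errors.append("DataModel needs at least one DataMatrix")
    for matrix in matrices:
        label = matrix.get("name", "?")
        if "name" not in matrix.attrib:
            errors.append(f"DataMatrix '{label}' lacks required attribute 'name'")
        fields = matrix.findall("DataField")
        if not fields:
            errors.append(f"DataMatrix '{label}' needs at least one DataField")
        for field in fields:
            if "name" not in field.attrib:
                errors.append(f"a DataField in '{label}' lacks 'name'")
            if len(field.findall("FieldProperties")) != 1:
                errors.append(f"a DataField in '{label}' needs one FieldProperties")
    return errors


errors = check_data_model(ET.fromstring(DOC))
print(errors or "fragment satisfies the quoted content model")
```

The "items" field uses the set data form to hint at how an unnormalized, transaction-style matrix might look; the element order inside DataMatrix is not checked here.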
[Figure 2: diagram of a DMSL document. DataModel contains DataMatrix elements; DataMiningModel contains DataMiningMatrix elements made up of DataMiningField elements; DomainKnowledge contains Conceptual Hierarchy elements; DataMiningTask contains tasks such as Mine Association Rules and Mine Classification; Knowledge contains results such as Association Rules and Decision Tree. Arrows denote references between the parts.]
Fig. 2. The major elements of a DMSL document
For each data field, certain properties (data type, data form, granularity, and data scale) must be defined. DMSL supports integer, real, string, and datetime data types. Data form can be either scalar or set. The latter allows the matrix to be unnormalized. The data scale property makes it possible to specify the scale on which scalar data field values, or values of set data field items, are measured. The following scales are available: nominal, categorical, ordinal, interval, and ratio. The granularity can be constant, binary, discrete, or continuous. All this information is stored in the FieldProperties element. The MatrixTreatment element is extremely important. It provides for explicit definition of domain values to be interpreted as valid, invalid, missing, empty, or outlier, and specifies how to treat these values and the supported types of nulls when they are used as an input to some transformation process, either when computing a new matrix from existing ones or when using a matrix as an input to a data mining task. For instance, it is possible to ignore values, replace values with other values, etc. DMSL sees data as a set of matrices (or tables), where columns hold field values and rows represent tuples (instances). The analogy between this view and the relational data model is obvious, but there is one important difference: DMSL matrices need not be normalized. DMSL allows a set of values in a field of a matrix, which can be useful, for example, for representing transactional data.
3.4 Data Mining Model
The data mining model provides the miner with the functionality to transform the initial data (as described by the data model) into whatever shape is needed for data mining. This is where everything about the data preparation and transformation is stored. As data preprocessing is the most important step of the data mining process, the data mining model can deservedly be called the heart of DMSL.
The data mining model represents a schema extension of the underlying data model. If no transformations of the initial data are needed, the data mining model is simply empty. Similarly to the data schema, the data mining schema is represented by data mining matrices, consisting of data mining fields.
There are two transformation mechanisms supported by DMSL:
1. Data mining matrices are created from existing data matrices and data mining matrices using SQL-like queries. DMSL does not introduce any specific language for this purpose. The statement must be a table expression (e.g., an SQL SELECT statement) that is understood by the application processing the document and convertible to a syntax understood by the application managing the underlying physical data source (typically, some DBMS).
2. Various DMSL-supported scalar transformations (mappings) can be applied to existing data fields and data mining fields to create new data mining fields. Currently, DMSL supports 7 basic kinds of scalar expressions. That is, only scalars can be used as arguments to these expressions, and the expressions can produce only scalar values. No transformations of set fields are currently supported. The supported scalar expressions can be combined to build the required mapping for a field. They are: a scalar value, a reference to a scalar field, a unary operation, a binary operation, linear normalization, a general function, and an external function.
The concept of general function (represented by the GenFunction element) allows for defining arbitrary n-ary functions by listing (n+1)-tuples specifying the function's mapping. Thus, it can be used to define typical data mining transformations like numeration of non-numeric values, discretization of continuous ranges, binning, etc. The concept of external functions is realized by the generic ExtFunction element. It is provided for the definition of custom functions, thus allowing complex mappings of any kind. The current version of DMSL neither specifies nor recommends which functions must be supported by applications processing DMSL documents. Although we do not have enough space to go into details, a comparison of PMML and DMSL in terms of data transformations suggests that DMSL is significantly more expressive than PMML.
The DataDictionary of PMML can define only a set of fields used by the mining model. This corresponds to a single data or data mining matrix of DMSL. The transformation power of the TransformationDictionary is also limited compared to the mappings supported by DMSL.
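To make two of the scalar expression kinds from Section 3.4 more tangible, here is a minimal Python sketch of linear normalization and of a general function defined by listing (n+1)-tuples. It is only an interpretation of the prose: the function names, the tuple layout (n argument patterns followed by the result), and in particular the use of half-open (low, high) intervals as argument patterns for discretization are assumptions of this example, not the GenFunction syntax of DMSL.

```python
def linear_normalization(low, high):
    """Return a mapping that sends [low, high] linearly onto [0, 1]."""
    if high == low:
        raise ValueError("normalization range must not be empty")
    return lambda x: (x - low) / (high - low)


def matches(pattern, value):
    """Assumed pattern semantics: an exact value, or a half-open (low, high) interval."""
    if isinstance(pattern, tuple):
        low, high = pattern
        return low <= value < high
    return pattern == value


def gen_function(rows):
    """Build an n-ary function from (n+1)-tuples: n argument patterns, then the result."""
    def apply(*args):
        for *patterns, result in rows:
            if len(patterns) == len(args) and all(
                matches(p, a) for p, a in zip(patterns, args)
            ):
                return result
        raise KeyError(f"no tuple covers arguments {args!r}")
    return apply


# Numeration of non-numeric values: a unary function listed as 2-tuples.
encode_risk = gen_function([("low", 0), ("medium", 1), ("high", 2)])

# Discretization of a continuous range, using the assumed interval patterns.
age_band = gen_function([((0, 18), "child"), ((18, 65), "adult"), ((65, 150), "senior")])

# Linear normalization of an amount known to lie between 0 and 200.
normalize_amount = linear_normalization(0, 200)

print(encode_risk("medium"), age_band(30), normalize_amount(50))
```

Because each mapping takes scalars and returns a scalar, such functions can be nested to build a composite mapping for a field, which mirrors the statement that the supported scalar expressions can be combined.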
3.5 Valid, Invalid, Missing, Empty, and Outlier Values
Valid, invalid, missing, empty, and outlier values (to save space, we will refer to these as VIMEO values) play a crucial role in data mining. Although handling VIMEO values is quite a complex problem, DMSL supports it in an elegant and efficient way. Generally speaking, each domain value carries an interpretation tag (called the VIMEO tag) that identifies how to interpret the value. A value can be assigned its VIMEO tag in two ways: either by the explicit definition of VIMEO values in the MatrixTreatment element, or by an implicit mechanism that is based on the semantics of transformation operations. All the domain values of all the data matrices are initially implicitly interpreted as valid. If a new value is computed, say, from two valid values, then it is interpreted as valid too. If one of the values is invalid, the result must be invalid too, and so on. The full description of the semantics of VIMEO values and their VIMEO tags is beyond the scope of this paper; for the detailed semantics refer to [7]. Again, let us only mention that the range of VIMEO value management implemented by DMSL is much broader than that of PMML.
3.6 Domain Knowledge
Domain knowledge is represented by the DomainKnowledge element.
It can carry any possible content. This philosophy of arbitrary content is also applied to the definitions of data mining tasks and to the resulting knowledge itself. The reason is that many vocabularies exist for the definition of these primitives, and DMSL should not prevent their users from using them. Of course, together with DMSL, we also define our own XML languages for the description of these primitives that are tailored for use with DMSL; they can be seen as add-on plug-in modules. Nevertheless, if users do not want to give up what they are used to, but want to make use of the transformation power of DMSL, this open architecture supports such an approach. Regarding domain knowledge, we have so far defined an XML vocabulary for concept hierarchies.
3.7 Data Mining Task
Data mining tasks, represented by a specific data mining query in a specific data mining language, are carried by the DataMiningTask element.
So far, we have defined XML languages that accompany DMSL for mining association rules, classification, concept description, and clustering. If a user prefers to use any other language, say, MINE RULE for mining association rules, the corresponding query should appear here. Regarding probably the most frequently addressed kind of knowledge, association rules, our add-on language supports mining of multidimensional, multiple-level (exploiting concept hierarchies) association rules, definition of meta-rules, clustering of body and head items, etc.
3.8 Knowledge
The Knowledge element contains the result of a data mining task, i.e. the knowledge.
If the user is familiar with representing knowledge in PMML, this is the place where a PMML document could appear as the result of the data mining process. Nevertheless, we also define our own language for knowledge representation. Unlike PMML, it does not introduce a special vocabulary for each knowledge type (trees, clustering, etc.), but rather uses a general table structure and specifies semantic constraints on how the respective kinds of knowledge should be represented. Still, these constraints are recommendations rather than requirements, as the semantics of knowledge representation is rich and always evolving. The table structure should be able to accommodate anything that is needed.
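Before moving on, the implicit VIMEO tagging mechanism of Section 3.5 can be illustrated with a small Python sketch. It implements only the two rules stated in that section (two valid operands give a valid result; an invalid operand gives an invalid result). All names here (Tag, Tagged, combine_tags, apply_binary) are hypothetical, and the remaining tag combinations are deliberately left unimplemented, because their semantics are defined in [7] and not reproduced in this paper.

```python
import operator
from dataclasses import dataclass
from enum import Enum


class Tag(Enum):
    VALID = "valid"
    INVALID = "invalid"
    MISSING = "missing"
    EMPTY = "empty"
    OUTLIER = "outlier"


@dataclass(frozen=True)
class Tagged:
    """A domain value together with its VIMEO tag; values start out valid."""
    value: object
    tag: Tag = Tag.VALID


def combine_tags(a, b):
    """Tag of a value computed from two operands, per the two stated rules."""
    if a is Tag.VALID and b is Tag.VALID:
        return Tag.VALID
    if Tag.INVALID in (a, b):
        return Tag.INVALID
    # Other combinations are specified in the DMSL documentation [7] only.
    raise NotImplementedError(f"rule for {a.value}/{b.value} is not given here")


def apply_binary(op, x, y):
    """Apply a binary scalar operation and propagate the VIMEO tag."""
    tag = combine_tags(x.tag, y.tag)
    # Assumption of this sketch: a non-valid result carries no usable value.
    value = op(x.value, y.value) if tag is Tag.VALID else None
    return Tagged(value, tag)


total = apply_binary(operator.add, Tagged(2), Tagged(3))
broken = apply_binary(operator.add, Tagged(2), Tagged(None, Tag.INVALID))
print(total, broken)
```

An explicit MatrixTreatment definition would correspond to constructing Tagged values with a non-default tag before any transformation is applied.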
4 Current State and Future Work
The idea of using XML technology in the context of the data mining process was introduced in [8]. Since then, the core of DMSL has been specified, and the architecture of a data mining system that uses DMSL has been proposed. It is depicted in Figure 3. The system consists of several modules that use and modify the respective parts of DMSL documents. The Data Access module is responsible for retrieving data required by other modules from external physical data sources, or from an internal database, where data described by the corresponding data mining models can be stored. The Model Definition module is responsible for the definition of data models and data mining models, and the DM Task Definition module for the definition of data mining tasks. The DM Algorithm module performs the actual mining. The results of mining tasks are evaluated and visualized by means of the Knowledge Explorer module. It can be seen from the picture that different kinds of physical data sources are supported. Although this first version of DMSL is primarily meant to support data mining from relational databases, it is important to stress that DMSL is not dependent on the underlying physical data sources: as long as the physical data source is based on (or convertible to) a matrix representation, it does not matter whether it is a relational database table or a text file, because DMSL sees everything as a matrix. Two prototype applications were implemented as M.Sc. theses: an application that integrates the Model Definition, DM Task Definition, and simplified Data Access modules and generates DMSL documents, and a data mining application that uses DMSL documents to mine multiple-level association rules.
[Figure 3: architecture diagram. Data sources (database, text file, XML file) supply data to the Data Access module, which uses SQL queries, file system operations, etc. The data mining system components are Data Access, Model Definition, DM Task Definition, DM Algorithm, and Knowledge Explorer; they exchange data and internal queries. The modules generate and use the parts of a DMSL document: DataModel, DataMiningModel, DataMiningTask, DomainKnowledge, and Knowledge. DM Algorithm generates Knowledge, and Knowledge Explorer uses Knowledge.]
Fig. 3. The proposed DM system architecture
Future work will mainly address the following topics:
− Extending DMSL to support
  • more types of data mining tasks and knowledge (through the add-on languages),
  • mining data warehouses and complex types of data.
− Implementing the complete DM system of Figure 3. This encompasses:
  (a) definition of the internal query language for data and information exchange between modules of the DM system. This will be an SQL-like language, probably XML-based. It will also be usable for the creation of data mining matrices in data mining models.
  (b) definition of the internal data format for data being transferred between the respective modules of the system. It should be XML-based.
5 Conclusion
We see one important potential contribution that DMSL can bring to the field of data mining. The proposed open modular structure of DMSL (and the related data mining system architecture) should be able to accommodate a wide range of data mining problems. To our knowledge, there is no similar language for the description of the data mining process with such a scope. This approach can facilitate platform-independent data mining. The novelty and main power of DMSL lie in its ability to describe the data preparation and transformation process, that is, the flow from the original source data to the data suitable for data mining. The most significant prospects brought by DMSL are:
− Vocabulary for the description of the data transformation and preparation step, which is the most crucial part of every data mining project. Moreover, this vocabulary is supported by an underlying formalism that captures everything that happens in data and data mining models: matrix and field formation, VIMEO values, etc. This formalism is an important step towards a formal framework for the data preparation and transformation process.
− The open architecture of DMSL. The only mandatory parts of a DM project that must be defined by DMSL are the data and data mining models. Although DMSL is accompanied by add-on vocabularies for the definition of domain knowledge, data mining tasks, and knowledge (and the user is encouraged to use them), any other existing languages can be used to specify these primitives (e.g., PMML, MINE RULE, etc.).
− Exchangeability and shareability: different applications on different platforms can cooperate on and share data mining projects. The concept of platform independence is extended from knowledge (as implemented by PMML) to whole data mining projects.
References
1. Han, J., Fu, Y., Koperski, K., Wang, W., Zaiane, O.: DMQL: A Data Mining Query Language for Relational Databases, SIGMOD'96 Workshop on Research Issues on Data Mining and Knowledge Discovery, Montreal, Canada, June 1996.
2. Han, J., Kamber, M.: Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, 2001.
3. An Introduction to PMML 2.0. (available at http://www.dmg.org/).
4. OLE DB for Data Mining Specification Version 1.0. Microsoft Corporation. July 2000. (available at http://www.microsoft.com/data/oledb/dm.htm).
5. Imielinski, T., Virmani, A.: MSQL – A Query Language for Database Mining, Data Mining and Knowledge Discovery, 3(4), Kluwer Academic Publishers, December 1999, 373-408.
6. Meo, R., Psaila, G., Ceri, S.: A New SQL-like Operator for Mining Association Rules. In: Vijayaraman, T. M., Buchmann, A. P., Mohan, C., Sarda, N. L. (eds.): Proc. of 22nd International Conference on Very Large Data Bases, Morgan Kaufmann, 1996, 122-133.
7. DMSL documentation. (available at http://www.fee.vutbr.cz/~kotasekp/dmsl).
8. Kotásek, P., Zendulka, J.: An XML Approach to Knowledge Discovery in Databases, In: Knowledge-Based Software Engineering, IOS Press, Ohmsha, 2000, 141-148.