system provides an object oriented library DISCOVER OBJECT LIBRARY (DOL) [3], devel- ... The DISCOVER system is based on the Perl programming language. ... 2http://www.icmc.usp.br/Ëbiblio/index.php?destino=relatorios_tecnicos.
An Integrated Environment for Data Mining Ronaldo Cristiano Prati Marcos Roberto Geromini Maria Carolina Monard Institute of Mathematics and Computer Sciences at University of S˜ao Paulo Av. do Trabalhador S˜ao-Carlense, 400 — Centro Cx. Postal 668 — ZIP Code 13560-970 S˜ao Carlos — S˜ao Paulo — Brazil {prati,mrgera,mcmonard}@icmc.usp.br
Abstract. In the last years the size of databases, either scientific or business-like, has grown at a fast rate imposing a need for new generation techniques and methods to help data analysis and knowledge discovery. These methods and tools are object of study of a relatively new research area, called Data Mining. Aiming to help research in this area we are developing a free software integrated environment, called D ISCOVER. This paper describes the approach we are using to integrate the components in that system as well as the development of what is considered a new user interface approach on this area.
1
Introduction
The aim of Data Mining (DM) is to extract useful information out of large data collections [8]. There are several Machine Learning (ML) algorithms that can be used in DM tasks. However, it is not possible to apply such algorithms without careful data understanding and preparation, which may often dominate the actual Data Mining activity [22]. In addition, evaluating, understanding, and interpreting the extracted knowledge are also hard activities. Although several commercial tools, such as Mineset, IBM IntelliMiner, Clementine, Oracle Darwin and others are currently available, they are too expensive for some developing countries university budgets. Furthermore, commercial tools are generally developed as closed products from the end user point of view, making it difficult for researchers to integrate other DM methods. With these in mind, a group of researchers at our laboratory — (LABIC) Computational Intelligence Laboratory1 — is working on the development of an integrated environment, called D ISCOVER, that is being used to help our research related to DM and ML. The aim of the system is to integrate the community’s most frequently used ML algorithms with the data and knowledge processing tools developed as the results of our work. The system can also be used as a workbench for new tools and ideas. This paper briefly describes the D ISCOVER (Section 2) and the approach we are using to integrate the components in that system, as well as the development of what is considered a new user interface approach for Data Mining (Section 3). Conclusions and future work are presented in Section 4. 1
http://labic.icmc.usp.br
2 2
Prati, Geromini & Monard The Discover Project
Data Mining is not an algorithmic activity but a process, i.e., there is not a recipe we should follow to obtain good results. The DM process, as shown in Figure 1, consists of several phases, that can be gathered in three main groups: 1. data processing (or pre-processing); 2. discovery of patterns from data and, 3. patterns processing (or pos-processing).
Figure 1: The Data Mining Process
This process is inherently interactive and iterative: we cannot expect to obtain useful knowledge simply by giving a lot of data to a system. The user of a Data Mining system needs to have a solid understanding of the domain in order to select the right subsets of data, suitable pattern classes, and good criteria for evaluation and selection of interesting patterns. Thus, Data Mining systems should be seen as interactive tools, not as automatic analysis systems. The D ISCOVER, a free software project, aims the development of an integrated environment offering functionalities and support for the research on all phases of the Data Mining process. Its development approach is based on a collaborative work strategy, in order to develop a set of integrated tools to supply our research necessities, as well as those of other research groups. Figure 2 shows an overall picture of the D ISCOVER functionalities.
Figure 2: The D ISCOVER system
An Integrated Environment for Data Mining
3
The D ISCOVER, similarly to MLC++ [12] and Yale [9], and opposed to Weka [25], uses learning algorithms already implemented by the ML community. The use of already implemented algorithms, which have been thoroughly tested, not only avoids re-implementation but also ensures the results correctness. The D ISCOVER environment also incorporates specific tools we have developed, such as data (and text, in the case of Text Mining) processing for supervised and unsupervised learning, sampling, error and accuracy. Even though the cited systems offer similar features, the proposed approach has a strong focus on the understanding and evaluation of symbolic knowledge, such as rule interesting, novelty and quality, as well as rules merging. One of the advantages of D ISCOVER as a research system is the unifying vision on which the system handles the objects using standard formats. The development of new tools, for both data and knowledge processing, is related to the handling of a set of common objects. For instance, a tool for data acquisition in a WEB system and a tool for data sampling, despite belonging to different tasks (data acquisition and data sampling), have in common the fact that they manipulate data. Furthermore, those objects can be represented in different formats. For instance, a date can be stored as “month/day/year” in a specific base and as the number of days counted upwards from a specific data in another base. Another example is the different formats used by the learning algorithms to represent the patterns or concepts they induce To help the development of new tools, the D ISCOVER project adopted the standard format concept to its objects. An object can be a dataset, a classifier, a rule, a measure, etc. Besides that, the use of standard formats for objects allows to have an unifying vision of those objects, facilitating their understanding. For data representation, we adopted attribute-value standard format [2] similar to the MLC++ data format, but with extensions to Text Mining, unsupervised learning, inclusion of new data types, and the construction of new attributes, i.e, attributes calculated from another (original) attributes. In order to manipulate the data in this standard format, the D ISCOVER system provides an object oriented library D ISCOVER O BJECT L IBRARY (DOL) [3], developed using software patterns concepts [10]. Besides manipulating data in the standard format, DOL provides a set of classes and functions that can be used in the development of new data processing methods. Among other facilities, DOL offers the possibility of filtering features and instances, calculating descriptive statistics and correlation, data resampling, integration with Data Base Systems, integration with several Machine Learning System (including C4.5, C4.5rules, CN2, NewId, C5.0, Cubist, SNNS, Ripper, RT, M5, SVM Torch, Weka, Cart, MineSet, Trepan, Apriori), Tomek links identification, MTree indexing, value imputation for treatment of missing attribute values, histogram and scatter plot graphic generation, artificial data generation. Two other available tools can acquire data from different sources. The first of them, called P RE T EX, is a tool developed to pre-process text for Text Mining applications. This tool efficiently decompose text into words (stems) using the bag-of-words approach. Giving a set of texts, this tool applies this decomposition step to Portuguese, Spanish or English texts in order to create a dataset in the D ISCOVER format. This approach makes text accessible to most Machine Learning algorithms that require each instance be described as a vector of fixed dimensionality. This tool also contemplates several facilities implemented in order to treat the dimensionality reduction problem, such as stop-words filtering, which is common in this kind of mining. The second available tool is a filter that converts WEB-server logs into the D ISCOVER data format [4]. With this tool is possible to pre-process log files of several
4
Prati, Geromini & Monard
servers in order to perform Web logs Mining applications. We are currently implementing semi-supervised learning algorithms [23] as well as cotraining techniques [14]. One of the goals of semi-supervised learning training classifiers when large amount of unlabelled data are available together with a small amount of labelled data. We have also defined and implemented a standard format, called PBM, for classification rules [20], based on the CN2 learning algorithm [5, 6] rule format, but with a major addition: for each rule we have the contingency matrix, which is the base for most rule evaluation measures [20, 13]. In order to manipulate the rules in this standard format, the D ISCOVER system provides another object oriented library [18, 19]. Among other functionalities, this library can translate symbolic hypothesis induced by ML algorithms most frequently used by the community, including C4.5, C4.5rules, CN2, Ripper, NewID, OC1, to PBM format. There is also a tool that, using a set of rules in PBM format and a dataset, calculates a contingency matrix to each rule. Other available tools include the calculation of rule quality measures, including support, confidence, rule accuracy, satisfaction, novelty, specifity, sensitivity, and other information, such as which rules cover an instance or which instances are covered by a rule. The PBM format is also used by the XRULER [1], a system that aims to combine symbolic classifiers in another symbolic classifier. This system uses classifiers induced from various symbolic ML algorithms transformed to the PBM set of rules format in order to merge them into a final set of rules. This final classifier is generally more robust, containing less rules and presenting a better classification performance than any of the individual previously induced classifiers. The D ISCOVER also have a Rule Exploration Environment, called RulEE [16], which aims to facilitate, by the user, the exploration, analysis, and evaluation of the generated patterns. There is also defined and implemented a standard format for regression rules [21]. Some researches at our laboratory are working on the development of a class library to manipulate this standard format. Other research includes the definition of standard formats for association rules and unsupervised ML representation. The D ISCOVER system is based on the Perl programming language. Perl provides a regular expression engine, which is a powerful sub language that can perform complex text manipulation and data extraction tasks. Perl can also handle large amounts of data efficiently and is flexible with regard to data types, since it is based on general lists and dictionaries (association arrays), of which the latter are implemented as very efficient hash-tables. We are also using MySQL, one of the most popular open source database systems, to underlie the component database systems dependencies. We intend the D ISCOVER to work as a workbench in order to help Data Mining research. The advantage of the D ISCOVER as a research system is that the user has total control on the planing and execution of experiments. Such control is given by a powerful, yet flexible, user-centered interface, which can be considered as a new approach for this kind of system [11]. This interface is being developed under rigorous usability engineering methods and theories, following the Usability Engineering Lifecycle [15]. This GUI allows the user to achieve a totally graphical manipulation of the components of an experiment, something that has only recently been used for DM, and we consider should be better explored. The next section briefly describes the approach we are using to integrate the already implemented (and the new ones) components, as well as the
An Integrated Environment for Data Mining
5
D ISCOVER Graphical User Interface. 3
Integrating Components
As already mentioned, the D ISCOVER is an integrated environment build up of classes libraries. Classes are encapsulated as components, and the components composition is done throughout a Graphical User Interface. Although bringing some advantages, such as an easier user interface and a performance enhancement, this approach raises new problems related to the system integration. In order to D ISCOVER be an integrated environment, we need to treat many issues, such as the development process management, communication between the developers, and components interactions, like exception handling and others. In addition to that, we have to consider some architectural tradeoffs in order to develop a system that reflects our needs. The system should be extensible such that several tools, which amount to dozen of thousands lines of Perl code2 , that have been implemented by LABIC’s students as part of their M.Sc. and Ph.D. works3 , as well as future tools, can easily be incorporated into the system. Thus, the system’s architecture should impose strong interaction patterns that allow an efficient integration. In other words, we need a method to easily integrate components and a set of processes that holds and forms an integrated environment. To solve this problem we have adopted as a solution the development of a component integration framework [24], which contains a set of rules to be obeyed by components in a certain environment. These rules form a base on which multiple components can coexist in a single environment. Still, it should manage shared resources and provide communication between extension components. The framework has as its main part a set of component interface definitions (also known as contracts). The interfaces are structured in an inheritance hierarchy. This automatically provides the system with a modular architecture. New components can be added to the framework if they implement one of the interfaces (fulfill one of the contracts). Our proposed solution can also incorporate some other benefits inherited from the objectoriented framework approach [7], such as: modularity, reusability, extensibility and inversion of control. To compound these components the user uses the D ISCOVER Graphical User Interface [11]. In this GUI each component (e.g. datasets, inducers, data and knowledge processing, etc.) is represented as a group of possible instances that can be drag and dropped into a working area and linked altogether in order to set up an experiment. Once a valid experiment has been graphically configurated by the user, it needs to be represented on such a way that the framework can understand and interpret. Our approach has been to develop a DM process representation language using XML (Extensible Markup Language) [17]. This language acts as a translator (or interface) between the DM process user model and its real implementation within the system’s components at the integration framework level. With such language we can represent many process steps, each one with its specificities, on an unique and uniform way. Such standardization is an ideal way to establish the com2
http://www.icmc.usp.br/˜biblio/index.php?destino=relatorios_tecnicos. htm, technical reports: 33, 68, 81, 94, 101, 116, 121, 122, 133, 134, 135, 137, 139, 143, 145, 154, 155, 158, 159, 160, 161, 170, 173, 175, 183. 3 www.teses.usp.br/unidades.php?codUnidade=55
6
Prati, Geromini & Monard
munication between the GUI and the framework modules or components. The process constructed by the user using the GUI is then “translated” to this language, keeping all personal configurations of each step. Thus, it can be saved on a file allowing its later usage through the environment, modifying or executing it. Figure 2 shows a conceptual vision of the D ISCOVER system. In this vision, the framework acts as a Fac¸ade pattern [10] between the GUI and the DM components, providing a simplified interface to the complexity of the systems components.
Figure 3: A conceptual vision of the D ISCOVER system
4
Conclusions and Future Work
The D ISCOVER is a project driven by the needs of a group of researchers at the Computational Intelligence Laboratory (LABIC). With this system we try to improve and facilitate routine tasks in DM, TM and ML research, with an emphasis on design and interpretation of experiments. Using a flexible and easy-to-use tool can not only facilitates the DM activity, but also helps to unify the data and knowledge manipulation and to integrate different ML algorithms through a common interface. The fact of being a free software is another major characteristic of the D ISCOVER. The D ISCOVER currently includes tools and functionalities supporting the classification problem. Ongoing work at our laboratory includes implementations for association and regression rules as well as clustering, among others. References [1] J. A. Baranauskas. Combining symbolic classifiers from multiple inducers. Knowledge Based Systems, 16(3):129–136, 2003. [2] G. E. A. P. A. Batista. Data pre-processing in Supervised Machine Learning. PhD thesis, ICMC-USP, may 2003. In portuguese.
7
An Integrated Environment for Data Mining
[3] G. E. A. P. A. Batista and M. C. Monard. Design and architecture description of the D ISCOVER L EARNING E NVIRONMENT. Technical Report 187, ICMC/USP, 2003. in portuguese ftp://ftp.icmc.usp.br/ pub/BIBLIOTECA/rel_tec/RT_187.PDF. [4] R. Chiara and M. C. Monard. Design and implementation of a filter to convert web server logs into the D ISCOVER standard data format. Technical Report 183, ICMC-USP, 2003. in portuguese. ftp: //ftp.icmc.usp.br/pub/BIBLIOTECA/rel_tec/RT_183.pdf. [5] P. Clark and R. Boswell. The CN2 induction algorothm. Machine Learning, 3(4):261–283, 1989. [6] P. Clark and R. Boswell. Rule induction with CN2: Some recent improvements. In Procedings of the Fifth European Conference, pages 151–163, Springer Verlag, 1991. Y. Kodratoff. http://www.cs. utexas.edu/users/pclark/papers/newcn.ps. [7] M. Fayad and D. C. Schmidt. Object-oriented application frameworks. Communications of the ACM, 40(10):32–38, october 1997. Guest editorial. [8] U. M. Fayyad, G. Piatestsky-Shapiro, and P. Smyth. From Data Mining to Knowledge Discovery: An Overview, pages 1–30. AAAI/MIT Press, 1996. [9] S. Fischer, R. Klinkenberg, I. Mierswa, and O. Ritthoff. Yale: Yet another learning environment - tutorial. Technical Report CI-136/02, University of Dortmund, 2002. [10] E. Gamma, R. Helm, R. Johnson, and J. Vlissides. Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley, 1994. [11] M. R. Geromini. D ISCOVER graphical user interface design and implementation. ICMC/USP S˜ao Carlos, 2003. In portuguese.
Master’s thesis,
[12] R. Kohavi, D. Sommerfield, and J. Dougherty. Data mining using MLC++: A machine learning library in C++. International Journal on Artificial Intellegence Tools, pages 234–245, 1997. [13] N. Lavraˇc, P. Flach, and B. Zupan. Rule evaluation measures: A unifying view. Lecture Notes in Artificial Inteligence, (1634):174–185, 1999. [14] E. T. Matsubara. Semi-supervised learning using co-training. Master’s thesis, ICMC/USP S˜ao Carlos, 2003. in portuguese. [15] D. J. Mayhew. The Usability Engineering Lifecicle: A Practicioner’s Handbook for User Interface Design. Morgan Kaufmann, San Francisco, CA, 1999. [16] M. F. Paula. An environment for rule exploration. Master’s thesis, ICMC/USP, 2003. in portuguese. [17] R. C. Prati. The D ISCOVER component integration framework. Master’s thesis, ICMC/USP S˜ao Carlos, april 2003. In portuguese. [18] R. C. Prati, J. A. Baranauskas, and M. C. Monard. Extration of standardized information for the evaluation of rules induced from symbolic machine learning algorithms. Technical Report 145, ICMC-USP, june 2001. in portuguese, ftp://ftp.icmc.sc.usp.br/pub/BIBLIOTECA/rel_tec/RT_145. ps.zip. [19] R. C. Prati, J. A. Baranauskas, and M. C. Monard. A proposal of unification of the representation language of rules induced from symbolic machine learning algorithms. Technical Report 137, ICMC-USP, march 2001. in portuguese, ftp://ftp.icmc.sc.usp.br/pub/BIBLIOTECA/rel_tec/RT_137. ps.zip. [20] R. C. Prati, J. A. Baranauskas, and M. C. Monard. Standardization of syntax and information on rules induced from symbolic machine learning algorithms. Revista Eletrˆonica de Iniciac¸a˜ o Cient´ıfica (REIC), 2(3), september 2002. in portuguese, http://www.sbc.org.br/reic/edicoes/2002e3. [21] J. B. Pugliesi, D. G. Dosualdo, and S. O. Rezende. The D ISCOVER regression rules standard format. Technical Report 193, ICMC/USP, 2003. in portuguese. [22] D. Pyle. Data Preparation for Data Mining. Morgan Kaufmann Publishers, Inc., 1999.
8
Prati, Geromini & Monard
[23] M. K. Sanches and M. C. Monard. Automatic instances labelling using few labelled instances and supervised and unsupervised machine learning algorithms. In WTDIA’02 Workshop de Teses e Dissertaes em IA, 2002. In portuguese. SBIA’02 Brazilian Symposium on Artificial Intelligence. [24] W. Weck. Independently extensible component frameworks. In M. Muehlhaeuser, editor, Proceedings of the 1st International Workshop on Component-Oriented Programming, pages 177–183, Heidelberg, June 1997. dpunkt Verlag. In Special Issues in Object-Oriented Programming –Workshop Reader of the 10th European Conference on Object-Oriented Programming ECOOP’96. [25] I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, volume 1. Morgan Kaufmann, october 1999. http://www.cs.waikato.ac.nz/ ml/weka/index.html.