Fuzzy Data Mining from Multidimensional Databases

Anne Laurent, Bernadette Bouchon-Meunier, Anne Doucet, Stephane Gancarski, and Christophe Marsala

LIP6, mailbox 169, Universite Pierre et Marie Curie - Paris 6, 4 place Jussieu, 75252 Paris Cedex 05, France
E-mail: [email protected]
Abstract. Most existing learning systems work on data stored in poorly structured files. This approach prevents them from dealing with real-world data, which are often heterogeneous and massive and which require database management tools. In this article, we propose an original solution to data mining which integrates a fuzzy learning tool that constructs fuzzy decision trees with a multidimensional database management system.
1 Introduction

Fuzzy decision tree based methods provide good tools to discover knowledge from data. They are equivalent to a set of if-then rules and are declarative, since the classification they propose may be explained. Moreover, the use of fuzzy set theory allows numerical values to be treated in a more natural way. However, most existing solutions to construct decision trees use files, and it is well-known that this approach is reasonable only if the amount of data used for knowledge discovery is rather small (e.g. fits in core memory). Often, these methods are not appropriate for data mining, which aims at discovering non-trivial knowledge from very large real-world data stored in data warehouses, which require specific tools such as database management systems (DBMS). When data are stored in a DBMS, there is a need to efficiently retrieve the data relevant to data mining processes based on inductive learning methods. This is why the integration of inductive learning algorithms is a crucial point in order to build efficient data mining applications. On-Line Analytical Processing (OLAP) provides means to implement solutions for extracting relevant knowledge (e.g. aggregated data values) from databases. However, it has been shown that the relational data model does not fit the OLAP approach, and a new model has emerged: multidimensional databases [5]. Multidimensional databases and OLAP provide means to execute very complex queries, working on large amounts of data that are processed at an aggregated level, not at the individual level of records. This kind of database offers performance advantages, especially for multidimensional aggregate computations, which are highly interesting for automatic inductive learning such as, for instance, decision tree construction from data.
In this paper, we propose a new approach to perform OLAP-based mining using a multidimensional DBMS and fuzzy decision trees. In Sections 2 and 3, we recall basic notions regarding multidimensional databases, OLAP, and fuzzy decision trees. In Section 4, we present our system for constructing fuzzy decision trees from a multidimensional database by means of a DBMS. In Section 5, we highlight the interest of using fuzzy decision trees in a data mining process. The implementation and the results obtained on a large database are presented in Section 6, and we conclude with some future work.
2 Multidimensional databases and OLAP

The relational model of data provides efficient tools to store and deal with large amounts of data at an individual level. The term OLAP was proposed by Codd [5] for the kind of technologies providing means to collect, store, and deal with multidimensional data, with a view to deploying analysis processes. In the OLAP framework, data are stored in data hypercubes (simply called cubes). A cube is a set of data organized as a multidimensional array of values representing measures over several dimensions. Hierarchies may be defined on dimensions to organize data on more than one level of aggregation. A cube B is defined by means of m dimensions, each dimension i associated with a domain D_i of values, and by a set of elements S_B(d_1, ..., d_m) (with d_i in D_i, i = 1, ..., m) belonging to a set V of values. Algebraic operations have been defined on hypercubes (e.g. [1]) in order to visualize and analyze them: roll up, drill down, slice, dice, rotate, switch, split, nest, push, and more conventional operations such as join, union, etc. Consolidation is used in order to speed up queries. It consists in precomputing all or part of the cube with an aggregation function: some of the most expensive computations are made before the query, and their result is stored in the database in order to be used when needed (see [4], [6], [8]).
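As a rough illustration of these notions, the sketch below builds a tiny cube as a table of cells and applies a roll-up and a slice to it. It is a minimal sketch only: the dimension names (school_type, year, specialty), the measure (candidates) and the use of the pandas library are our own illustrative assumptions, not part of the system described in this paper.

# Minimal sketch of a data cube with roll-up and slice operations (pandas).
# Dimension names and the measure are illustrative, not the paper's schema.
import pandas as pd

# Cells of a small cube: one row per combination of dimension values.
cells = pd.DataFrame({
    "school_type": ["public", "public", "private", "private"],
    "year":        [1995, 1996, 1995, 1996],
    "specialty":   ["economics", "science", "economics", "science"],
    "candidates":  [1200, 1350, 400, 420],
})

# Roll up: aggregate the measure over the 'specialty' dimension,
# keeping only the (school_type, year) granularity.
rolled_up = cells.groupby(["school_type", "year"])["candidates"].sum()

# Slice: fix one dimension value to extract a sub-cube.
public_slice = cells[cells["school_type"] == "public"]

print(rolled_up)
print(public_slice)

Consolidation, in this setting, would amount to storing rolled_up once and reusing it for later queries instead of recomputing it each time.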
3 Fuzzy decision trees

Decision trees are well-known tools to represent knowledge, and several inductive learning methods exist to construct a decision tree from a training set of data [9]. The use of fuzzy set theory enhances the understandability of decision trees when considering numerical attributes. Moreover, it makes it possible to take imprecise values into account. Decision trees can be generalized into fuzzy decision trees by considering fuzzy values as labels of the edges of the tree. Thus, classical methods have been adapted to handle fuzzy values either during their construction or when classifying new cases [2]. Let A = {A_1, ..., A_N} be a set of attributes and let C = {c_1, ..., c_K} be a set of classes. A and C enable the construction of examples: each example
e_i is composed of a description (an N-tuple of attribute-value pairs (A_j, v_jl)) associated with a particular class c_k from C. Given a training set E = {e_1, ..., e_n} of examples, a (fuzzy) decision tree is constructed from the root to the leaves, by successive partitioning of the training set into subsets. The construction process can be split into 3 fundamental steps [3]. An attribute is selected thanks to a measure of discrimination H (step 1) that orders the attributes according to their accuracy with regard to the class [7]. The partitioning is done by means of a splitting strategy P (step 2). A stopping criterion T enables us to stop splitting a set and to construct a leaf in the tree (step 3). To build fuzzy decision trees, the commonly used H measure is the star-entropy measure defined as:

H(C|A_j) = - \sum_{l=1}^{L} P(v_{jl}) \sum_{k=1}^{K} P(c_k|v_{jl}) \log(P(c_k|v_{jl}))
where v_{j1}, ..., v_{jL} are the values taken in E by attribute A_j. This measure is obtained from the Shannon measure of entropy by introducing Zadeh's probability measure P of fuzzy events. The algorithm to construct fuzzy decision trees is implemented in the Salammbô system. The output of the system is a fuzzy decision tree that can be considered as a set of discovered classification rules [3].
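The sketch below shows how this star-entropy can be computed when each example carries membership degrees to the fuzzy values of A_j and to the classes. It is only an illustrative sketch under our own assumptions (Zadeh's probability of a fuzzy event estimated as the mean membership degree, the minimum used as the conjunction); it is not the Salammbô implementation.

# Illustrative star-entropy computation. The membership arrays, the use of
# the minimum as conjunction, and the mean-membership estimate of Zadeh's
# probability of a fuzzy event are assumptions made for this sketch only.
import numpy as np

def star_entropy(value_memberships, class_memberships):
    """value_memberships: (n, L) degrees of the n examples to the L fuzzy
    values of attribute A_j; class_memberships: (n, K) degrees to the K classes."""
    n, L = value_memberships.shape
    p_v = value_memberships.sum(axis=0) / n                    # P(v_jl)
    h = 0.0
    for l in range(L):
        # P(c_k | v_jl): probability of the fuzzy conjunction divided by P(v_jl).
        joint = np.minimum(value_memberships[:, [l]], class_memberships).sum(axis=0) / n
        p_c_given_v = joint / max(p_v[l], 1e-12)
        p_c_given_v = p_c_given_v[p_c_given_v > 0]
        h -= p_v[l] * np.sum(p_c_given_v * np.log(p_c_given_v))
    return h

# Toy usage: 4 examples, one attribute with 2 fuzzy values, 2 classes.
mu_v = np.array([[0.9, 0.1], [0.7, 0.3], [0.2, 0.8], [0.0, 1.0]])
mu_c = np.array([[1.0, 0.0], [0.8, 0.2], [0.1, 0.9], [0.0, 1.0]])
print(star_entropy(mu_v, mu_c))

In the crisp case (memberships in {0, 1}), this reduces to the usual Shannon conditional entropy used by classical decision tree induction.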
4 Fuzzy decision tree construction from a multidimensional database

The multidimensional database management system may either send only basic information on the data, or compute complex fuzzy operations and aggregations, or any intermediate solution. In the first case, the integration of other inductive learning applications will be easier: they all require, at the lowest level, simple aggregate computations such as counts (e.g. frequentist computations), so the exchanged basic statistics we propose will still be needed. All the other, more complex computations remain outside of the multidimensional database management system, which thus stays generic. On the other hand, the second solution speeds up the computation of aggregates (e.g. the computation of complex functions such as the entropy measure of a set), since multidimensional database management systems are designed for this kind of operation. In our system, an interface is implemented that consists in exchanging, for each node of the tree, statistics on the data associated with the current node. These data may either be singleton values or intervals, which enables us to construct fuzzy decision trees. These statistics are computed using OLAP queries to extract a sub-cube from the database and fetching statistics on it with aggregation functions. As detailed previously, they either concern basic
counts on the cells of the extracted data cube, or may require more complex calculus, such as entropy computation. These statistics are then used by Salammbô to choose the best attribute and go ahead in the construction of the tree. Step 1 of the decision tree construction process (see Section 3) is then replaced by the following process (see Fig. 1):
step 1.1 Salammbô sends a query specifying the current node to develop in the tree. This query is composed of the set of questions on attribute values that form the path from the root to this node.
step 1.2 The interface program queries the data cube and returns, for each attribute value and each class value, the corresponding statistics computed from the instances of the database that fulfill Salammbô's query.
step 1.3 Salammbô receives the statistics and chooses the best attribute. Depending on the chosen solution, the exchanged statistics are either frequentist counts of instances or the discrimination measure of each attribute.
[Fig. 1. Process of fuzzy decision tree construction: exchanges between Salammbô (client) and the interface (server). After initialisation of the communication and a first exchange of statistics on the attributes, the following loop runs while the tree is not constructed: Salammbô sends the path of the current node (a set of questions), the interface queries the Express database and sends back the statistics for the subset of examples associated with the node, and Salammbô tests the stopping criterion.]
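To make steps 1.1-1.3 concrete, the following sketch mimics the statistics exchange with an in-memory set of cells. The query format (a list of attribute/value questions), the function names and the toy data are assumptions made for illustration; the actual interface extracts a sub-cube from Oracle Express with OLAP queries rather than scanning a Python structure.

# Illustrative sketch of the statistics exchange of steps 1.1-1.3.
# Query format, function names and toy data are assumptions for this sketch.
from collections import defaultdict

def node_statistics(cube_cells, path_questions):
    """Step 1.2: for each (attribute, value) pair and each class, count the
    instances that satisfy every question on the path of the current node."""
    stats = defaultdict(lambda: defaultdict(int))
    for cell in cube_cells:
        if all(cell.get(attr) == value for attr, value in path_questions):
            for attr, value in cell.items():
                if attr != "class":
                    stats[(attr, value)][cell["class"]] += 1
    return stats

# Step 1.1: the client sends the path of the current node as a set of questions.
path = [("school_type", "public")]

# Toy content of the cube (one dictionary per instance).
cells = [
    {"school_type": "public",  "year": 1996, "class": "mention"},
    {"school_type": "public",  "year": 1995, "class": "no_mention"},
    {"school_type": "private", "year": 1996, "class": "mention"},
]

# Step 1.3: the client receives the statistics and selects the best attribute.
print(node_statistics(cells, path))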
5 Interest of fuzzy knowledge

As previously mentioned, a fuzzy decision tree is equivalent to a fuzzy rule base. In real-world applications, to describe numerical data or to handle imprecise data, fuzzy knowledge has been shown to be more understandable than classical knowledge. The fuzzy decision tree constructed from a training set composed of numerical data is generally smaller than the classical decision tree that can be constructed from the same training set. So the interest of fuzzy decision trees lies in their being more understandable than classical decision trees. Moreover, experiments have been conducted that highlight the greater generalisation power of such trees compared with classical ones. We place ourselves in a knowledge discovery process whose aim is to help users or experts understand data and to highlight high-level relations between data. In this kind of process, the interest of the induced knowledge lies in its understandability. The induced rules are given to experts of the domain and are used as synthetic knowledge of the whole database. This way of using induced knowledge is similar to the approach based on the computation of statistics on the database to gain understandability of the data. On the other hand, such discovered knowledge can also be used later in the database management system itself. For instance, fuzzy decision trees can be used as a set of integrity constraints, or they can be associated with the database as external knowledge that enables particular queries to give more information on the data. Moreover, in OLAP sessions, fuzzy rules can be given to users.
6 Implementation and results

This system has been developed with a client/server architecture, including a multidimensional database management system (Oracle Express) on SunOS, an Oracle relational database server on a PC under Microsoft Windows NT, a C++ interface on Windows NT, and Salammbô running on a second SunOS station (Fig. 2).
[Fig. 2. Proposed architecture: Oracle Express and Salammbô run on two SunOS stations, while the Oracle 8 relational database and the C++/SNAPI interface run on two Windows NT machines; the components communicate via sockets, SNAPI, and OCI/ODBC SQL.]
This system was evaluated on data from a data warehouse of the French Educational Department, containing individual results in the high school certificate examination over two years (about one million records). Examples of discovered rules are:
R1: If the high school is public, and the certificate category is economics, then the proportion of candidates that succeed with a mention (rather good, good, or very good) is small.
R2: If the high school is private, and the academic year is 1996, and the specialty of the student is foreign language 2, and the diploma is "STI", then the proportion of candidates that succeed with a mention (rather good, good, or very good) is big.
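The fuzzy labels "small" and "big" in these rules are fuzzy sets on the proportion of candidates succeeding with a mention. The sketch below shows one possible way to represent and evaluate such a rule; the trapezoidal membership functions are illustrative assumptions, not the ones learnt by Salammbô from the warehouse.

# Illustrative representation of a discovered fuzzy rule such as R1.
# The trapezoidal membership functions for "small" and "big" are assumptions.
def trapezoid(a, b, c, d):
    """Trapezoidal membership function over proportions in [0, 1]."""
    def mu(x):
        if x <= a or x >= d:
            return 0.0
        if b <= x <= c:
            return 1.0
        if x < b:
            return (x - a) / (b - a)
        return (d - x) / (d - c)
    return mu

small = trapezoid(-0.01, 0.0, 0.15, 0.30)   # proportions near 0 are "small"
big   = trapezoid(0.50, 0.70, 1.0, 1.01)    # proportions near 1 are "big"

def rule_r1(school_type, category, proportion_with_mention):
    """Degree to which R1 is satisfied: crisp premises, fuzzy conclusion."""
    if school_type == "public" and category == "economics":
        return small(proportion_with_mention)
    return 0.0

print(rule_r1("public", "economics", 0.10))  # high degree: the proportion is small
print(rule_r1("public", "economics", 0.60))  # low degree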
7 Conclusion and future work

In this paper, a new approach is presented to build fuzzy decision trees from a data warehouse using OLAP technology. This approach leads to the definition of a client/server architecture that associates the Salammbô software, a fuzzy decision tree construction system, with Oracle Express, a multidimensional DBMS. Results are promising and encourage us to pursue further research, including tests with other databases, further comparisons between the different solutions presented here to exchange statistics, the integration of other learning methods, the implementation of a more user-friendly interface, and the handling of fuzzy data (imprecise and/or uncertain) in the hypercubes.
8 Acknowledgments

This work is supported by the LIP6 (Computer Science Department of the University Paris 6) and has been made possible thanks to the database provided by the French Educational Department.
References
1. Agrawal R., Gupta A., and Sarawagi S., "Modeling Multidimensional Databases", in Proc. of the 13th Int. Conf. on Data Engineering, Birmingham, U.K., 1997.
2. Bouchon-Meunier B., Marsala C., and Ramdani M., "Learning from Imperfect Data", in Fuzzy Information Engineering: a Guided Tour of Applications, D. Dubois, H. Prade and R. R. Yager eds, John Wiley and Sons pub., chap. 8, pp. 139-148, 1997.
3. Bouchon-Meunier B. and Marsala C., "Learning Fuzzy Decision Rules", in Fuzzy Sets in Approximate Reasoning and Information Systems, J. Bezdek, D. Dubois and H. Prade eds, Kluwer Academic Pub., Handbooks on Fuzzy Sets Series, chap. 4, pp. 279-304, 1999.
4. Chen M.S., Han J., and Yu P.S., "Data Mining: An Overview from a Database Perspective", IEEE Trans. on Knowledge and Data Engineering, pp. 866-883, 1996.
5. Codd E.F., Codd S.B., and Salley C.T., "Providing OLAP (On-Line Analytical Processing) to User-Analysts: An IT Mandate", 1993.
6. Harinarayan V., Rajaraman A., and Ullman J., "Implementing Data Cubes Efficiently", in Proc. of the 1996 ACM SIGMOD Int. Conf. on Management of Data, pp. 205-216, 1996.
7. Marsala C., Bouchon-Meunier B., and Ramer A., "Hierarchical Model for Discrimination Measures", in Proceedings of the IFSA'99 World Congress, Taipei, Taiwan, pp. 339-343, August 1999.
8. "An Introduction to OLAP, Multidimensional Terminology and Technology", Pilot White Paper, http://www.pilotsw.com/olap/olap.htm#dsgl, 1999.
9. Quinlan J.R., "Induction of Decision Trees", in Machine Learning 1:1, pp. 81-106, 1986.