Database Mining through Inductive Logic Programming

Himanshu Gupta, Iain McLaren*, Alfred Vella
Department of Computing and IS, The University of Luton, Luton LU1 3JU, England.
[email protected], [email protected]
*Searchspace Limited, 60 Charlotte Street, London W1P 2AX, England
[email protected]

Abstract

Rapid growth in the automation of business transactions has led to an explosion in the size of databases. It has been realised for a long time that the data in these databases contains hidden information which needs to be extracted. Data mining is a step in this direction and aims to find potentially useful and non-trivial information in these databases in the form of patterns. As the size and complexity of these databases increase, the question that normally arises is "Are the existing data mining techniques efficient enough for large databases?" This paper addresses this issue and looks at an alternative approach, Inductive Logic Programming, and its integration with deductive databases. This integration leads to the development of a system called GLIDE (Generalised Logical Inductive and Deductive Environment).

1. Introduction

With the rapid growth in the automation of business transactions, there has been an enormous increase in the size of databases. It has been realised for a long time that the data contained in these databases holds information, in the form of patterns, that can be useful in acquiring business intelligence. Early attempts in this direction were based on statistical techniques, which mainly focused on confirming expected patterns. As the size and complexity of databases increased, the need for alternative techniques was realised, and researchers attempted to use artificial intelligence techniques for pattern extraction. This led to knowledge discovery in databases. Knowledge discovery in databases (KDD) is an attempt to convert high volume data into high value information (Frawley, Shapiro, and Matheus 1991). Data mining, a stage of the KDD process, is the process of automatically extracting potentially useful and non-trivial information from passive data using various artificial intelligence techniques (Williams and Huang 1996). The fact that there is potentially useful information hidden within large databases suggests a need for the acceptance of data mining as a method for obtaining business intelligence.

Though data mining seems to be an ideal solution, conventional techniques, with their background in artificial intelligence, emphasise the development of accurate models without giving much support for large data sets. For example, the algorithms developed often confine data to main memory. Although algorithms have recently been produced which no longer need the data to be loaded into main memory (Agrawal et al. 1993), these techniques are often restricted to a particular class of problem. In real world problems, databases can extend to hundreds of gigabytes, cannot fit into main memory, and require a variety of techniques to be mined effectively. Therefore, conventional data mining systems often provide poor support for real world data sets, and the question under debate is "Are data mining techniques efficient when applied to very large databases?" This paper concentrates on alternative techniques which are capable of dealing with very large databases. For this, we first discuss a database mining framework in the next section. The alternative approach of Inductive Logic Programming is discussed in section 3. Applying this technique within the database mining framework leads to the development of GLIDE, discussed in section 4. Finally, the paper concludes with some directions for future work.

2. Database Mining

Conventional database systems offer little functionality to support data mining applications (Holsheimer and Kersten 1994). At the same time, statistical and machine learning techniques usually perform poorly when applied to large data sets. These twin limitations call for the development of techniques which extract useful knowledge from large data sets efficiently and within a reasonable time. Database mining is a step in this direction. This section highlights some of the work done by researchers in the area of database mining. For a more complete review of database mining research see (Imielinski and Mannila 1996). Pioneering efforts in the field of database mining have been made at the IBM Almaden Research Centre. This work has covered a wide variety of topics, including

association (Agrawal et al. 1992) and classification (Agrawal et al. 1993). These techniques were developed in the context of large databases. Another wide-ranging body of work in database mining has been developed by researchers at Simon Fraser University in Canada. This work is largely based upon the use of attribute oriented data mining techniques (Cai, Cercone, and Han 1991). These attribute oriented techniques appear to be particularly suited to database environments. This work has led to the development of the DBMiner system (Han et al. 1996a). The DBMiner system uses a much less tightly coupled environment than the IBM research. This is to be expected, as the techniques developed are intended to be generic across a range of database systems. A third approach to database mining has been developed at CWI in the Netherlands. This work enhances the Monet database system (Van den Berg and Kersten 1994), also developed at CWI, to support data mining operations. This means providing low level support for the production of the statistics needed to execute a range of database applications. Looking for trends in the work mentioned above, we find that each of the approaches is based on rule induction techniques. This approach is largely chosen because of its fit with the SQL language and the fact that understandable models can be obtained from the algorithms. Secondly, each of the techniques is based on some distribution of labour between the database and machine learning systems. Database mining aims at executing the learning algorithms directly against the database rather than loading the data into main memory. Early attempts at database mining were based on a loosely coupled architecture (Agrawal and Shim 1996), where SQL commands are embedded within a host programming language. The machine learning algorithm retrieves records from the database in a tuple-at-a-time fashion, each time switching control to the database. Recent database mining systems have begun to integrate the machine learning and database components (Han et al. 1996b). The machine learning algorithms are pushed into the execution engine of the database, as shown in Fig 1, and are able to operate directly against data held in a relational database system.

[Fig 1. Tightly Coupled Architecture. Components shown: Database, Machine Learning, Control functions, Model.]

This reduces execution time by avoiding context switching between the database and the machine learning algorithm, and hence no longer restricts the size of the data set which can be manipulated. However, this research is largely based upon the use of attribute oriented data mining techniques. This means that mining the data held in a database requires a universal relation to be formed before the data mining technique can be applied. This is an expensive operation and negates many of the advantages offered by the database query processor. Effectively combining data mining and database technologies has considerable implications for the learning techniques used. They must work in a manner suited to database processing (Gupta, McLaren, and Vella 1997). This means that they must be record oriented and work in a set-at-a-time fashion. New learning styles must also be investigated. These include incremental and iterative learning. In incremental learning, we generate a model from an initial data set and then update this model with the introduction of new records. Iterative learning exploits the iterative nature of the KDD process by initially generating a rough model and increasing its accuracy with each pass through the data. This enables considerable machine execution time to be saved, as many records can be discarded early in the modelling process. A technique that has potential in each of these areas is Inductive Logic Programming.
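As a rough illustration of the incremental style described above, the Prolog sketch below keeps a "model" and updates it only when a newly arriving record is not already covered. The predicates covers/2 and refine/3 are deliberately trivial stand-ins (the model here is just the list of records absorbed so far) and do not correspond to any predicate in the packages discussed later; a real ILP system would maintain a clausal theory and a genuine refinement operator.

    % Minimal runnable sketch of incremental learning (illustrative only).
    covers(Model, Record) :- member(Record, Model).     % record already explained
    refine(Model, Record, [Record|Model]).              % otherwise absorb it

    process_record(Model, Record, Model) :-
        covers(Model, Record), !.                       % positive example: keep model
    process_record(Model, Record, NewModel) :-
        refine(Model, Record, NewModel).                % update with the new record

    incremental(Model, [], Model).
    incremental(Model0, [Record|Rest], Model) :-
        process_record(Model0, Record, Model1),
        incremental(Model1, Rest, Model).

For example, the query ?- incremental([], [a,b,a,c], M). yields M = [c,b,a], with the repeated record a never triggering a model update.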

3. An Inductive Logic Programming Approach

Inductive Logic Programming (ILP) is the intersection of logic programming and inductive learning (Muggleton 1992). ILP starts by generating hypotheses from a given set of observations and background knowledge and discovers patterns based on this hypothesis space. The hypotheses generated indicate the behaviour of the data and can be treated as a rough model. ILP leads to fast pattern discovery across a broad range of applications. Most conventional data mining techniques are suitable for a particular type of application. For example, back propagation algorithms are suitable for classification based data mining applications, Kohonen networks are suitable for clustering, and rule induction techniques are suitable for association type problems. This means that for different types of application, a different artificial intelligence technique is needed. Instead, all such applications can be solved by configuring a single ILP system. This feature strengthens the case for using ILP techniques in data mining applications. We used the CLAUDIEN ILP package for test purposes (Raedt and Dehaspe 1996). CLAUDIEN allows the user to define their own declarative bias and background knowledge. Different rules were written in the background file in Prolog syntax

for classification, association, clustering, and sequencing. These rules are read when the background file is consulted. Suppose all the visits to the web server are stored as Prolog facts with the predicate name visit; a rule for finding association rules can then be written as follows:

    asso(Pagevisited1, Pagevisited2) :-
        visit(IP1, Date1, Time1, Pagevisited1),
        visit(IP2, Date2, Time2, Pagevisited2),
        IP1 = IP2, Date1 = Date2, Time2 > Time1.
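For concreteness, the kind of visit/4 facts such a rule ranges over might look as follows. The representation chosen here (server name and date as atoms, time as an integer number of seconds past midnight so that Time2 > Time1 is an ordinary arithmetic comparison) is an assumption made for this sketch rather than the encoding used in our experiments.

    % Hypothetical visit(Server, Date, Time, Page) facts derived from a WebLog.
    % Times are seconds past midnight, e.g. 09:40:38 -> 34838.
    visit('triton.pouliadis.gr', '30/Sep/1996', 34838, '/').
    visit('triton.pouliadis.gr', '30/Sep/1996', 34846, '/Header.html').
    visit('triton.pouliadis.gr', '30/Sep/1996', 34846, '/PaletteandContent.html').
    visit('triton.pouliadis.gr', '30/Sep/1996', 34847, '/Palette.html').

With these facts loaded, the query ?- asso(P1, P2). enumerates ordered pairs of pages requested by the same server on the same date, for example P1 = '/', P2 = '/Header.html'.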

Similar style rules can be used to implement the other applications of data mining within an ILP system. Initial efforts in ILP were carried out by Muggleton, who used inverse resolution, a process of inverting resolution to find generalised clauses. He mainly used six operators: two 'V' operators (absorption and identification), two 'W' operators (intra-construction and inter-construction) and two truncation operators. The V operators are used for generalisation, whereas the W operators are used for predicate invention. The main packages developed from this work were CIGOL, GOLEM and PROGOL. Following this line of work, Dzeroski proposed two types of ILP, namely empirical and interactive learning. Empirical learning takes all the data at the start and learns the patterns. This type of learning is ideal for large databases, but the problem is that all the data has to be accessed at the start of the process. Interactive ILP, by contrast, accesses the data as it is needed during learning, but is suitable for small data sets only. This leads to a trade-off between the size of the database used and the type of learning process. In the research we are carrying out, we aim to implement interactive ILP learning features within an empirical learning setting. A major advantage of ILP is the efficient use of background knowledge. Background knowledge, along with declarative bias (a bias defined by the user), helps in restricting the hypothesis search (Raedt and Dehaspe 1996), and has potential in allowing both the incremental and iterative learning styles to be implemented. The model generated in the first pass during hypothesis generation gives a rough indication of what constitutes the whole database. So in the next pass, instead of starting from scratch, the hypothesis model generated in the first pass is further refined. This continues until a pre-defined accuracy level is reached. The hypothesis search space can grow exponentially if declarative bias or background knowledge is not defined. The ILP package CLAUDIEN provides the user with a means of expressing this declarative bias in the form of Prolog statements, which can be accessed either as part of the background knowledge or separately. Another advantage is that ILP techniques are record oriented and can therefore work with several tables at a time, eliminating the need to form a universal

relation. As the languages used by ILP techniques are compatible with database query languages, it also becomes easier to integrate ILP with existing database technology.
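To illustrate how background knowledge can restrict the hypothesis search, a helper predicate can capture recurring join conditions once, so that induced rules need only a single literal over it. The clause below is a hypothetical example for this paper's WebLog setting, not a clause from the actual background file used with CLAUDIEN.

    % Hypothetical background knowledge: Page2 was requested after Page1
    % by the same server on the same date.
    followed_by(Page1, Page2) :-
        visit(Server, Date, Time1, Page1),
        visit(Server, Date, Time2, Page2),
        Time2 > Time1.

Given such a clause, the association rule of section 3 reduces to asso(P1, P2) :- followed_by(P1, P2), and the space of visit/4 joins the ILP system must search over is correspondingly smaller.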

4. Integration of ILP and Deductive Databases

As logic programming and relational databases have the same theoretical background, many researchers have identified logic programming as an ideal formalism for developing database applications. This has led to the development of deductive databases. By incorporating the rules found in logic programming languages like Prolog, deductive databases support a more expressive data definition and query language than relational systems. A major difference between logic programming and deductive database systems is in their approach to evaluation. Logic programming systems, like Prolog, use a tuple-at-a-time evaluation strategy, returning one answer before backtracking to get the next answer. In order to support efficient access to the database file store, deductive databases, like relational systems, use set-at-a-time evaluation strategies, returning all the answers at once. A deductive database supports all the features associated with a standard relational system through a declarative logic based language. A major problem preventing existing ILP systems from efficiently processing large data sets is that they also access data in a tuple-at-a-time fashion. Database systems are designed to work effectively when data is accessed in a set-at-a-time fashion. This leads to a mismatch between their approaches to processing data. To overcome this problem, we use a deductive database management system to access the data (Ullman and Zaniolo 1990; Das 1992). The advantages offered by this are twofold. Firstly, the access method becomes set oriented, which implies faster access and proximity to the database mining framework. Secondly, as the ILP system generates patterns, features of the database query processor can be utilised to apply these patterns, in the form of queries, for verification of hypotheses. Achieving the effective integration of ILP and deductive database technologies requires a rethinking of the approach taken to processing data in ILP, so that there is no longer a mismatch with the data access engine. The deductive database system used in this project is CORAL (COntrol, Relations And Logic) (Ramakrishnan, Seshadri and Srivastava 1990).
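The difference between the two evaluation styles can be sketched in Prolog terms: backtracking over asso/2 yields one answer per call, whereas collecting all answers in a single call mirrors the set-oriented behaviour expected of the deductive database. This is only an analogy, since CORAL's own interface is its Datalog-like language rather than Prolog.

    % Tuple-at-a-time: one answer per backtracking step.
    %   ?- asso(P1, P2).
    %
    % Set-at-a-time analogue: all answers materialised in one call.
    all_associations(Pairs) :-
        findall(asso(P1, P2), asso(P1, P2), Pairs).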

5. The GLIDE Architecture

To achieve efficient database mining, the GLIDE (Generalised Logical Inductive and Deductive Environment) architecture is proposed, as shown in Fig 2.

GLIDE's conceptual model is based on the integration of Inductive Logic Programming and deductive databases within the database mining framework. GLIDE accesses the data directly from the database. A deductive database is used as the data access tool because of its set-oriented nature and its use of logic for the expression of rules and facts. Accessing data directly from the database saves considerable time, as databases are optimised for manipulating large amounts of data. While the data is being accessed, background knowledge is also expressed. Background knowledge guides the data access so that only relevant data is retrieved. Its expression is optional, though recommended, as it helps in faster generation of the first model. This knowledge is expressed in the form of rules and can be supplied with the observations or in a separate file. Support for background knowledge enables both iterative and incremental learning to be implemented: rules learnt from previous modelling exercises can be fed back in to drive future models.

Once the data and the background knowledge have been accessed, Inductive Logic Programming techniques are used to generate an initial hypothesis set. There are two methods of generating hypotheses, resolution and inverse resolution. Resolution works top-down (from general to specific), whereas inverse resolution works bottom-up (from specific to general). This means that using inverse resolution as the hypothesis generation mechanism overcomes the mismatch between the bottom-up working of deductive databases and the top-down working of Inductive Logic Programming techniques. Generating an initial hypothesis suits the iterative learning style, since in the next pass only the model relating to this hypothesis space is looked for; this saves the time spent looking for a new model in the whole of the data set. This cycle of hypothesis model generation and refinement realises the iterative learning style. To achieve the incremental learning style, GLIDE will generate an initial model of fair accuracy. When a new set of data is accessed, it is tested against the existing model. If the new data set proves to be a positive example, another set of data is accessed; otherwise the model updates itself with the new set of data. This saves the execution time of regenerating the whole model over and over again. At present, whenever the machine learning code issues a data access call, the deductive database returns a set of records at a time, but this set is then consumed in a tuple-at-a-time fashion, as the algorithms are designed to work that way. In the GLIDE architecture, the data is returned in a set-at-a-time fashion, and so is the access method for the data.

[Fig 2. GLIDE Architecture. Components shown include Pattern Discovery and Model.]

6. Mining Web Access Data

As a test for our work we are applying GLIDE to the analysis of web log data. The web access log (or WebLog) is a record of the visitors to a particular web server. These logs record information about the name of the visitor's server, the date and time of the visit, the pages visited and the amount of information transferred. A typical WebLog looks like the following:

    triton.pouliadis.gr - [30/Sep/1996:09:40:38 +0000] "GET / HTTP/1.0" 200 6116
    triton.pouliadis.gr - [30/Sep/1996:09:40:46 +0000] "GET /Header.html HTTP/1.0" 200 210
    triton.pouliadis.gr - [30/Sep/1996:09:40:46 +0000] "GET /PaletteandContent.html HTTP/1.0" 200 302
    triton.pouliadis.gr - [30/Sep/1996:09:40:47 +0000] "GET /Palette.html HTTP/1.0" 200 885
    triton.pouliadis.gr - [30/Sep/1996:09:40:47 +0000] "GET /GraphicsBin/LutonHeader.gif HTTP/1.0" 200 9392
    triton.pouliadis.gr - [30/Sep/1996:09:40:47 +0000] "GET /Content.html HTTP/1.0" 200 6000
    triton.pouliadis.gr - [30/Sep/1996:09:40:49 +0000] "GET /GraphicsBin/Palette.gif HTTP/1.0" 200 7723
    triton.pouliadis.gr - [30/Sep/1996:09:40:52 +0000] "GET /GraphicsBin/ClearingGraphic.gif HTTP/1.0" 200 13785

From the above set of data, it can be inferred that a visitor from Greece visited this particular server on 30th September 1996 and went to the header page. From that page, the visitor visited the PaletteandContent page and then went to the Palette page. Companies are already engaged in extracting patterns from WebLogs, mainly statistical information such as the load on the server on an hourly, daily, weekly or monthly basis. Artificial intelligence techniques can also be used to generate more useful and complex patterns from the WebLog, and these patterns can be useful in designing future web pages. This amounts to knowledge discovery from the rapidly growing WebLog. WebLog data can be used to extract patterns relating to all types of data mining application: classification, clustering, association, sequencing and forecasting. For instance, classification finds patterns by classifying visitors according to the faculties visited, while clustering groups frequently visited pages. However, all these applications would normally demand the use of separate data mining techniques. Instead, configurations of GLIDE can be used for all such types of application, as sketched below.
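As an example of the classification configuration mentioned above, a rule could assign visiting servers to faculty areas according to the pages they request. The faculty_page/2 facts, page names and predicate names below are purely illustrative and are not taken from our experiments; they reuse the hypothetical visit/4 representation sketched in section 3.

    % Hypothetical mapping from pages to faculty areas.
    faculty_page('/humanities/Index.html', humanities).
    faculty_page('/GraduateSchool/Index.html', graduateschool).

    % Classify a visiting server by the faculty areas whose pages it requested.
    visitor_class(Server, Faculty) :-
        visit(Server, _Date, _Time, Page),
        faculty_page(Page, Faculty).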

For the analysis of the WebLog, a sample of which is shown above, the following declarative bias was defined:

    dlab_template('0-len:[header, graphicsbin, graduateschool, palette, humanities]
