If inductive logic programming leads, will data mining ...

If inductive logic programming leads, will data mining follow?

Randy Goebel
Department of Computing Science
University of Alberta
Edmonton, Alberta, Canada T6G 2H1
E-mail: [email protected]
WWW: http://www.cs.ualberta.ca/~goebel
voice: 403/492-2683
fax: 403/492-1071

August 9, 1996

This paper has appeared in the Proceedings of the Japanese Society for Artificial Intelligence, Foundations of AI Special Interest Group Workshop on Inductive Logic Programming, pages 39-49, Hokkaido University, Sapporo, Japan, July 31-August 2, 1996.

Abstract

The increasing popularity of inductive logic programming (ILP) has provided one clear demonstration that machine learning has become practical. Despite its relatively conservative basis, it has natural avenues of both theoretical and practical development. One more general area in which induction has a role is knowledge discovery in databases (KDD), sometimes called data mining. Many of the current approaches there are based on the creation of abstraction rules, guided by explicit concept hierarchies and by hypothesis rankings based on measures like "support" and "confidence" computed against extensional (ground) databases. We examine some of the directions in KDD, with the goal of identifying ILP research that can gracefully lead KDD to improved methods.

[D R A F T - comments welcome]


1 Introduction

Data mining, or knowledge discovery in databases (KDD), is relatively "hot" these days (e.g., [Ng96, FP96]), and there is a lot of work that is explicitly or implicitly motivated by the increased industrial awareness of the market potential (e.g., [AAB+96, Mic96]). Each research paper's motivating comments tout the advantages of automatically created abstractions, and the potentially lucrative insights from their use. But this general motivation impinges on an incredible breadth of related areas, so the research is subject to a wide variety of criticisms. For example, one typical criticism comes from traditional research areas like statistical inference, whose practitioners focus just beyond the KDD production of association rules, on their justification in terms of statistical properties like "support" and "confidence." With the simple statistical nature of these KDD measures thus revealed, it is easy for the statistician both to criticize and to propose elaborations for the wide range of existing methods for structuring relational attributes and manipulating their properties towards apparently preferred KDD results.

Another motivation concerns the desired form of data mining results. The overwhelming majority of work is concerned with the production of association rules expressed over relational attributes; e.g., in a grocery store purchase transaction database, an example of a single-level association rule is "80% of customers who purchase milk also purchase bread" [FLG96]. The consensus is that ease of interpretation is also an essential feature of KDD, much like the longstanding claims within artificial intelligence about the role of semantic network and iconic representations (e.g., [Fin79, Goe93]).

After reviewing these primary motivations, it is easy to understand that inductive logic programming (ILP) must concentrate its contributions on the "logic" portion of ILP.
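To make the statistical nature of "support" and "confidence" concrete, here is a minimal Python sketch. It assumes the usual definitions from the association rule literature (support of an itemset is the fraction of transactions containing it; confidence of a rule is the fraction of transactions containing the antecedent that also contain the consequent), which this paper does not itself spell out. The function names and the ten transactions are invented purely for illustration.

```python
from typing import FrozenSet, List

Transaction = FrozenSet[str]


def support(transactions: List[Transaction], itemset: FrozenSet[str]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(transactions: List[Transaction],
               antecedent: FrozenSet[str],
               consequent: FrozenSet[str]) -> float:
    """Among transactions containing `antecedent`, the fraction also containing `consequent`."""
    with_antecedent = [t for t in transactions if antecedent <= t]
    if not with_antecedent:
        return 0.0
    hits = sum(1 for t in with_antecedent if consequent <= t)
    return hits / len(with_antecedent)


# Hypothetical grocery transactions (not data from any cited system).
TRANSACTIONS: List[Transaction] = [
    frozenset({"milk", "bread"}),
    frozenset({"milk", "bread", "eggs"}),
    frozenset({"milk", "bread", "butter"}),
    frozenset({"milk", "bread"}),
    frozenset({"milk", "eggs"}),
    frozenset({"bread"}),
    frozenset({"eggs", "butter"}),
    frozenset({"butter"}),
    frozenset({"bread", "eggs"}),
    frozenset({"eggs"}),
]

MILK = frozenset({"milk"})
BREAD = frozenset({"bread"})

rule_support = support(TRANSACTIONS, MILK | BREAD)       # 4 of 10 transactions
rule_confidence = confidence(TRANSACTIONS, MILK, BREAD)  # 4 of the 5 milk transactions
print(f"milk -> bread: support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

On this toy data the rule "customers who purchase milk also purchase bread" has confidence 0.8, matching the "80%" reading in the text, and both measures are nothing more than counts over ground tuples.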

2 From data models to deductive data models

To understand the connection between logic and databases, one must consider the historical refrain of database management itself, and its mission to provide "data independence" (e.g., [FS76, McL81, Cod81]). This concept requires that a data model, the essential specification of a database system, provide the capability for efficiently using data independent of the way in which it is stored. This alternative history is important because KDD has matured to the point where some are beginning to seek a new,
general framework for KDD, based on suggestive names like an "inductive data model" [RCR96], or on important components thereof, like a "data mining query language" [HFW+96]. Some recent insightful analysis [Imi96] surveys various properties of data base mining projects, and concurs that most are largely "add-ons":

    ... The field can progress only when the database component is fully understood and explored. It is the database component which will give the field its identity. ([Imi96], page 36)

The idea of a data model is useful because it can provide a framework to integrate inductive reasoning methods. To provide a flavour of that structure, here are a few of the historical proposals:

• Colombetti et al. [CPP78]
  - a formal specification of all operations on objects belonging to the data model, and
  - a formal specification of the semantics of the operations.
• Codd [Cod81]
  - a collection of data structures
  - a collection of operators
  - a collection of integrity rules
• McLeod [McL81]
  - a data space: a collection of elements and relationships among the elements,
  - a collection of type definition constraints to be imposed on the data,
  - manipulation operators supporting the creation, deletion, and modification of elements, and
  - a predicate language used to identify and select elements from the database

Hindsight reveals that these definitions lead naturally to another important historical thread: the redevelopment of the relational data model in terms of deductive logic (e.g., [Min75, Min78, GMN78, Rei78, Rei81, Rei83]), and eventually to logic programming (e.g., [Cla78, Kow81]). The essence of all this work was to place the concept of a data base and its use into the framework of the simple expression

$$\text{database} \vdash \text{query}.$$

In addition to revealing how easy it was to consider the results of queries as more than mere retrieval, the companion logical expression

$$\text{database} \models \text{query}$$

provided the basis for semantic analyses of difficult issues like so-called "null values" (e.g., [Vas79, Cod79]).
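The reading of query answering as derivability can be illustrated with a small Python sketch. It is only one possible rendering, under assumptions that are not part of the cited work: atoms are tuples of strings, variables follow the Prolog convention of an initial upper-case letter, rules are function-free Horn clauses whose head variables all occur in the body, and evaluation is naive bottom-up closure. The predicate names and facts are invented.

```python
from typing import Dict, List, Optional, Set, Tuple

Atom = Tuple[str, ...]            # (predicate, arg1, ..., argN)
Rule = Tuple[Atom, List[Atom]]    # (head, body): head holds if every body atom holds
Binding = Dict[str, str]


def is_var(term: str) -> bool:
    """Assumed convention: variables start with an upper-case letter."""
    return term[:1].isupper()


def match(pattern: Atom, fact: Atom, binding: Binding) -> Optional[Binding]:
    """Extend `binding` so that `pattern` equals the ground `fact`, or return None."""
    if len(pattern) != len(fact) or pattern[0] != fact[0]:
        return None
    extended = dict(binding)
    for p, f in zip(pattern[1:], fact[1:]):
        if is_var(p):
            if extended.setdefault(p, f) != f:
                return None
        elif p != f:
            return None
    return extended


def substitute(atom: Atom, binding: Binding) -> Atom:
    """Replace bound variables in the arguments of `atom`."""
    return (atom[0],) + tuple(binding.get(t, t) for t in atom[1:])


def closure(facts: Set[Atom], rules: List[Rule]) -> Set[Atom]:
    """Apply every rule repeatedly until no new ground atom can be derived."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for head, body in rules:
            bindings: List[Binding] = [{}]
            for goal in body:
                next_bindings: List[Binding] = []
                for b in bindings:
                    for fact in derived:
                        extended = match(goal, fact, b)
                        if extended is not None:
                            next_bindings.append(extended)
                bindings = next_bindings
            for b in bindings:
                new_atom = substitute(head, b)
                if new_atom not in derived:
                    derived.add(new_atom)
                    changed = True
    return derived


def proves(facts: Set[Atom], rules: List[Rule], query: Atom) -> bool:
    """database |- query, for a ground query atom."""
    return query in closure(facts, rules)


# Hypothetical extensional database and one deductive rule.
FACTS: Set[Atom] = {
    ("purchase", "ann", "milk"),
    ("purchase", "bob", "bread"),
    ("dairy", "milk"),
}
RULES: List[Rule] = [
    (("buys_dairy", "X"), [("purchase", "X", "Y"), ("dairy", "Y")]),
]

print(proves(FACTS, RULES, ("purchase", "ann", "milk")))  # stored tuple: mere retrieval
print(proves(FACTS, RULES, ("buys_dairy", "ann")))        # not stored: derived by the rule
```

The second query succeeds although no `buys_dairy` tuple is stored, which is the sense in which query results are "more than mere retrieval."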

3 From knowing to guessing: adding ILP to data models

Just as the deductive model of databases suggests, a query $Q$ is answered by deductively proving that it follows from the existing database $DB$, viz. $DB \vdash Q$. So how does one know when a query is initiating data mining, or merely causing the retrieval of facts? From the ILP viewpoint, the development of ILP as practical inductive reasoning [Mug92, Rae93] doesn't easily fit within the deductive interpretation of data models:

$$\text{database} \cup \text{inductive hypotheses} \models \text{query}$$

If inductive logic programming comprises methods for identifying and fabricating useful inductive hypotheses, just what is the relationship of those hypotheses to the traditional data model? Again, consider that the most common form of data mining is the creation of association rules from a database of tuples (e.g., [AIS93, HF95]). Rules such as the one expressed above, "80% of customers who purchase milk also purchase bread," would then be interpreted something like

$$\forall X\; \mathit{purchase}(X, \mathit{milk}) \rightarrow \mathit{purchase}(X, \mathit{bread})$$

The "something like" means that the universal interpretation really doesn't hold, but suggests a kind of default hypothesis, suitable for making guesses about the behaviour of a typical shopper. (Of course this makes the connection to another vast research literature on nonmonotonic reasoning, whose discussion must be delayed for now.) One of the potentially many
difficulties is how and when a user of a data model determines whether to ask for inductive results or deductive results. To get a feeling for the number of issues that arise in adapting ILP for data models, consider the suggestions of Roddick et al. [RCR96]:

• typically data mining will not be user-guided but done out-of-line, during periods when a database is idle;
• all inductive hypotheses are consistent with existing data;
• only a limited number of inductive hypotheses are saved;
• the consistent inductive hypotheses must be kept consistent over update activities.

These suggestions have some basis in rationality, yet there are reasonable arguments against all of them. For example, the size of even the simplest inductive search spaces is such that domain-dependent guidance is necessary; as in nonmonotonic reasoning, there are clearly cases where hypotheses are not consistent with all data, but still extremely useful; and it may well be useful to store multiple alternative sets of consistent hypotheses, each useful for different applications of the same database.
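The distinction between knowing and guessing, and the question of what happens to stored hypotheses under updates, can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a proposal from [RCR96]: the database is a set of (customer, item) tuples that may be incomplete, a hypothesis is a pair of items read as a default rule, and a hypothesis is retained while its confidence stays above a threshold instead of requiring strict consistency. All names and data are hypothetical.

```python
from typing import List, Set, Tuple

Purchase = Tuple[str, str]     # (customer, item)
Hypothesis = Tuple[str, str]   # (antecedent item, consequent item), read as a default


def customers_buying(db: Set[Purchase], item: str) -> Set[str]:
    return {customer for customer, bought in db if bought == item}


def answer(db: Set[Purchase], hypotheses: List[Hypothesis],
           customer: str, item: str) -> str:
    """Label an answer by how it was obtained."""
    if (customer, item) in db:
        return "known"        # follows from the database alone
    for antecedent, consequent in hypotheses:
        if consequent == item and (customer, antecedent) in db:
            return "guessed"  # follows only with an inductive hypothesis added
    return "unknown"


def counterexamples(db: Set[Purchase], hypothesis: Hypothesis) -> Set[str]:
    """Customers recorded with the antecedent but not the consequent."""
    antecedent, consequent = hypothesis
    return customers_buying(db, antecedent) - customers_buying(db, consequent)


def confidence(db: Set[Purchase], hypothesis: Hypothesis) -> float:
    antecedent, consequent = hypothesis
    buyers = customers_buying(db, antecedent)
    if not buyers:
        return 0.0
    return len(buyers & customers_buying(db, consequent)) / len(buyers)


def revalidate(db: Set[Purchase], hypotheses: List[Hypothesis],
               threshold: float) -> List[Hypothesis]:
    """Keep only the hypotheses that still meet the threshold after an update."""
    return [h for h in hypotheses if confidence(db, h) >= threshold]


# Hypothetical data: four customers buy milk and bread, one buys only milk.
DB: Set[Purchase] = {
    ("c1", "milk"), ("c1", "bread"),
    ("c2", "milk"), ("c2", "bread"),
    ("c3", "milk"), ("c3", "bread"),
    ("c4", "milk"), ("c4", "bread"),
    ("c5", "milk"),
}
MILK_BREAD: Hypothesis = ("milk", "bread")

print(answer(DB, [MILK_BREAD], "c1", "bread"))
print(answer(DB, [MILK_BREAD], "c5", "bread"))
print(sorted(counterexamples(DB, MILK_BREAD)))

# An update adds two milk-only customers; the hypothesis no longer meets a 0.75 threshold.
UPDATED = DB | {("c6", "milk"), ("c7", "milk")}
print(revalidate(DB, [MILK_BREAD], 0.75), revalidate(UPDATED, [MILK_BREAD], 0.75))
```

The hypothesis is inconsistent with the data from the start (customer c5 is a counterexample), yet it remains useful for guessing until the update pushes its confidence from 4/5 down to 4/7. Whether such a hypothesis should then be discarded, revised, or kept for some other application is exactly the kind of policy an inductive data model would have to make explicit.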

4 Directions in inductive logic programming

To provide a representative summary of all important work is challenging. However, it is clear that the robust foundation provided by logic programming has provided the basis for similar robustness within inductive logic programming. (For a quick introduction to the theory of ILP, see [Gro96].) This simplicity and robustness has given rise to the development of sophisticated applications (e.g., [HS92, MKS92, DSKH+96]), which certainly supports claims for ILP's practicality. However, as previously mentioned, to offer leadership for data mining is to offer inductive methods which can be cleanly integrated within the traditional database framework.

One suggestion is to be relatively conservative, and exploit the view of the relational data model as a ground propositional theory. This is the strategy recently proposed by Bain et al. [BSSS96], who, while saying relatively little about issues such as those identified above, argue that the simplicity and scalability of the simplest ILP methods are of most immediate interest.

A less conservative approach would be to consider a system like Claudien [DD95], and to address the integration of this clausal induction system as suggested by Blockeel and De Raedt [BD96]. In contrast to the conservative approach of [BSSS96], Blockeel and De Raedt note the obvious difficulties with simpler propositional ILP approaches, and begin to address the required integration of a more sophisticated ILP method within a data base query system.

Yet another issue requiring much more attention is the integration of methods for providing domain- or user-dependent control within the inductive search space. As is already well studied, the identification of preferred inductive hypotheses is in fact the challenge in inductive reasoning, and the notion of relative least general generalization (RLGG), common in ILP, will have to be integrated with an inductive data model in a way that database users can relate to their own interpretation of their domains. Furthermore, recent work on the generalization of RLGG [LA96] demonstrates a hierarchy of such generalizations, which will prove even more challenging to incorporate into inductive data models.
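For readers unfamiliar with least general generalization, the following Python sketch shows the plain (non-relative) case for two atoms, computed by anti-unification. It deliberately omits the background knowledge that makes the generalization "relative," so it should be read as a simplified illustration of the underlying operation and not as an RLGG implementation. The term encoding (strings for constants, tuples for compound terms) and the generated variable names V0, V1, ... are assumptions made for the example.

```python
from typing import Dict, Optional, Tuple, Union

# A term is a constant (str) or a compound term (functor, arg1, ..., argN).
Term = Union[str, Tuple]


def lgg(s: Term, t: Term, table: Optional[Dict[Tuple[Term, Term], str]] = None) -> Term:
    """Least general generalization of two terms by anti-unification.

    `table` remembers which variable replaced each pair of differing
    subterms, so the same pair is always replaced by the same variable.
    Assumes no constant is itself named like a generated variable (V0, V1, ...).
    """
    if table is None:
        table = {}
    if s == t:
        return s
    same_shape = (
        isinstance(s, tuple) and isinstance(t, tuple)
        and len(s) == len(t) and s[0] == t[0]
    )
    if same_shape:
        # Same functor and arity: generalize argument by argument.
        return (s[0],) + tuple(lgg(a, b, table) for a, b in zip(s[1:], t[1:]))
    # Differing subterms: introduce (or reuse) a variable for this pair.
    if (s, t) not in table:
        table[(s, t)] = f"V{len(table)}"
    return table[(s, t)]


# Two hypothetical ground facts about different customers.
print(lgg(("purchase", "ann", "milk"), ("purchase", "bob", "milk")))
# The pair (a, b) recurs, so one variable is shared across both positions.
print(lgg(("p", ("f", "a"), "a"), ("p", ("f", "b"), "b")))
```

The first call generalizes the two purchase facts to `purchase(V0, milk)`: the customer is abstracted to a variable while the shared item is kept, since any further abstraction would be more general than necessary.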

5 Summary

We have provided little more than a caricature of the issues surrounding the incorporation of ILP methods into data mining, and their required integration within the framework of an inductive data model. Issues unmentioned include the challenges of managing noisy data, the use of probability to provide measures to guide the investigation of inductive hypothesis space, the analysis of classes of hypotheses that are computationally obtainable, and the relative efficiency of those that are. However, it has been argued that none of these challenges can be pursued without some attention given to the concept of an inductive data model, and its development of a framework to integrate ILP concepts. With even this cursory analysis, the answer to the question "if ILP leads, will data mining follow?" is clearly "no," without the development of some consensus on what an inductive data model is.

Acknowledgements

This work has been supported by the Natural Sciences and Engineering Research Council of Canada, and by the Canadian Federal Networks of Centres of Excellence Institute for Robotics and Intelligent Systems.
Luc de Raedt and Jean-Daniel Zucker provided some quick response to help identify recent related work, as did Stan Matwin. Tomasz Imielinski's presentation of [Imi96] has provided insight and motivation. I look forward to lots of help from the rest of my ILP, data mining, and inductive reasoning colleagues, to see if there is any substance to the claims herein.

References

[AAB+96] R. Agrawal, A. Arning, T. Bollinger, M. Mehta, J. Shafer, and R. Srikant. The quest data mining system. In Proceedings of the 2nd International Conference on Knowledge Discovery in Databases and Data Mining, Portland, Oregon, August 1996. [to appear].

[AIS93] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In Proceedings of the 1993 ACM-SIGMOD International Conference on the Management of Data, pages 207-216, Washington, D.C., May 1993.

[BD96] H. Blockeel and Luc De Raedt. Relational knowledge discovery in databases. In Johannes Fürnkranz and Bernhard Pfahringer, editors, 13th International Conference on Machine Learning Workshop on Data Mining with Inductive Logic Programming. [see http://www.ai.univie.ac.at/ilp_kdd/schedule.html], July 2 1996.

[BSSS96] M. Bain, C. Sammut, A. Sharma, and J. Shepard. ReDuce: Automatic structuring and compression in relational databases. In Johannes Fürnkranz and Bernhard Pfahringer, editors, 13th International Conference on Machine Learning Workshop on Data Mining with Inductive Logic Programming. [see http://www.ai.univie.ac.at/ilp_kdd/schedule.html], July 2 1996.

[Cla78] K.L. Clark. Negation as failure. In H. Gallaire and J. Minker, editors, Logic and Data Bases, pages 293-322. Plenum Press, New York, 1978.

[Cod79] E.F. Codd. Extending the database relational model to capture more meaning. ACM Transactions on Database Systems, 4(4):397-434, 1979.

[Cod81] E.F. Codd. Data models in database management. ACM SIGMOD Record, 11(2):112-114, 1981.

[CPP78] M. Colombetti, P. Paolini, and G. Pelagatti. An overview and introduction to logic and data bases. In H. Gallaire and J. Minker, editors, Logic and Data Bases, pages 237-257. Plenum Press, New York, 1978.

[DD95] Luc De Raedt and Luc Dehaspe. Clausal Discovery. Department of Computer Science, Katholieke Universiteit Leuven, 1995. [under submission].

[DSKH+96] S. Dzeroski, S. Schulze-Kremer, K. Heidtke, K. Siems, and D. Wettschereck. Applying ILP to diterpene structure elucidation from 13C NMR spectra. In Johannes Fürnkranz and Bernhard Pfahringer, editors, 13th International Conference on Machine Learning Workshop on Data Mining with Inductive Logic Programming. [see http://www.ai.univie.ac.at/ilp_kdd/schedule.html], July 2 1996.

[Fin79] N.J. Findler, editor. Associative Networks: Representation and Use of Knowledge by Computers. Academic Press, New York, 1979.

[FLG96] S. Fortin, L. Liu, and R. Goebel. Multi-Level Association Rule Mining based on Dynamic Hierarchies. Department of Computer Science, University of Alberta, Edmonton, Alberta, Canada, 1996. [under submission].

[FP96] Johannes Fürnkranz and Bernhard Pfahringer, editors. 13th International Conference on Machine Learning Workshop on Data Mining with Inductive Logic Programming. [see http://www.ai.univie.ac.at/ilp_kdd/schedule.html], July 2 1996.

[FS76] J.P. Fry and E.H. Sibley. Evolution of data-base management systems. ACM Computing Surveys, 8(1):7-42, 1976.

[GMN78] H. Gallaire, J. Minker, and J.M. Nicolas. Nondeterministic languages used for the definition of data models. In H. Gallaire and J. Minker, editors, Logic and Data Bases, pages 3-30. Plenum Press, New York, 1978.

[Goe93] R. Goebel. Spatial and visual imagery: humans versus machines. Computational Intelligence, 9(3):406-412, 1993.

[Gro96] Machine Learning Group. Introduction to the Theory of Inductive Logic Programming. Oxford University Computing Laboratory, 1996. [see http://www.comlab.ox.ac.uk/oucl/groups/machlearn/ilp_theory.html].

[HF95] J. Han and Y. Fu. Discovery of multiple-level association rules from large databases. In Proceedings of the 21st VLDB Conference, Zurich, Switzerland, 1995.

[HFW+96] J. Han, Yongjian Fu, Wei Wang, Krzysztof Koperski, and Osmar Zaiane. DMQL: a data mining query language for relational databases. In R. Ng, editor, Proceedings of the 1996 ACM SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, pages 27-34, Montreal, Canada, June 1996. [Available as Technical Report 96-08, Department of Computer Science, University of British Columbia, Vancouver, Canada].

[HS92] D. Hume and C. Sammut. Applying inductive logic programming in reactive environments. In S. Muggleton, editor, Inductive Logic Programming, pages 539-549. Academic Press, 1992.

[Imi96] Tomasz Imielinski. From file mining to database mining. In R. Ng, editor, Proceedings of the 1996 ACM SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, pages 35-39, Montreal, Canada, June 1996. [Available as Technical Report 96-08, Department of Computer Science, University of British Columbia, Vancouver, Canada].

[Kow81] R.A. Kowalski. Logic as a database language. Department of Computing, Imperial College, July 1981.

[LA96] Jianguo Lu and Jun Arima. Inductive logic programming beyond logical implication. Institute for Social Information Science, Fujitsu Laboratories, Numazu, Japan, 1996. [manuscript].

[McL81] D. McLeod. Tutorial on database research. ACM SIGMOD Record, 11(2):26-28, 1981.

[Mic96] Microsoft. Pilot: Data Mining Overview. Microsoft Corporation, 1996. [see http://www.microsoft.com/exchange/exisv/pilotri.htm].

[Min75] J. Minker. Performing inferences over relation data bases. In W.F. King, editor, Proceedings of the ACM SIGMOD International Conference on the Management of Data, pages 79-91, San Jose, California, May 1975.

[Min78] J. Minker. An experimental relational data base system based on logic. In H. Gallaire and J. Minker, editors, Logic and Data Bases, pages 107-147. Plenum Press, New York, 1978.

[MKS92] S. Muggleton, R.D. King, and M.J.E. Sternberg. Protein secondary structure prediction using logic. Protein Engineering, 7:647-657, 1992.

[Mug92] S. Muggleton. Inductive logic programming. Academic Press, New York, 1992.

[Ng96] R. Ng, editor. Proceedings of the 1996 ACM SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery. Department of Computer Science, University of British Columbia, June 1996. [Available as Technical Report 96-08, Department of Computer Science, University of British Columbia, Vancouver, Canada].

[Rae93] Luc De Raedt. A brief introduction to inductive logic programming. In Proceedings of the 1993 International Symposium on Logic Programming, pages 45-51, Vancouver, Canada, October 26-29 1993.

[RCR96] John R. Roddick, Noel G. Craske, and Thomas J. Richards. Handling discovered structure in database systems. IEEE Transactions on Knowledge and Data Engineering, 8(2):227-240, 1996.

[Rei78] R. Reiter. Deductive question-answering on relational data bases. In H. Gallaire and J. Minker, editors, Logic and Data Bases, pages 149-177. Plenum Press, New York, 1978.

[Rei81] R. Reiter. Data bases: a logical perspective. ACM SIGMOD Record, 11(2):174-176, 1981.

[Rei83] R. Reiter. Towards a logical reconstruction of relational data base theory. In M. Brodie, J. Mylopoulos, and J. Schmidt, editors, On Conceptual Modelling. Springer-Verlag, 1983.

[Vas79] Y. Vassiliou. Null values in database management: a denotational semantics approach. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 162-169, Boston, Massachusetts, May 1979.
