Logic-based Knowledge Discovery in Databases (position paper)

Fosca Giannotti (1), Mirco Nanni (2), Dino Pedreschi (2)

Pisa KDD Laboratory
http://www-kdd.di.unipi.it/

(1) CNUCE Institute, CNR, Italian Nat. Research Council, Via S. Maria 36, 56126 Pisa, Italy
[email protected]

(2) Dipartimento di Informatica, Università di Pisa, Corso Italia 40, 56125 Pisa, Italy
{nnanni, pedre}@di.unipi.it

Abstract. We report here the vision of the Pisa KDD Laboratory, a joint effort of the University of Pisa and the Italian National Research Council, and a research group operating in the area of data mining and knowledge discovery in databases. We first advocate the need for a logic-based knowledge discovery support environment, capable of integrating knowledge extraction and knowledge manipulation. It is our conviction that a high degree of flexibility and expressiveness is needed to support the process of knowledge discovery in databases in the most challenging data analysis applications. The KDD Laboratory is developing a data mining query language that deals effectively and uniformly with data preparation, model extraction, and model evaluation and analysis, thus providing a powerful formalism in which methodologies for classes of challenging applications can be conveniently designed.

Context. Most knowledge-intensive data analysis applications require the combination of two kinds of activities: knowledge acquisition, and reasoning on the acquired knowledge according to the expert rules that characterize the business. Data mining techniques are an answer to the first issue, in that they extract from raw data knowledge that is implicit and, more importantly, at a higher abstraction level. However, the ability to combine the results of knowledge extraction with expert rules is a key success factor when building decision support systems.

Many applications of this kind, in domains such as market basket analysis and fraud detection, are challenging for a variety of reasons.

1. Multiple abstraction levels and separation of concerns. Manipulation and reasoning over knowledge and data at different abstraction levels is required, in a way that closely follows the database design methodology:
   - a conceptual level, where the key issue is semantic integration of domain knowledge, expert (business) rules and extracted knowledge, as well as a uniform knowledge representation layer supporting the semantic integration of different analysis paradigms;
   - a logical level, where the key issue is the mapping to relational databases, and the design of the required analyses in terms of a query language suitably integrated with mining capabilities;
   - a physical level, where the key issue is interoperability among various system components (DBMSs, data mining tools, desktop tools), as well as optimization of queries, mining operations and their combination, including the choice of loose vs. tight coupling between the query language and the specialized mining tools.

2. Management of the KDD process and its tailoring to specific domains. The specification and monitoring of a complex process, the so-called KDD process, is required. Little support is provided in this sense by the currently available technology. The management of the overall process, from data selection and preprocessing (e.g., the definition of domain-dependent cost models associated with the data to be analyzed) to data mining to knowledge evaluation (e.g., the evaluation of the extracted knowledge w.r.t. domain-dependent metrics), is limited in currently available data mining platforms to version control of data mining experiments. What is needed is a high-level design and development environment for vertical data analysis applications, where data mining tools and models are combined, geared, presented and evaluated in a way that is pertinent to the specific domain.

3. Integration of extracted knowledge and domain knowledge. The role of domain, or background, knowledge is relevant at each step of the KDD process: which attributes discriminate best, how a correct or useful profile can be characterized, and what the interesting exception conditions are, are all examples of domain-dependent notions. Notably, in the evaluation phase we need to associate with each inferred knowledge structure some quality function that evaluates its information content in the specific domain. The notion of quality strictly pertains to the business decision process. However, while it is possible to define quantitative measures for certainty (e.g., estimated prediction accuracy on new data) or utility (e.g., gain, speed-up, etc.), notions such as novelty and understandability depend much more on the task, and are hence difficult to define. More generally, a logically uniform representation of data, domain knowledge and extracted knowledge is required, in order to easily express high-level business rules.

All in all, the design and development of knowledge-intensive data analysis applications raise diverse problems that are poorly addressed by the current generation of commercially available systems.

Position. The position that we maintain is that a coherent formalism, capable of dealing uniformly with induced knowledge and background, or domain, knowledge, would represent a breakthrough in the design and development of decision support systems, in diverse application domains. The advantages of such an integrated formalism are, in principle:

• a high degree of expressiveness in specifying expert rules, or business rules;
• the ability to formalize the overall KDD process, thus tailoring a methodology to a specific class of applications;
• the separation of concerns between the design level and the mapping to the underlying databases and data mining tools.

We believe that a suitable integration of deductive reasoning, such as that supported by logic database languages, and inductive reasoning provides a powerful formalism in which methodologies for classes of data analysis applications are conveniently specified. We developed two case studies, from two relevant application domains, to illustrate the benefits of a uniform representation of induced knowledge and domain knowledge:

• market basket analysis, which requires the development of business rules of value for the market analyst (see [4]), and
• fraud detection, which requires the construction and the evaluation of models of fraudulent behavior (see [2, 5]).

We adopted the LDL++ [10, 11, 12] deductive database system, a rule-based language with a Prolog-like syntax and a semantics that extends that of relational database query languages with recursion. Other advanced mechanisms, such as non-determinism, non-monotonicity, and user-defined aggregates, make LDL++ a highly expressive query language and a viable system for knowledge-based applications [6]. The data mining extensions mentioned in this paper are studied extensively in [2, 4, 5, 10].

How can a logic-based database language support the KDD process?

Market basket analysis. Our KDD lab engaged in a data mining project with COOP, one of the largest Italian retailers, aimed at an intelligent system for the analysis of supermarket sales data supporting business rules for the market analyst [4]. In market basket analysis, association rules are a popular data mining tool [1]. However, association rules are often too low-level to be used directly in support of marketing decisions. Market analysts expect answers to more general questions, such as “Is supermarket assortment adequate for the company's target class of customers?” or “Is a promotional campaign effective in establishing a desired purchasing habit in the target class of customers?” These are business rules, and association rules are necessary, albeit insufficient, basic mechanisms for their construction. Business rules require the ability to combine association rule mining with deduction, or reasoning: reasoning on the temporal dimension, reasoning at different levels of granularity of products, reasoning on the spatial dimension, and reasoning on association rules themselves. As an example of simple, yet useful, temporal reasoning on rules, we can answer the question: which rules have been established by a given promotion?
Assume that the relation valid_in(interval,ass_rule) collects the extracted association rules that hold during the specified interval (observe that the notion of validity of a rule in an interval is domain-dependent, and should be properly specified). The following LDL++ fragment computes an answer to the above question, by selecting the rules that were not valid before the promotion, became valid during the promotion, and remained valid afterwards:

    interval(before, -∞, 7/9/1999).
    interval(promotion, 8/9/1999, 29/9/1999).
    interval(after, 30/9/1999, +∞).

    established_rule(AssRule) ←
        valid_in(promotion, AssRule),
        ¬ valid_in(before, AssRule),
        valid_in(after, AssRule).
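For readers unfamiliar with logic database languages, the selection expressed by the established_rule clause can be mimicked with plain set operations. The following is a minimal Python sketch, not LDL++ and not the system used in the project: the dictionary layout standing in for valid_in and the three sample rules are illustrative assumptions.

```python
# Hypothetical stand-in for the valid_in(interval, ass_rule) relation:
# each interval name maps to the set of association rules valid in it.
# The rule strings below are invented sample data.
valid_in = {
    "before": {"bread => butter"},
    "promotion": {"bread => butter", "pasta => tomato", "chips => cola"},
    "after": {"bread => butter", "pasta => tomato"},
}


def established_rules(valid_in):
    """Rules valid during and after the promotion, but not before it."""
    # Conjunction of the two positive literals is a set intersection;
    # the negated literal becomes a set difference.
    return (valid_in["promotion"] & valid_in["after"]) - valid_in["before"]


print(sorted(established_rules(valid_in)))  # ['pasta => tomato']
```

In this sample, "chips => cola" is dropped because it did not survive the promotion, and "bread => butter" is dropped because it already held beforehand. The declarative rule states the same three conditions without fixing an evaluation order.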

It is clear that a language where knowledge extraction and knowledge manipulation are integrated can provide flexibility and expressiveness in supporting the process of knowledge discovery in databases [9].

Planning audit strategies in fraud detection. A second case study concerns fiscal fraud detection, and was developed within a project with the Italian Ministry of Finance, aimed at investigating the adequacy and sustainability of KDD in the detection of tax evasion [2]. Audit planning is a difficult task, which has to take into account constraints on the available resources, both human and financial. Therefore, planning has to face two conflicting issues:

• maximizing audit benefits, i.e., selecting the subjects to be audited in such a way that the recovery of evaded tax is maximized, and
• minimizing audit costs, i.e., selecting the subjects to be audited in such a way that the resources needed to carry out the audits are minimized.
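One simple way to picture the tension between the two issues above is a budget-constrained selection. The sketch below is written in Python and is only an illustration, not the methodology of [2]: the Subject fields, the greedy ratio heuristic, and all figures are assumptions introduced here.

```python
from typing import NamedTuple


class Subject(NamedTuple):
    name: str
    expected_recovery: float  # hypothetical predicted recovered tax
    audit_cost: float         # hypothetical audit cost, assumed > 0


def select_audits(subjects, budget):
    """Greedily pick subjects by recovery-to-cost ratio within the budget."""
    ranked = sorted(
        subjects,
        key=lambda s: s.expected_recovery / s.audit_cost,
        reverse=True,
    )
    chosen, spent = [], 0.0
    for s in ranked:
        # Skip any subject whose audit would exceed the remaining budget.
        if spent + s.audit_cost <= budget:
            chosen.append(s)
            spent += s.audit_cost
    return chosen


# Invented sample data for illustration only.
subjects = [
    Subject("A", 900.0, 300.0),  # ratio 3.0
    Subject("B", 500.0, 100.0),  # ratio 5.0
    Subject("C", 400.0, 400.0),  # ratio 1.0
    Subject("D", 350.0, 200.0),  # ratio 1.75
]
selected = select_audits(subjects, budget=600.0)
print([s.name for s in selected])  # ['B', 'A', 'D']
```

A greedy ratio ranking is a heuristic and does not guarantee the best total recovery for a given budget. It is used here only because it makes the benefit/cost trade-off explicit in a few lines.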

The capability of designing systems that support this form of decision poses technically precise challenges [3, 7]. Is there a KDD methodology for audit planning that may be tuned according to these options (a desideratum also referred to as autofocus data mining)? Which data mining tools may be adopted, and how? How may extracted knowledge be combined with domain knowledge to obtain useful audit models? We adopted classification-based KDD with decision trees [8], and built predictive models of fraudulent taxpayers for the purpose of directing the audits towards potentially more profitable subjects. Here, the ability to combine model construction and evaluation supports autofocus data mining: for instance, a few lines of recursive LDL++ code support the iterative construction of a sequence of predictive models, until a desired level of domain-dependent quality of the model is reached.

Research agenda of the Pisa KDD Laboratory. Our lab is a joint initiative of the University of Pisa and the Italian National Research Council, and currently consists of 3 senior researchers (Fosca Giannotti, Dino Pedreschi, and Franco Turini), 3 junior researchers (Giuseppe Manco, Chiara Renso, and Salvatore Ruggieri), and 2 Ph.D. students (Francesco Bonchi and Mirco Nanni). Our goals can be summarized as follows:

• Design and implementation of a KDSE (Knowledge Discovery Support Environment):
   - aimed at the development of intelligent data analysis systems with data mining techniques;
   - oriented at the uniform treatment of extracted knowledge with domain-specific knowledge;
   - based on a logic database language (rule-based, SQL-compatible) that facilitates the integration of mining and querying.
• Development of methodologies for the design of intelligent data analysis systems tailored to specific classes of challenging applications, starting from the experience gained in the areas of market basket analysis and fraud detection.

References

1. R. Agrawal, R. Srikant. Fast Algorithms for Mining Association Rules. In Proc. of 20th Int. Conference on Very Large Databases, 1994.
2. F. Bonchi, F. Giannotti, G. Mainetto, D. Pedreschi. A Classification-Based Methodology for Planning Audit Strategies in Fraud Detection. In Proc. 5th ACM-SIGKDD Int. Conf. on Knowledge Discovery & Data Mining, KDD'99, pages 175-184. ACM Press, August 1999.
3. P.K. Chan, S.J. Stolfo. Learning with Non-uniform Class and Cost Distributions: Effects and a Multi-Classifier Approach. Machine Learning Journal, 1999, 1-27.
4. F. Giannotti, M. Nanni, G. Manco, D. Pedreschi, F. Turini. Integration of Deduction and Induction for Mining Supermarket Sales Data. In Proc. PADD'99, Practical Application of Data Discovery, Int. Workshop, p. 79-94. The Practical Applications Company, London, 1999.
5. F. Bonchi, F. Giannotti, G. Mainetto, D. Pedreschi. Using Data Mining Techniques in Fiscal Fraud Detection. In Proc. DaWak'99, 1st Int. Conf. on Data Warehousing and Knowledge Discovery, Florence, 1999. LNCS, n. 1676, p. 369-376. Springer, 1999.
6. F. Giannotti, G. Manco, M. Nanni, D. Pedreschi. Query Answering in Nondeterministic Nonmonotonic Logic Databases. In Procs. of the Workshop on Flexible Query Answering, LNAI, n. 1395. Springer, 1998.
7. T. Fawcett, F. Provost. Adaptive Fraud Detection. Data Mining and Knowledge Discovery, pp. 291-316, 1997.
8. J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
9. J. Han. Towards On-Line Analytical Mining in Large Databases. In SIGMOD Record, 27(1):97-107, 1998.
10. W. Shen, K. Ong, B. Mitbander, C. Zaniolo. Metaqueries for Data Mining. In Advances in Knowledge Discovery and Data Mining, pages 375-398. AAAI Press/The MIT Press, 1996.
11. C. Zaniolo, N. Arni, K. Ong. Negation and Aggregates in Recursive Rules: The LDL++ Approach. In Proc. DOOD93, LNCS vol. 760, 1993.
12. C. Zaniolo, H. Wang. Logic-Based User-Defined Aggregates for the Next Generation of Database Systems. In The Logic Programming Paradigm: Current Trends and Future Directions. Springer Verlag, 1998.