Knowledge Discovery and Data Mining in Databases

Vladan Devedzic
FON - School of Business Administration, University of Belgrade, Yugoslavia

Knowledge Discovery in Databases (KDD) is the process of automatic discovery of previously unknown patterns, rules, and other regular contents implicitly present in large volumes of data. Data Mining (DM) denotes the discovery of patterns in a data set previously prepared in a specific way. DM is often used as a synonym for KDD; strictly speaking, however, DM is just the central phase of the entire KDD process. The purpose of this chapter is to gradually introduce the KDD process and typical DM tasks. The idea of automatic knowledge discovery in large databases is first presented informally, by describing some practical needs of users of modern database systems. Several important concepts are then formally defined, and the typical context and resources for KDD are discussed. The scope of KDD and DM is then briefly presented, in terms of a classification of KDD/DM problems and of the common points between KDD and several other scientific and technical disciplines whose well-developed methodologies and techniques are used in the field of KDD. After that, the chapter describes the typical KDD process, DM tasks, and some of the algorithms most frequently used to carry out such tasks. Some other important aspects of KDD are covered as well, such as using domain knowledge in the KDD process and evaluating discovered patterns. Finally, the chapter briefly surveys some important KDD application domains and practical KDD/DM systems, and discusses several hot topics and research problems in the field that are of interest to the software industry.
Introduction

As a result of the constant increase of business needs, the amount of data in current database systems grows extremely fast. Since the cost of data storage keeps dropping, users store all the information they need in databases. Moreover, people believe that by storing data in databases they may save information that might turn out to be useful in the future, even though it is of no direct value at the moment [13].
Handbook on Software Engineering & Knowledge Engineering
The Key Idea of KDD and DM

Raw data stored in databases are seldom of direct use. In practical applications, data are usually presented to the users in a modified form, tailored to specific business needs. Even then, people must analyze the data more or less manually, acting as sophisticated "query processors". This may be satisfactory if the total amount of data being analyzed is relatively small, but it is unacceptable for large amounts of data. What is needed in such a case is automation of the data analysis tasks, and that is exactly what KDD and DM provide. They help people improve the efficiency of the data analysis they perform. They also make it possible for people to become aware of useful facts and relations that hold among the data they analyze and that could not be known otherwise, simply because of the overload caused by heaps of data. Once such facts and relations become known, people can greatly improve their business in terms of savings, efficiency, quality, and simplicity. Typically, KDD/DM systems are not general-purpose software systems; rather, they are developed for specific users, to help them automate data analysis in precisely defined application domains.
Definitions

Knowledge discovery is the process of nontrivial extraction of information from data: information that is implicitly present in the data, previously unknown, and potentially useful to the user [13]. The information must be in the form of patterns that are comprehensible to the user (such as, for example, If-Then rules). For a data set F, represented in a language L, if Fs is a subset of F and c denotes a certainty measure, then a pattern is an expression S in the language L, holding with certainty c, about some relation(s) among the data in Fs. In order for the expression S to really be a pattern, it must be simpler than a mere enumeration of all the data in F.

Knowledge is a pattern that is sufficiently interesting to the user and sufficiently certain. The user specifies the measure of interestingness (see below) and the certainty criterion. Discovered knowledge is the output from a program that analyzes a data set and generates patterns. A pattern's certainty is the measure of confidence in the discovered knowledge represented by the pattern. Discovered knowledge is seldom valid for all the data in the data set considered. A pattern's certainty is higher if the data in the data set considered are good representatives of the data in the database, if they contain little or no noise, and if they are valid, reliable, complete, precise, and free of contradictions. A pattern's measure of interestingness is a quantitative indicator used in pattern evaluation. Only interesting patterns are knowledge. A pattern is interesting if it is new, non-trivial, and useful. Data mining is the process of pattern discovery in a data set from which noise has been previously eliminated and which has been transformed in such a way as to enable the pattern discovery process. Data mining is always based on a data-mining algorithm.
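The certainty and support measures defined above can be made concrete with a small sketch. The record representation (dicts of field values) and the rule encoding are invented here for illustration; they are not taken from any particular KDD system:

```python
def rule_support_and_certainty(records, if_part, then_part):
    """Evaluate an If-Then pattern over a data set.

    support   = fraction of all records matching both the If and Then parts
    certainty = fraction of records matching the If part that also
                match the Then part (i.e., confidence in the rule)
    """
    matches_if = [r for r in records
                  if all(r.get(k) == v for k, v in if_part.items())]
    matches_both = [r for r in matches_if
                    if all(r.get(k) == v for k, v in then_part.items())]
    support = len(matches_both) / len(records)
    certainty = len(matches_both) / len(matches_if) if matches_if else 0.0
    return support, certainty

# A toy loan data set (invented for illustration)
records = [
    {"Home_Loan": "Yes", "Marital_Status": "MARRIED"},
    {"Home_Loan": "Yes", "Marital_Status": "MARRIED"},
    {"Home_Loan": "Yes", "Marital_Status": "SINGLE"},
    {"Home_Loan": "No",  "Marital_Status": "MARRIED"},
]
s, c = rule_support_and_certainty(records,
                                  {"Home_Loan": "Yes"},
                                  {"Marital_Status": "MARRIED"})
# support = 2/4 = 0.5, certainty = 2/3
```

A real KDD system would also compute an interestingness score over such candidate patterns and discard those below the user's threshold.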
The Context and Resources for KDD

Figure 1 illustrates the context and computational resources needed to perform KDD [20]. The necessary assumptions are that there exists a database with its data dictionary, and that the user wants to discover some patterns in it. There must also exist an application through which the user can select (from the database) and prepare a data set for KDD, adjust DM parameters, start and run the KDD process, and access and manipulate discovered patterns. KDD/DM systems usually let the user choose among several KDD methods. Each method enables preparation of a data set for automatic analysis, searching that set in order to discover/generate patterns (i.e., applying a certain kind of DM over that set), as well as pattern evaluation in terms of certainty and interestingness. KDD methods often make it possible to use domain knowledge to guide and control the process and to help evaluate the patterns. In such cases domain knowledge must be represented using an appropriate knowledge representation technique (such as rules, frames, decision trees, and the like). Discovered knowledge may be used directly for database queries from the application, it may be included in another knowledge-based program (e.g., an expert system in the domain), or the user may simply save it in a desired form. Discovered patterns mostly represent previously unknown facts from the domain. Hence they can be combined with previously existing, represented domain knowledge in order to better support subsequent runs of the KDD process.
Figure 1 - The Context and Resources for KDD (after [20])
Examples

Some airline companies use databases of their passengers to discover patterns in the way they fly (embarkation and destination ports, return flights, routes, frequency of flying to a specific destination, etc.). Exceptional passengers get promotional prizes, which attracts more customers to the company's Frequent Flyer program. Another example is the way many banks use KDD/DM systems to explore their databases of credits and loans. Based on the patterns they discover from such databases, the patterns' certainties, and the measures of the discovered patterns' interestingness, the banks can more successfully predict the outcome and possible consequences of approving a loan to certain customers, thus increasing the quality of their business decisions. Marketing agencies use KDD/DM systems to discover patterns in the way customers buy retail products. Once they discover that many people buy product A together with product B, they can easily create an appropriate and potentially successful commercial or marketing announcement.
The Scope of KDD/DM and Typical Problems

The fields of KDD and DM developed quite gradually, and different names were used for them in the past (such as data archeology, knowledge extraction, information discovery, information harvesting, and pattern processing). The concept of DM had been known in the field of statistics long before database professionals started to develop and use KDD. It was only in the late 1980s and early 1990s that the database community showed interest in KDD and DM. Since the mid-1990s, however, both fields have gone through a rapid expansion, due to the extraordinary support and attention of the software industry.
Classification of KDD/DM Problems

Being the central activity of KDD, DM certainly represents the most challenging problem of KDD. However, KDD covers not only DM, but many other problems and related concepts, topics, activities, and processes as well [1], [11]-[26]. They include (but are not limited to) the following.

Integration of different methods of machine learning and knowledge discovery. Machine learning provides a number of algorithms for learning general concepts from specific examples, learning by analogy, learning classification rules, and so on, all of which are useful for pattern discovery in the KDD process.

Integration of knowledge-based systems and statistics. Knowledge-based systems provide a rich spectrum of knowledge-representation techniques, some of which are used for representing the patterns discovered by KDD processes. For example, discovered patterns are often represented in the form of rules or decision trees. Also, many KDD/DM systems nowadays are based on neural networks, since neural networks can also be used to identify patterns and to eliminate noise from data. Statistical measures are always necessary in KDD, because they help select and prepare data for KDD, as well as quantify the importance, certainty, inter-dependencies, and other features of discovered patterns.

Discovering dependencies among the data. A change of a data item in a large database can sometimes result in other data changing as well, and it is often useful to be able to predict such changes.

Using domain knowledge in the KDD process. Domain knowledge can dramatically improve the efficiency of the KDD process (see the dedicated section below).

Interpreting and evaluating discovered knowledge. Once some patterns are discovered using an appropriate DM algorithm, it is important to determine whether they have an important meaning for the user or just represent useless relations among the data.

Including discovered knowledge into previously represented domain knowledge. The idea here is to use the knowledge discovered in the KDD process along with the other facts and rules from the domain that help guide the process, and thus further improve the process's efficiency.

Query transformation and optimization. Discovering useful patterns in data can help modify database queries accordingly and hence improve data access efficiency.

Discovery of data evolution. Data analysts often benefit from insight into possible patterns of change of certain data in the past.

Discovery of inexact concepts in structured data. Highly structured data (such as data records in relational databases) sometimes hide previously unknown concepts which do make sense to domain experts, although the experts often cannot define the meanings of such concepts exactly; such concepts can be discovered through appropriate combinations of fields in data records, groupings of subsets of data in larger data sets, and so on.

Process, data, and knowledge visualization. In some applications, patterns and activities in the KDD process are best described by means of various graphs, shading, clusters of data, and animation. Selecting the most appropriate visualization technique is an important problem in KDD.
Error handling in databases. Unfortunately, real-world databases are full of inconsistencies, erroneous data, imprecision, and other kinds of noise. Handling such data properly can be the key to the success of the entire KDD process.

Representing and processing uncertainty. Discovered patterns are never absolutely certain, and there are various ways to represent their certainty.

Integrating object-oriented and multimedia technologies. Most KDD systems today work with relational databases. Only recently have KDD and DM begun to extend to object-oriented and multimedia databases.

Various ethical, social, psychological, and legal aspects. Although KDD applications are useful, not all of them are 100% ethical. For example, mining various kinds of race-dependent data, some medical records, as well as some tax-payment databases, can sometimes provoke unwanted interpretations, such as threats to privacy. There are also examples of KDD projects being abandoned because of possible legal consequences.

There is another kind of classification that is often used in describing KDD/DM systems. It is based on the typical kinds of knowledge that the DM process mines for, and on the corresponding types of DM activities (such as cluster identification, mining association rules, or deviation detection; see the dedicated section on DM tasks).
KDD/DM and Other Scientific and Technical Disciplines

It follows from the above classification of KDD/DM problems that many other scientific and technical disciplines have some points in common with KDD and DM. KDD/DM specialists apply (possibly with some adaptation) many useful algorithms, techniques, and approaches originally developed and used in other fields. Summing up the above list and a number of discussions of influences from other fields on KDD/DM [11], [14], [21], one may note that the fields that have much to do with KDD and DM are database technology itself, statistics, pattern recognition, knowledge acquisition and representation, intelligent reasoning, expert systems, and machine learning. All of these fields except database technology and statistics are also about identifying, representing, and processing knowledge, so it is no wonder that they go together well with KDD. One other field of computer science and engineering requires a special note, because it is the one that probably overlaps most with KDD/DM: data warehousing, since data warehouses provide much of the data preprocessing used in the KDD process as well (see the section on KDD as a process, later in this chapter). A further comment is also necessary on the relation between KDD/DM and statistics. Although KDD and DM rely heavily on techniques from statistics and probability theory [15], [26], it is important to stress that statistics alone is not sufficient for KDD. It certainly enables some data analysis, but it necessarily requires the user to take part in it, whereas the goal of KDD/DM is automated data analysis. Statistics often gives results that are hard to interpret, and it cannot handle non-numerical, structured data. It also does not enable the use of domain knowledge in data analysis.
KDD as a Process

Knowledge discovery is a process, not a one-time response of the KDD system to a user's action. Like any other process, it has its environment and its phases, and it runs under certain assumptions and constraints. Figure 2 shows the typical data sets, activities, and phases in the KDD process [12]. Its principal resource is a database containing a large amount of data to be searched for possible patterns. KDD is never done over the entire database, but over a representative target data set generated from the large database. In most practical cases, data in the database and in the target data set contain noise, i.e., erroneous, inexact, imprecise, conflicting, exceptional, and missing values, as well as ambiguities. By eliminating such noise from the target data set, one gets the set of preprocessed data. The set of transformed data, generated from the preprocessed data set, is used directly for DM. The output of DM is, in general, a set of patterns, some of which possibly represent discovered knowledge.
Figure 2 - Phases in the KDD process (after [12])
The phases of the process are as follows. Selection is an appropriate procedure for generating the target data set from the database. Its major goal is to select typical data from the database, in order to make the target data set as representative as possible.

The preprocessing phase eliminates noise from the target data set and possibly generates specific data sequences in the set of preprocessed data. Such sequences are needed by some DM tasks (e.g., sequence analysis; see the next section for a more detailed explanation).

The next phase is transformation of the preprocessed data into a form suitable for performing the desired DM task. DM tasks are specific kinds of activities carried out over the set of transformed data in search of patterns, guided by the kind of knowledge that should be discovered. The kinds of transformations of the preprocessed data depend on the DM task. Typically, transformations include some appropriate reduction of the number of fields in the data records, since each DM task focuses only on a subset of the data record fields. Also, some further modifications and combinations of the remaining data record fields can be made in order to map the original data onto a data space more suitable for the DM task that will be carried out in the next phase.

In the DM phase, a procedure is run that executes the desired DM task and generates a set of patterns. However, not all of the patterns are useful. The goal of interpreting and evaluating the discovered patterns is to keep only those that are interesting and useful to the user and discard the rest. The patterns that remain represent the discovered knowledge.
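The phases above compose naturally as a pipeline of functions. The following sketch illustrates that flow; the selection predicate, the noise test, and the trivial "mining" step are all invented placeholders, not part of any real KDD system:

```python
def select(database, predicate):
    """Selection: build a representative target data set."""
    return [rec for rec in database if predicate(rec)]

def preprocess(target):
    """Preprocessing: drop records with missing (noisy) values."""
    return [rec for rec in target if None not in rec.values()]

def transform(clean, fields):
    """Transformation: keep only the fields the DM task needs."""
    return [{f: rec[f] for f in fields} for rec in clean]

def mine(transformed):
    """DM: here, a trivial 'task' that counts value combinations."""
    patterns = {}
    for rec in transformed:
        key = tuple(sorted(rec.items()))
        patterns[key] = patterns.get(key, 0) + 1
    return patterns

def evaluate(patterns, min_count):
    """Interpretation/evaluation: keep only 'interesting' patterns."""
    return {p: n for p, n in patterns.items() if n >= min_count}

database = [
    {"region": "North", "product": "A", "qty": 1},
    {"region": "North", "product": "A", "qty": 2},
    {"region": "South", "product": "B", "qty": None},  # noisy record
    {"region": "North", "product": "B", "qty": 1},
]
target = select(database, lambda r: r["region"] == "North")
knowledge = evaluate(mine(transform(preprocess(target),
                                    ["region", "product"])), min_count=2)
```

In practice each stage would be far more elaborate (and iterated, as Figure 2 suggests), but the data flow from database to discovered knowledge follows this shape.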
8
Handbook on Software Engineering & Knowledge Engineering
In practice, the KDD process never runs smoothly. On the contrary, it is a time-consuming, incremental, and iterative process by its very nature, hence the many repetition and feedback loops in Figure 2. Individual phases can be repeated alone, and the entire process is usually repeated for different data sets. Discovered patterns are usually represented using a well-known knowledge representation technique, including inference rules (If-Then rules), decision trees, tables, diagrams, images, analytical expressions, and so on [1], [24]. If-Then rules are the most frequently used technique [12], [21]. The following example is a pattern from the financial domain, discovered by applying the MKS system described in [3]:

IF   Home_Loan = Yes
THEN Post_Code = POST_RURAL and
     Gender = MALE and
     Marital_Status = MARRIED and
     Access_Card = Yes and
     Credit_Turnover = 4000_GTR and
     Account_Type = CURRENT and
     Credit_Amount = 4500_GTR
WITH certainty = 23.75%, support = 23.75%, interestingness = 0.806
Decision trees are a suitable alternative to If-Then rules, since with many machine learning algorithms the concepts the program learns are represented in the form of decision trees. Transformations between decision trees and inference rules are easy and straightforward. Rules often depend on each other, so the discovered knowledge often has the form of causal chains or networks of rules.

KDD systems usually apply some probabilistic technique to represent uncertainty in discovered patterns. Some form of certainty factors often goes well with inference rules (see the certainty and support indicators in the above example). Probability distributions are easy to compute statistically, since the databases that KDD systems start from are sufficiently large, and they are especially suitable for modeling noise in data. Fuzzy sets and fuzzy logic are also used sometimes. However, it is important to note that an important factor in modeling uncertainty is the effort to actually eliminate sources of uncertainty (such as noisy and missing data) in the early phases of the process.

The advantage of representing patterns in textual form, such as If-Then rules, is that it is simple to implement and well understood, since it reflects human cognitive processing. The main disadvantages are that textual patterns may be hard to follow and interpret (they may be rather complex), and that some patterns are very hard to express as text [13], [28]. Graphical forms of pattern representation, such as diagrams, images, and shading, are visually rich and exploit human perceptual and cognitive abilities for pattern interpretation [10], [18]. For example, a shaded region on the map of a certain country may denote the region where the KDD/DM system discovers that some increase in sales can be expected under certain conditions. As another example, domain experts may easily interpret peaks on a diagram representing a pattern discovered in a patient database about the effects of a certain drug administered to a number of patients. However, patterns represented graphically are harder to implement and require more complex interfaces and supporting software.

Exactly which technique should be used to represent discovered knowledge depends on the goals of the discovery process. If the discovered knowledge is for the users only, then natural-language representation of rules or some graphically rich form is most suitable. Alternatively, discovered knowledge may be used in the knowledge base of another intelligent application, such as an expert system in the same domain. In that case, discovered knowledge should be translated into the form used by that other application. Finally, discovered knowledge may be used along with the previously used domain knowledge to guide the next cycle of the KDD process. That requires representing discovered patterns in the same way the other domain knowledge is represented in the KDD system.
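The straightforward transformation between decision trees and inference rules can be sketched directly: each root-to-leaf path in the tree becomes one If-Then rule. The nested-dict tree encoding and the toy loan-approval tree below are invented for illustration:

```python
def tree_to_rules(tree, conditions=()):
    """Each path from the root to a leaf yields one If-Then rule."""
    if not isinstance(tree, dict):          # leaf: a class label
        return [(list(conditions), tree)]
    rules = []
    for (attribute, value), subtree in tree.items():
        rules += tree_to_rules(subtree, conditions + ((attribute, value),))
    return rules

# A toy loan-approval decision tree (hypothetical)
tree = {
    ("income", "high"): "approve",
    ("income", "low"): {
        ("has_collateral", "yes"): "approve",
        ("has_collateral", "no"): "reject",
    },
}
rules = tree_to_rules(tree)
# e.g. one resulting rule: IF income=low AND has_collateral=no THEN reject
```

The reverse direction (rules to a tree) is equally mechanical when the rules are mutually exclusive and exhaustive, which is why the chapter treats the two representations as interchangeable.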
Data Mining Tasks

DM tasks depend on the kind of knowledge that the KDD/DM system looks for. Each DM task has its specific features and follows specific steps in the discovery process. The following DM tasks are among the most frequently used in today's KDD/DM applications [1], [11]-[21], [24], [26].

Classification. The task is to discover whether an item from the database belongs to one of several previously defined classes. The main problem, however, is how to define the classes. In practice, classes are often defined using specific values of certain fields in the data records, or some derivatives of such values. For example, if a data record contains the field Region, then some of the field's typical values (e.g., North, South, West, East) may define the classes.

Cluster identification. In this task the goal is to generate descriptions of classes of data (clusters), as opposed to classification, in which the classes are known beforehand. When identifying clusters, various Euclidean distance measures are used to compute how close data items are to each other in an n-dimensional space defined by the fields in the data records. For example, densely distributed points in a section of a 2-D graph can be interpreted as a cluster.

Summarization. This is the task of discovering common and/or essential characteristics of groups of data in a compact form (e.g., "all Olympic winners of the 100m race were black and under 28"). Typically, the compact form is summarization rules, and more sophisticated techniques make it possible to generate functional dependencies among data items.
10
Handbook on Software Engineering & Knowledge Engineering
Dependency modeling. The goal of this task is to describe important dependencies among the data in a data set. Dependency means that the value of a data item can be predicted with some certainty if the value of another data item is known (e.g., A → B, CF = 0.93). A collection of interrelated dependencies forms a dependency graph, which shows cascaded dependencies (dependency propagation).

Change and deviation detection. One form of knowledge that can be discovered in a data set is related to specific deviations and changes of data values with respect to some expected values (e.g., changes in the mean value, standard deviation, probability density, gradient, and other statistical indicators, as well as detection of outliers, changes of data values over time, and differences between expected and observed data values).

Discrimination. In the context of KDD, discrimination means identification of distinct features that make it possible to differentiate between two classes of data. For example, "truly ascetic people are thin". Discrimination is naturally done before cluster identification, classification, change and deviation detection, and discovering reference values and attributes.

Discovering reference values and attributes. In order to perform some other DM tasks, such as change and deviation detection, it is necessary to discover reference values and attributes first and then use them for comparisons with other data. For example, "soldiers on duty wear uniforms and carry weapons".

Sequence analysis. Some data have their full meaning and exhibit correlations only if they are considered within a regular series (e.g., time-series data). The selection and preprocessing phases of the KDD process are quite critical in this task, since the data sets must represent a regular sequence of data.

Mining association rules. This is a frequently used DM task. It detects correlations between two or more fields in a data record, or among sets of values in a single field.
The correlations are represented as rules, such as "Of all data records that contain A, B, and C, 72% also contain D and E". This is important if the task is, for example, to discover what items consumers usually buy together in retail shopping ("shopping patterns").

Link analysis. This means detecting correlations among different data records, usually in the form of rules with certainty factors. It is similar to mining association rules, but here the correlations are between different records, not between different fields.

Spatial dependency analysis. This task discovers patterns in spatial data in geographical information systems, astrophysical systems, ecological systems, and so on. For example, "The price of real estate at location X always rises in September and drops in May".

Discovering path traversal patterns. Sometimes dependencies among data items can be suitably modeled using graphs. In such cases it may be interesting to discover paths in the graph that are important in a certain sense and are not explicitly known. A typical example is discovering path traversal patterns in accessing WWW pages. Knowing such patterns can help design Web-based applications better [8], [9].
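Of the tasks above, cluster identification is perhaps the easiest to illustrate in code. The sketch below is a bare-bones k-means variant using plain Euclidean distance; it is just one of many possible clustering algorithms, not a specific system from the literature, and the sample points are invented:

```python
import math
import random

def kmeans(points, k, iterations=20, seed=0):
    """Group points so that within-cluster Euclidean distances are small."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)        # pick k initial centroids
    for _ in range(iterations):
        # assign every point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # move each centroid to the mean of its cluster
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return clusters

# Two visually obvious groups of 2-D points
points = [(1, 1), (1, 2), (2, 1), (9, 9), (9, 10), (10, 9)]
clusters = kmeans(points, k=2)
```

Note how the algorithm only sees numeric coordinates: this is exactly the limitation discussed later for numerical pattern-identification algorithms, which handle structured, non-numeric data poorly.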
Data Mining Algorithms

DM algorithms are procedures for pattern extraction from a set of cleaned, preprocessed, transformed data. Many DM algorithms are not used exclusively in DM, but have been adopted from other disciplines and adapted for DM; however, some DM algorithms have been developed originally by DM experts. There is no such thing as a universally good DM algorithm. While each DM algorithm corresponds to one specific DM task, there are usually several popular algorithms for performing each DM task, and DM experts continue to invent new algorithms and improve existing ones.

The two main aspects of each DM algorithm are pattern identification and pattern representation and description. Pattern identification is the process of discovering collections (classes) of data items that have something in common. Numerical DM algorithms are often used for pattern identification. All of them are based on maximizing the similarities between the data within a class and minimizing the similarity between the classes themselves; Euclidean distance measures are used to compute the similarities. However, numerical DM algorithms work well only with essentially numerical data. For structured data, such as records and objects, numerical DM algorithms are not easy to apply. Moreover, it is hard to use domain knowledge (e.g., knowledge about the shape of a cluster) with numerical algorithms. Another kind of algorithm for pattern identification is algorithms for discovering conceptual clusters. They use similarities among the attributes of data items, as well as the conceptual cohesion of the data items, which is defined by domain knowledge. Such algorithms work well for both numerical and structured data. Finally, it is possible to identify patterns and clusters interactively, i.e., with active participation of the user. The idea is to combine the computing power of the KDD/DM system with the knowledge and visual capabilities of the user.
The system presents data sets to the user in a suitable graphical form, along various dimensions, and the user herself/himself defines the patterns and the clusters and guides the KDD/DM process in a desired direction.

Pattern representation and description is a transformation of the procedure defined by a DM algorithm, and of the results obtained by applying the algorithm, into the space of universally known algorithms, methods, and techniques used in other disciplines as well. There are two ideas behind it. The first is to transform the results obtained by applying a DM algorithm into a form that the user easily understands and from which the user can easily tell the meaning of the discovered patterns. DM algorithms in their basic form usually generate patterns that are suitable in terms of their representation in the computer and/or processing efficiency, but the user needs patterns in a form that she/he is used to, such as rules, decision trees, linear models, or diagrams. The other idea is to take algorithms already used in other disciplines for pattern identification, cluster description, computation of probability density and correlation, etc., and apply them to DM. In that sense, DM systems use numerous well-established algorithms and techniques, such as statistical methods, neural networks, genetic algorithms, the k-nearest-neighbor algorithm, various techniques for using domain knowledge, case-based reasoning, Bayesian networks, and so on.

To get a feeling for how a typical DM algorithm works, consider the famous Apriori algorithm for mining association rules [2], [8]. The purpose of the algorithm is to discover association rules of the following form:

P ⇒ Q, i.e., P1 ∧ P2 ∧ … ∧ Pm ⇒ Q1 ∧ Q2 ∧ … ∧ Qn

The meaning of this expression is that if a data record or one of its attributes contains the item set P, then it also contains the item set Q, where P ∩ Q = ∅. In fact, Apriori discovers patterns in the form of so-called large k-itemsets (k = 1, 2, …), which are essentially the sets of items that are often associated (often go together) within individual data records in the database. "Often" means that there is sufficient support s (above a predefined threshold) for the large k-itemsets in the database. For example, if {A B} is a large 2-itemset, then at least s% of the data records in the database contain both A and B. The corresponding association rules are A ⇒ B and B ⇒ A. In other words, there is usually no one-to-one correspondence between large itemsets and association rules, but the transformation from large itemsets to association rules is straightforward. The rules that correspond to the large 4-itemset {A B C D} are A,B,C ⇒ D, or A,B ⇒ C,D, or A,C ⇒ B,D, and so on.
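The straightforward itemset-to-rules transformation can be sketched as an enumeration: every way of splitting the itemset into a non-empty antecedent P and non-empty consequent Q with P ∩ Q = ∅ yields one candidate rule (a real miner would then filter these by confidence):

```python
from itertools import combinations

def rules_from_itemset(itemset):
    """Enumerate all candidate rules P => Q from one large itemset."""
    items = sorted(itemset)
    rules = []
    for r in range(1, len(items)):           # antecedent sizes 1..k-1
        for antecedent in combinations(items, r):
            consequent = tuple(i for i in items if i not in antecedent)
            rules.append((antecedent, consequent))
    return rules

rules = rules_from_itemset({"A", "B", "C", "D"})
# e.g. (("A", "B", "C"), ("D",)) stands for the rule A,B,C => D
```

A k-itemset yields 2^k - 2 candidate rules, which is why there is no one-to-one correspondence between large itemsets and association rules.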
Apriori discovers large k-itemsets for all k such that the number of large k-itemsets is greater than zero. The complete description of the algorithm is beyond the scope of this chapter, but the basic ideas can be easily understood from an example. Figure 3 shows a simple database. If the required support factor is s = 50%, then each large k-itemset must be present in at least 2 records. Large 1-itemsets are {A}, {B}, {C}, and {E} (but not {D}, since it exists only in the record with ID=10). Large 2-itemsets are {A C}, {B C}, {B E}, and {C E} (but not {A B}, {A D}, and so on, for similar reasons). There is only one large 3-itemset, {B C E}.

ID      elements
10      A C D
20      B C E
30      A B C E
40      B E

Figure 3 - An example database (as in [8])
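The mechanics just described can be sketched in a few lines of Python. This is a simplified, illustrative implementation, not the optimized original algorithm; the function name `apriori` and the set-based data representation are choices made here for clarity:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Discover all large k-itemsets whose support is at least min_support
    (an absolute record count here, e.g. 2 records for s = 50% of 4 records)."""
    def support(itemset):
        # number of records that contain the whole itemset
        return sum(1 for t in transactions if itemset <= t)

    items = {i for t in transactions for i in t}
    result = {1: sorted((fs for fs in (frozenset([i]) for i in items)
                         if support(fs) >= min_support), key=sorted)}
    k = 1
    while result[k]:
        # candidate generation: join pairs of large k-itemsets into (k+1)-itemsets
        candidates = {a | b for a in result[k] for b in result[k]
                      if len(a | b) == k + 1}
        # Apriori pruning: every k-subset of a candidate must itself be large
        candidates = {c for c in candidates
                      if all(frozenset(s) in result[k] for s in combinations(c, k))}
        result[k + 1] = sorted((c for c in candidates if support(c) >= min_support),
                               key=sorted)
        k += 1
    return {j: v for j, v in result.items() if v}

# The database from Figure 3; s = 50% of 4 records gives min_support = 2
db = [{'A', 'C', 'D'}, {'B', 'C', 'E'}, {'A', 'B', 'C', 'E'}, {'B', 'E'}]
large = apriori(db, min_support=2)
# large[2] holds {A,C}, {B,C}, {B,E}, {C,E}; large[3] holds only {B,C,E}
```

Note how the pruning step embodies the key Apriori observation: a (k+1)-itemset can only be large if all of its k-subsets are large, which drastically reduces the number of candidates whose support must be counted.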
There are numerous variants, modifications and improvements of the original Apriori algorithm. One of the best known is proposed within the DHP algorithm (Dynamic Hashing and Pruning), which uses hash functions and hash tables in order to increase the efficiency of discovering large k-itemsets [8], [23]. Another interesting improvement is illustrated in Figure 4. It is concerned with the idea of introducing different levels of abstraction in the process of discovering association rules. Recall that mining association rules is often used to discover shopping patterns from shopping data records. If the idea is to discover what articles people often buy together in newsstands, a discovered rule might be:

buy NBA News ⇒ buy Camel cigarettes

However, it is highly likely that such a rule would have a small support factor. On the other hand, the rule:

buy sport-related newspaper ⇒ buy strong cigarettes

would probably have a larger support. Finally, the rule:

buy newspaper ⇒ buy cigarettes

would have even larger support. It all comes from different levels of abstraction. In general, associations on higher levels of abstraction are stronger but less concrete. Moreover, associations are often easy to see only on higher levels of abstraction. Hence it makes sense to explore lower levels of abstraction only if the associations at the higher levels are strong enough. Different thresholds for the support factor can be used at different levels of abstraction: the lower the level of abstraction, the more concrete the entities considered and the lower the threshold to be adopted.

News stand article
├── Newspapers
│   ├── Musical newspapers: …
│   └── Sport newspapers: NBA News, Champ, …
└── Cigarettes
    ├── Mild: HB, …
    └── Strong: Camel, …
Figure 4 - An example of different levels of abstraction
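One simple way to support such multi-level mining is to generalize each data record through a taxonomy before applying Apriori at a given level, and only drill down where high-level associations are strong. A minimal sketch follows; the taxonomy table and the function name `generalize` are hypothetical, chosen here to mirror Figure 4:

```python
# Hypothetical taxonomy: concrete article -> higher abstraction level (as in Figure 4)
TAXONOMY = {
    'NBA News': 'sport newspaper',
    'Champ': 'sport newspaper',
    'Camel': 'strong cigarettes',
    'HB': 'mild cigarettes',
}

def generalize(record, taxonomy):
    """Lift every item in a shopping record to its parent abstraction level;
    items without a parent category are kept as they are."""
    return {taxonomy.get(item, item) for item in record}

# A concrete record generalizes to {'sport newspaper', 'strong cigarettes'},
# so it contributes support to the higher-level rule rather than only to
# the low-support rule about the specific brands.
generalized = generalize({'NBA News', 'Camel'}, TAXONOMY)
```

Because each high-level item aggregates the support of all concrete items beneath it, a higher support threshold can be applied at the higher level, matching the threshold scheme described above.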
Using Domain Knowledge in KDD Along with raw data, additional information is also useful in any KDD process, such as information about the data themselves, links between them, their meanings, and their constraints. The data dictionary contains parts of that information: links between data and the internal organization of data within the database. In order
to use the other parts of that information in a KDD process, it is necessary to include domain knowledge in the process (see Figure 1). Domain knowledge in the KDD process is the knowledge about the contents of data in the database, the application domain, and the context and goals of the KDD process. It facilitates focusing of activities in all phases of the process from selection to DM, focusing of the search process (see Figure 1), and more accurate evaluation of discovered knowledge. However, users should be aware that search focusing can result in failing to discover some useful knowledge that happens to lie outside the focus. KDD systems today typically use domain knowledge in the forms of data dictionaries, field correlations (e.g., body height and weight are correlated), heuristic rules, and procedures and functions. It is usually a domain expert who provides domain knowledge, but it is also possible to discover it through KDD. In other words, newly discovered knowledge is also a part of domain knowledge. Moreover, it is sometimes possible to represent that newly discovered knowledge in a suitable machine-readable form (e.g., rules) and use it as input to other programs, such as an expert system or a query optimizer. Using domain knowledge in KDD enables data set reduction in performing search activities in the DM phase of the process, hypothesis optimization in various DM tasks, query optimization, and testing validity and accuracy of knowledge discovered in the KDD process [22]. For a simple example, if the task is to discover from a database of medical records whether a certain drug has effects on pregnant patients, domain knowledge can reduce the data set by using the fact that only records of female patients within some age boundaries should be searched. Similarly, domain knowledge can help eliminate redundant clauses from the hypothesis stated as the following rule (and thereby optimize the hypothesis):

IF      age >= 30                    AND
        gender = FEMALE              AND     // redundant!
        pregnancy = YES              AND
        …
THEN    effect of drug X = POSITIVE
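Eliminating such redundant clauses can be mechanized once domain rules are available as implications (here, pregnancy = YES implies gender = FEMALE). The dictionary-based encoding of clauses and rules below is an assumption made for illustration, not the representation used in [22]:

```python
# Domain rules as implications: if all antecedent clauses hold in the
# hypothesis, the consequent clauses hold too and are therefore redundant.
DOMAIN_RULES = [({'pregnancy': 'YES'}, {'gender': 'FEMALE'})]

def drop_redundant_clauses(hypothesis, rules):
    """Remove hypothesis clauses that are implied by its other clauses."""
    redundant = set()
    for antecedent, consequent in rules:
        if all(hypothesis.get(attr) == val for attr, val in antecedent.items()):
            redundant.update(attr for attr, val in consequent.items()
                             if hypothesis.get(attr) == val)
    return {attr: val for attr, val in hypothesis.items() if attr not in redundant}

hypothesis = {'age': '>= 30', 'gender': 'FEMALE', 'pregnancy': 'YES'}
optimized = drop_redundant_clauses(hypothesis, DOMAIN_RULES)
# the redundant gender clause is gone: {'age': '>= 30', 'pregnancy': 'YES'}
```

The same mechanism scales to chains of implications if the function is applied repeatedly until the hypothesis stops shrinking.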
The way domain knowledge helps in query optimization is easy to see from the following example, which uses the fact that search hypotheses are usually easy to convert directly into a query:

IF      age >= 70                            AND
        physical condition = BAD             AND
        therapy for other diseases = YES     AND
        insurance = NO
THEN    operation = NO                               // a hypothesis

SELECT  age, physical condition, therapy for other diseases, insurance
FROM    Patients, Operations
WHERE   operation = NO AND
        Patients.P# = Operations.P#                  // the corresponding query
The meaning of the above IF-THEN hypothesis is, obviously, that operation is not suggested for patients above a certain age limit, in a bad physical condition, who suffer from other diseases, and have no adequate insurance coverage. The above query corresponds to the hypothesis directly. However, if a domain rule built into the system says "hemophilia = YES ⇒ operation = NO", then an optimized query might be [22]:

SELECT  age, physical condition, therapy for other diseases, insurance
FROM    Patients
WHERE   hemophilia = YES
When testing validity and accuracy of knowledge discovered in the KDD process, it might happen that the knowledge contains redundant facts and contradictions. Redundancies are not a serious problem, but contradictions are. Testing discovered knowledge by applying valid and accurate domain knowledge can help determine whether contradictory patterns are really contradictory, or just appear to be. For example, a KDD system may discover that in, say, 1998 the increase in beverage sales in the place X was 2%, while in 1999 it was 20%. At first glance, this may appear to be contradictory. However, domain knowledge may tell the user that the place X is actually a tiny village in which the first store selling beverages opened in early 1999, hence the significant difference in percentages. On the other hand, appropriate care should be taken when drawing conclusions about the validity of discovered knowledge, since that knowledge may be correct, while the domain knowledge used in the process may be out of date and inaccurate.
Pattern Evaluation In order to express how important a discovered pattern is for the user, various measures of a pattern's interestingness have been proposed. They combine the pattern's validity (certainty), novelty, usability and simplicity to specify the pattern's interestingness quantitatively. A pattern's validity is usually a number (on
some prespecified scale) showing to what extent the discovered pattern holds for the data in the database. Novelty shows quantitatively how new the pattern is to the user. Usability describes how and to what degree the user can use the discovered pattern. Simplicity is best understood from the following example: if the pattern is represented as an If-Then rule, then the fewer the If- and Then-clauses in the rule, the simpler the pattern. Simple patterns are easier to interpret and use than complicated ones. Some measures are objective. They show the pattern's accuracy and precision, and rely on applying statistical indicators (e.g., support and certainty factors), probability theory, and/or domain knowledge when evaluating a pattern's interestingness. For example, a KDD system using the summarization DM task may discover the following two patterns: "sales increase in region 1 is 5%" and "sales increase in region 2 is 50%". If domain knowledge contains the fact that 1,000,000 people live in region 1 and only 15,000 in region 2, the system will evaluate the first pattern to be more important than the second one. There are also subjective measures of interestingness. They are often heuristic and depend on the user, hence can be different for different users. Fortunately, all discovered patterns have some important features that help create subjective measures of interestingness in spite of the fact that different users will use them with different subjective bias. As an illustration, consider the measure proposed by Silberschatz [25]. It relies on the pattern's unexpectedness and actionability. Unexpectedness means that a pattern is interesting and important if it surprises the user. For example, if the sales increase is more-or-less equal in all regions but one, that might be a surprise for the user who may have expected an equal increase in all the regions considered.
Actionability is the possibility to translate a discovered pattern into an appropriate action, i.e. to "do something" when the pattern is discovered. It is directly related to the usability feature described above. After discovering an exceptionally low increase of sales in a certain region, the user can undertake, for example, some marketing actions to improve sales in that region. Both unexpectedness and actionability are subjective in nature. While actionability is extremely difficult to quantify, there are ways to quantify unexpectedness. An important assumption here is that unexpected patterns are also actionable. If that is so, then measuring a pattern's unexpectedness alone is sufficient to measure its interestingness. One way to measure a pattern's unexpectedness is to describe and apply the user's system of beliefs, since unexpected patterns contradict the user's beliefs. If the system of beliefs is represented as a set B of logical expressions bi, then the conditional probability d(bi|E) can denote the strength of a belief bi, based on some previous evidence E. Strong beliefs have a high value of d(bi|E). A pattern that contradicts such beliefs is unexpected and is always interesting. Variable beliefs have lower values of d(bi|E). The higher the effect of a pattern on variable beliefs, the more interesting the pattern is to the user.
Different beliefs have different importance. If bi is a belief in a system of beliefs B, let wi denote the importance (weight) of bi in B. Let also:
∑bi∈B wi = 1
Then, according to [25], the interestingness of pattern p for the system of beliefs B is the sum:
I(p, B, E) = ∑bi∈B wi · |d(bi|p, E) − d(bi|E)|
The interpretation of the above sum is as follows. If we assume that each belief bi can change upon the discovery of p, then the change of belief bi can be measured as the absolute value of the difference between the strengths of the belief before and after the discovery of p. The importance of the change for the system of beliefs B can be higher or lower, hence the difference is multiplied by wi. Then the sum I(p,B,E) actually shows how much the entire system of beliefs changes when the pattern p is discovered, hence it is a good measure for evaluating the pattern p. Intuitively, adding new data to a database can result in changes of some beliefs about the data in the database. It can be shown from the above expression for computing I(p,B,E) that for the old data D and the newly added ∆D it holds: (d(bi|∆D,D) ≠ d(bi|D)) ⇒ ∃p, I(p,B,E) ≠ 0 This means that when new data are added to a database, the strengths of beliefs d(bi|D) in the user's system of beliefs generally change. If the change is above a predefined threshold, it makes sense to start a search for new patterns. That process is called belief-driven DM.
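Silberschatz's measure is straightforward to compute once belief strengths before and after the discovery of a pattern are available. The following small sketch illustrates it; the function and variable names are choices made here, not part of [25]:

```python
def interestingness(strength_after, strength_before, weights):
    """I(p, B, E) = sum over beliefs bi of wi * |d(bi|p,E) - d(bi|E)|  [25].

    strength_after[b]  = d(b|p,E), belief strength once pattern p is known
    strength_before[b] = d(b|E),   belief strength on prior evidence alone
    weights[b]         = wi, importance of belief b; the weights sum to 1
    """
    assert abs(sum(weights.values()) - 1.0) < 1e-9
    return sum(w * abs(strength_after[b] - strength_before[b])
               for b, w in weights.items())

# A pattern that leaves a strong belief b1 intact but sharply weakens
# a variable belief b2:
before = {'b1': 0.9, 'b2': 0.5}
after = {'b1': 0.9, 'b2': 0.1}
weights = {'b1': 0.7, 'b2': 0.3}
score = interestingness(after, before, weights)  # 0.3 * |0.1 - 0.5| = 0.12
```

A pattern that changes no beliefs scores 0; the larger the weighted shift in the belief system, the more interesting the pattern, exactly as the interpretation above states.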
Systems and Applications KDD/DM systems and various systems that support the KDD process have been developed intensively since the mid-1990s. Table 1 shows some important application domains and typical problems that KDD/DM systems are used for in these domains [11], [13], [20], [21], [24]. The following brief descriptions illustrate the variety of current KDD/DM systems and applications. Recon. Lockheed Martin Research Center has developed Recon, a KDD/DM-based stock-selection advisor [18]. It analyzes a database of over 1000 of the most successful companies in the USA, where data are stored for each quarter from 1987 to date. For each company in the database the data change historically, and each data record in the database contains over 1000 fields, such as the stock price for that company, the trend of price and profit change, the number of analysts monitoring the company's business, etc. Recon predicts return on investment three months after buying some stock and recommends
whether to buy the stock or not. The results of the analysis are patterns in the form of rules that classify stock as exceptional and non-exceptional. Recon recommends buying the stock if its return on investment can be classified as exceptional. Since the users are financial experts, not computer experts, Recon has a rich, easy-to-use graphical user interface. Pattern evaluation is done interactively, letting the users analyze and manually fine-tune the rules detected in the analysis.

Domain            Problem
Medicine          Discovering side-effects of drugs, genetic sequence analysis, treatment costs analysis
Finance           Credit/loan approval, bankruptcy prediction, stock market analysis, fraud detection, unauthorized account access detection, investment selection
Social sciences   Demographic data analysis, prediction of election results, voting trends analysis
Astronomy         Analysis of satellite images
Law               Tax fraud detection, identification of stolen cars, fingerprint recognition
Marketing         Sales prediction, identification of consumer and product groups, frequent-flyer patterns
Engineering       Discovering patterns in VLSI circuit layouts, predicting aircraft-component failures
Insurance         Detection of extremely high or chained claims
Agriculture       Classification of plant diseases
Publishing        Discovering reader profiles for special issues of journals and magazines

Table 1 - KDD/DM application domains and typical problems
CiteSeer. There are huge amounts of scientific literature on the Web. In that sense, the Web makes up a massive, noisy, disorganized, and ever-growing database. It is not easy to quickly find useful and relevant literature in such a database. CiteSeer is a custom-digital-library generator that performs information filtering and KDD functions that keep users up-to-date on relevant research [5]. It downloads Web publications in a general research area (e.g., computer architectures), creates a specific digital library, extracts features from each source in the library, and automatically discovers those publications that match the user's needs. JAM. The growing number of credit card transactions on the Internet increases the risks of credit card numbers being stolen and subsequently used to commit fraud. The JAM (Java Agents for Metalearning) system provides distributed DM capabilities to analyze huge numbers of credit card transactions processed on the Internet each day, in search of fraudulent transactions [7].
Many more transactions are legitimate than fraudulent. JAM divides a large data set of labeled transactions (either fraudulent or legitimate) into smaller subsets and applies DM techniques to generate classifiers, and subsequently a metaclassifier. It uses a variety of learning algorithms in generating the classifiers. Subdue. It is possible to search graphs and discover common substructures in them. That is the basis of Subdue, a system developed to perform substructure-discovery DM on various databases represented as graphs [10]. Subdue performs two key DM techniques, called unsupervised pattern discovery and supervised concept learning from examples. The system has been successfully used in discovering substructures in CAD circuit layouts, structures of some protein categories, program source code, and aviation data. Experts in the corresponding domains ranked all of the substructures discovered by Subdue as highly useful. Advanced Scout. The purpose of the KDD/DM system Advanced Scout is discovering patterns in basketball games [4]. It helps coaches of NBA basketball teams analyze the decisions they make during games. Advanced Scout lets a coach select, filter, and transform data from a common database of all the games in the current NBA season, a database that is accessible to all NBA coaches. The database contains huge amounts of data per game, all time-stamped, showing everything about individual players, their roles, shooting scores, blocks, rebounds, etc. The coach uses his domain knowledge (of the game and of his team) and can ask Advanced Scout DM-queries about shooting performance, optimal playsets, and so on. An example of the patterns that Advanced Scout discovers is "When Price was the guard, Williams' shooting score was 100%". Video records of each game facilitate the interactive pattern evaluation process.
Research Issues Recent research efforts and trends in KDD/DM roughly belong to the following avenues: • Improving efficiency of mining association rules. Apriori remains the fundamental algorithm for mining association rules, but its efficiency is under constant further investigation. Apart from the already mentioned DHP improvement [8], [23], the most notable other attempts to increase Apriori's efficiency are the CD, DD, IDD and HD algorithms that parallelize Apriori for execution by multiple processors [16]. Zaki has proposed several new algorithms based on organizing items into a subset lattice search space, which is decomposed into small independent chunks or sublattices that can be solved in memory [31]. • Mining object-oriented databases. One general criticism of KDD/DM technology is that until very recently it has been applicable only to
mining for patterns in relational databases. Knowledge discovery in object-oriented databases has just started to emerge, and there are still many open research problems in that direction. An example of research attempts in that direction is trying to mine for structural similarities in a collection of objects [29].
• Mining multimedia databases. This is another research and application area of KDD and DM in which research has only started. Multimedia databases are structurally much different from relational databases, but also much richer in terms of the information they contain. KDD/DM challenges in multimedia databases include mining correlations between segments of text and images, between speech and images or faces, finding common patterns in multiple images, and the like [17].
• Mining distributed and heterogeneous databases. Currently, KDD/DM tools work mainly on centralized databases. This is an important limitation, given the fact that many applications work by combining raw data from multiple databases in distributed and heterogeneous environments. The question is how to apply KDD/DM tools to distributed, heterogeneous databases. One possible way is to first apply the KDD process on each database and generate partial patterns, and then use either a central coordinator or various other tools to integrate these patterns and generate complete results [28]. The various tools have to communicate with each other to provide a complete picture. Another idea is to first integrate multiple databases by creating a data warehouse, and then mine the warehouse for patterns using a centralized KDD/DM tool [27]. In that case, KDD/DM techniques can be used for mining integration knowledge. Examples of such knowledge include rules for comparing two objects from different databases in order to determine whether they represent the same real-world entity. Another example would be rules to determine the attribute values in the integrated database.
• Text mining. Mining textual databases is tightly coupled with natural language processing, hence it is inherently very hard to fully automate. It requires large amounts of formally codified knowledge (knowledge about words, parts of speech, grammar, word meanings, phonetics, and text structure), and about the world itself [19]. However, interesting though simple applications are still possible in this area. For example, any online English text can be treated as a database containing a large collection of article-usage or phrase-structuring (parsing) knowledge. It is possible to automatically extract such kinds of knowledge about natural languages by text mining. Another important goal of text mining is automatic classification of electronic documents and contents-based association between them [30].
• Using discovered knowledge in subsequent iterations of the KDD process. This is another shortcoming of current KDD/DM technology
that has to be investigated further, in spite of some solutions that have been proposed [10], [28]. Automating the use of discovered knowledge is not easy and requires translations, since the form used to represent discovered knowledge is usually not suitable for storing and processing the same knowledge by the same tool.
• Knowledge discovery in semistructured data. Semistructured data are data that are structured similarly, but not identically. Traditional data mining frameworks are inapplicable to semistructured data. On the other hand, semistructured data still have some structural features. For example, different person-related records and objects may have different structures and fields (attributes), but most of them will have a Name field, an Address field, and so on. Such structural features of semistructured data make it possible to extract knowledge of common substructures in data. Wang and Liu have proposed a solution for this discovery task based on the idea of finding typical (sub)structures that occur in a minimum number of observed objects [29]. Discovery of common substructures in graphs, such as in the above-mentioned Subdue system, is another example of this kind of discovery task [10].
• Web mining. This is a KDD/DM-related issue that has attracted much attention in recent years [5], [6]. Although the World Wide Web is not a database in the classical sense, it is itself an enormously large resource of data. It poses a great challenge to KDD/DM system developers, since data on the Web are highly unstructured and stored in a number of different formats. However, some minimum structure still exists, represented by various linguistic and typographic conventions, standard HTML tags, search engines, indices and directories, as well as many classes of semi-structured documents, such as technical papers and product catalogues.
This minimum structure creates realistic opportunities for Web mining through resource discovery, information extraction from discovered resources, and pattern discovery on certain locations and groups of locations on the Web. In that sense, Web mining is related to mining semistructured data.
Discussion and Summary Building a KDD/DM application in practice requires a lot of effort. To an extent, using an integrated software environment for developing KDD/DM applications can reduce this effort. Such environments usually have a number of built-in techniques that facilitate going through all phases of the KDD process, and also support several DM tasks and algorithms. Moreover, some of them can even be downloaded for free from the Internet (see the first two sites mentioned in the final section). Still, KDD/DM systems are expensive to develop and it is risky to start such a project before conducting a detailed feasibility study. Generally, if data can be analyzed manually or just by applying some statistical techniques, then developing a KDD/DM system may be too
expensive. There must be at least an expectation in the organization that the KDD/DM system will bring some positive financial effects, otherwise organizational support in developing the system is likely to be missing [13]. Likewise, the amount of available data must be large enough and domain knowledge must be available in some form. The likelihood of success of a KDD/DM application is higher if the available data are noise-free and if at least some domain knowledge already exists (KDD will upgrade it!). In spite of many open problems, the number of KDD/DM applications grows rapidly and the application domains constantly multiply. Perhaps the most important driving forces for future KDD/DM applications are the Internet and the Web. KDD and DM over the Internet will most likely become a reality once Internet technology, services, protocols, and management get improved through ongoing projects such as Internet 2 and the Next Generation Internet (NGI).
KDD and DM on the Web There are many KDD and DM resources on the Web. The short list of URLs shown below has been composed according to the following criteria:
• the number of useful links following from that URL
• how comprehensive the site is
• how interesting the URL is for researchers
As for the first two criteria, the KD Mine site is the most useful starting point for everyone who wants to mine himself/herself for KDD/DM resources. Knowledge Discovery Mine is also a very useful general KDD/DM site. Researchers would probably be most interested in what's new in the field. In that sense, the two KDD Conferences sites shown are only two in a series of such sites dedicated to the most important annual KDD/DM conference. The most relevant dedicated journal in the field, launched in 1997, is Data Mining and Knowledge Discovery (the last URL shown).
1. KD Mine: http://www.kdnuggets.com/
2. Knowledge Discovery Mine: http://info.gte.com/~kdd/
3. KDD Conferences: http://www-aig.jpl.nasa.gov/kdd98/ and http://research.microsoft.com/datamine/kdd99/
4. Data Mining and Knowledge Discovery (journal): www.research.microsoft.com/research/datamine/
References
1. Adriaans, P., Zantinge, D., Data Mining. Reading: Addison-Wesley, 1996.
2. Agrawal, R., et al., Fast discovery of association rules. In: Fayyad, U.M., et al. (Eds.), Advances in Knowledge Discovery and Data Mining. Menlo Park: AAAI/MIT Press, 1996, pp. 307-328.
3. Anand, S.S., Scotney, B.W., Tan, M.G., McClean, S.I., Bell, D.A., Hughes, J.G., Magill, I.C., Designing a Kernel for Data Mining. IEEE Expert 12, March-April 1997, pp. 65-74.
4. Bhandari, I., Colet, E., Parker, J., Pines, Z., Pratap, R., Ramanujam, K., Advanced Scout: data mining and knowledge discovery in NBA data. Data Mining and Knowledge Discovery 1, 1997, pp. 121-125.
5. Bollacker, K.D., Lawrence, S., Lee Giles, C., Discovering relevant scientific literature on the Web. IEEE Intelligent Systems 15, March-April 2000, pp. 42-47.
6. Chakrabarti, S., et al., Mining the Web's Link Structure. IEEE Computer, August 1999, pp. 50-57.
7. Chan, P.K., Fan, W., Prodromidis, A., Stolfo, S.J., Distributed data mining in credit card fraud detection. IEEE Intelligent Systems 14, November-December 1999, pp. 67-74.
8. Chen, M.-S., Han, J., Yu, P.S., Data mining: an overview from a database perspective. IEEE Transactions on Knowledge and Data Engineering 8, 1996, pp. 866-883.
9. Chen, M.-S., Park, J.S., Yu, P.S., Efficient data mining for path traversal patterns. IEEE Transactions on Knowledge and Data Engineering 10, 1998, pp. 209-221.
10. Cook, D.J., Holder, L.B., Graph-based data mining. IEEE Intelligent Systems 15, March-April 2000, pp. 32-41.
11. Fayyad, U.M., Data mining and knowledge discovery: making sense out of data. IEEE Expert 11, October 1996, pp. 20-25.
12. Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., The KDD process for extracting useful knowledge from volumes of data. Communications of the ACM 39, November 1996, pp. 27-34.
13. Frawley, W.J., Piatetsky-Shapiro, G., Matheus, C.J., Knowledge discovery in databases: an overview. AI Magazine, Fall 1992, pp. 57-70.
14. Ganti, V., Gehrke, J., Ramakrishnan, R., Mining very large databases. IEEE Computer 32, August 1999, pp. 38-45.
15. Glymour, C., Madigan, D., Pregibon, D., Smyth, P., Statistical themes and lessons for data mining. Data Mining and Knowledge Discovery 1, 1997, pp. 11-28.
16. Han, E.-H., Karypis, G., Kumar, V., Scalable Parallel Data Mining for Association Rules. IEEE Transactions on Knowledge and Data Engineering 12, 2000, pp. 337-352.
17. Hauptmann, A.G., Integrating and Using Large Databases of Text, Images, Video, and Audio. IEEE Intelligent Systems 14, 1999, pp. 34-35.
18. John, G.H., Miller, P., Kerber, R., Stock selection using rule induction. IEEE Expert 11, October 1996, pp. 52-58.
19. Knight, K., Mining Online Text. Communications of the ACM 42, November 1999, pp. 58-61.
20. Matheus, C.J., Chan, P.C., Piatetsky-Shapiro, G., Systems for knowledge discovery in databases. IEEE Transactions on Knowledge and Data Engineering 5, 1993, pp. 903-913.
21. Munakata, T., Knowledge discovery. Communications of the ACM 42, November 1999, pp. 26-29.
22. Owrang, M.M., Grupe, F.H., Using domain knowledge to guide database knowledge discovery. Expert Systems with Applications 10, 1996, pp. 173-180.
23. Park, J.S., Chen, M.-S., Yu, P.S., Using a hash-based method with transaction trimming for mining association rules. IEEE Transactions on Knowledge and Data Engineering 9, 1997, pp. 813-825.
24. Ramakrishnan, N., Grama, A.Y., Data mining: from serendipity to science. IEEE Computer 32, August 1999, pp. 34-37.
25. Silberschatz, A., What makes patterns interesting in knowledge discovery systems. IEEE Transactions on Knowledge and Data Engineering 8, 1996, pp. 970-974.
26. Simoudis, E., Reality check for data mining. IEEE Expert 11, October 1996, pp. 26-33.
27. Srivastava, J., Chen, P.-Y., Warehouse Creation - A Potential Roadblock to Data Warehousing. IEEE Transactions on Knowledge and Data Engineering 11, 1999, pp. 118-126.
28. Thuraisingham, B., A Primer for Understanding and Applying Data Mining. IEEE IT Professional, January/February 2000, pp. 28-31.
29. Wang, K., Liu, H., Discovering Structural Association of Semistructured Data. IEEE Transactions on Knowledge and Data Engineering 12, 2000, pp. 353-371.
30. Weiss, S.M., et al., Maximizing Text-Mining Performance. IEEE Intelligent Systems 14, July/August 1999, pp. 63-69.
31. Zaki, M.J., Scalable Algorithms for Association Mining. IEEE Transactions on Knowledge and Data Engineering 12, 2000, pp. 372-390.