Proceedings of the Second Indian International Conference on Artificial Intelligence (IICAI’05), 2444-2454, 2005.
On Interactive Data Mining
Yan Zhao and Yiyu Yao
Department of Computer Science, University of Regina
Regina, Saskatchewan, Canada S4S 0A2
E-mail: {yanzhao, yyao}@cs.uregina.ca
Abstract. While many data mining models concentrate on automation and efficiency, interactive data mining models should focus on adaptive and effective communications between human users and computer systems. The crucial point is not how intelligent users are, or how efficient systems are, but how well these two parts can be connected, adapted, understood and trusted. Some fundamental issues including processes and forms of interactive data mining, roles, requirements, as well as complexities of interactive data mining systems are discussed in this paper.
1
Introduction
Exploring and extracting knowledge from data has been one of the fundamental problems in science. Many methods have been proposed and extensively studied, in areas such as database management, statistics and machine learning. Yao et al. observed that data mining and general scientific research have much in common in terms of their goals, tasks, processes and methodologies [19]. In particular, data mining takes up many important tasks, such as the description, prediction and explanation of data. Without the aid of computer systems, it is very difficult for people to be aware of, extract, memorize, search and retrieve knowledge in large and separate datasets, to interpret and evaluate data and information that are constantly changing, and to make recommendations or predictions in the face of inconsistent and incomplete data. Data mining is therefore characterized by the application of computer technologies to carry out nontrivial calculations. Computer systems can maintain precise operations under heavy information load, and maintain steady performance. The implementations and applications of computer systems reflect the requests of human users, and contain certain human heuristics. Computer systems must rely on human users to set the goals, select alternatives if the original approach fails, participate in unanticipated emergencies and novel situations, and develop innovations in order to preserve safety, to avoid expensive failure, or to increase product quality [5, 8, 13]. It is important to note that users possess various skills, intelligence, cognitive styles, frustration tolerances and other mental abilities. They come to a problem with various preferences, requirements and background knowledge. Given a set of data, every user may try to make sense of the data by seeing it from different angles, in different aspects, and under different views. Because of these differences, data mining methods and results are not unique. There does not
exist a universally applicable theory or method to serve the needs of all users. This motivates and justifies the co-existence of many theories and methods for data mining systems, as well as the exploration of new theories and methods. Based on the above observations, we believe that interactive systems are required for data mining tasks. Generally, an interactive system is an integration of a human user and a computer. They can communicate and exchange information and knowledge. A foundation of human-computer interaction is cognitive informatics (CI). According to Wang [14], CI attempts to solve problems in two connected areas: “One, uses computing techniques to investigate cognitive science problems, such as memory, learning, and thinking; two, uses cognitive theories to investigate informatics, computing, and software engineering.” The importance of human-machine interaction has been well recognized and studied in many disciplines. An example of an interactive system is an information retrieval system or a search engine. A search engine connects users to Web resources. It navigates, searches, stores and indexes the resources, responds to a user’s particular queries, and ranks and provides the most relevant results for each query. Most of the time, it is the user who initiates the interaction with a query. Sometimes, one of the results can arouse the user’s special interest, lead the user to refine the query, and make the user change or adjust the further interaction. Without this connection, it would be hard, if not impossible, for the user to access these resources, no matter how important and relevant they are. The search engine, as an interactive system, combines the power of the user with that of the useful resources, and generates new power. Though human-machine interaction has been emphasized in many disciplines, it did not receive enough attention in the domain of data mining until recently [1, 3, 7, 20].
We observe two general problems in many existing data mining systems:
1. An overemphasis on the automation and efficiency of the system, and a neglect of its adaptiveness and effectiveness. Here, effectiveness includes the human’s subjective understanding, interpretation and evaluation.
2. A lack of explanations and interpretations of the discovered knowledge, which may lie outside of the dataset and can only be obtained through human-machine interaction. The human role in the data mining processes does not receive its due attention.
To study and implement an interactive data mining system, we need to pay more attention to the connection between human users and computers. Wang suggested the relational metaphor for cognitive science, which assumes that the relations and connections of neurons, rather than the neurons themselves, represent information and knowledge in the human brain [15]. Berners-Lee explicitly stated that “in an extreme view, the world can be seen as only connections, nothing else” [2], a view on which the World Wide Web is designed and implemented. Following the same way of thinking, we believe that interactive data mining is sensitive to the capacities and needs of both humans and machines. A critical issue is not how intelligent the user is, or how efficient the algorithm is, but how well these two parts can be connected, communicate, adapt, and be stimulated and improved.
Through interaction and communication, computers and users can divide the labour in order to achieve a good balance of automation and human control. Computers are used to retrieve and keep track of large volumes of data, and to carry out complex mathematical or logical operations. Users can then avoid routine, tedious, and error-prone tasks, concentrate on critical decisions and planning, and cope with unexpected situations [5, 13]. Moreover, interactive data mining can encourage users’ learning, improve their insight into and understanding of the domain, and stimulate users to explore creative possibilities. Users’ feedback can in turn be used to improve the system. The interaction is mutually beneficial. In this paper, we focus on some fundamental issues of interactive data mining. The processes of interactive data mining systems and the specific forms of interaction for each process are discussed in Section 2. The roles, requirements, and complexity issues of interactive data mining systems are considered in Section 3. Section 4 is the conclusion.
2
Interactive Data Mining
The goal of interactive system design is to integrate users’ background knowledge into the entire data mining process, and to give users sufficient information about current and historical status and activities.
2.1
Processes of interactive data mining
The entire knowledge discovery process includes data preparation, data selection and reduction, data preprocessing and transformation, pattern discovery, pattern evaluation and pattern representation [3, 6, 7, 9, 18, 19]. In an interactive system, these phases can be carried out as follows:
– Interactive data preparation visualizes the raw data in a specific format, so that the data distribution and some relationships between attributes can be easily observed.
– Interactive data selection and reduction involves reducing the number of attributes and the number of records. A user can specify the attributes and data area of interest, and remove the data that fall outside of that area.
– Interactive data preprocessing and transformation determines the number of intervals as well as the cut-points for continuous data, and transforms the dataset into a workable dataset.
– Interactive pattern discovery visualizes the preprocessed data and interactively constructs the patterns under the user’s guidance, monitoring and supervision.
– Interactive pattern evaluation evaluates the discovered patterns whenever the user wishes to. Their usefulness is subject to the user’s judgement.
– Interactive pattern representation visualizes the patterns that are perceived in the pattern discovery phase.
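The phases listed above, and their repetition until the user is satisfied, can be sketched as a simple loop. This is only a minimal illustration: the phase names come from the list, while `run_interactive_process`, the toy phase functions and the `is_satisfied` callback (which stands in for the user's judgement) are hypothetical names introduced here.

```python
# Phase names follow the list above; everything else is illustrative.
PHASES = [
    "data preparation",
    "data selection and reduction",
    "data preprocessing and transformation",
    "pattern discovery",
    "pattern evaluation",
    "pattern representation",
]


def run_interactive_process(data, phase_funcs, is_satisfied, max_rounds=10):
    """Repeat all phases until the user accepts the result or rounds run out."""
    result = None
    for round_no in range(1, max_rounds + 1):
        result = data
        for name in PHASES:
            # Each phase sees the previous phase's output and the round number,
            # which stands in for whatever the user changed since the last pass.
            result = phase_funcs[name](result, round_no)
        if is_satisfied(result):
            return result, round_no
    return result, max_rounds


def keep(x, round_no):
    """Placeholder phase that passes its input through unchanged."""
    return x


def select_rows(rows, round_no):
    """Hypothetical user action: widen the selected data area each round."""
    return rows[:round_no]


def discover_mean(rows, round_no):
    """Toy 'pattern': the mean of the selected values."""
    return {"mean": sum(rows) / len(rows)}


phase_funcs = {name: keep for name in PHASES}
phase_funcs["data selection and reduction"] = select_rows
phase_funcs["pattern discovery"] = discover_mean

pattern, rounds = run_interactive_process(
    [1, 2, 3, 4], phase_funcs, is_satisfied=lambda p: p["mean"] >= 2
)
print(pattern, rounds)
```

In a real system each phase would be driven by the user through the interface; the callback-based loop only shows where the user's decisions enter the process.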
Practice has shown that the process is virtually a loop, which is iterated until satisfactory results are obtained. Most of the existing interactive data mining systems add visual functionality to the process, which enables users to monitor the mining process at some stages, such as the raw data visualization and/or the final results visualization [3, 5, 7]. Graphical visualization makes it easy to identify and distinguish trends and distributions. However, it is a necessary feature for human-machine interaction, but not a sufficient one. To implement a good interactive data mining system, we need to study what kinds of interactions users expect, and what roles and responsibilities a computer system should take.
2.2
Forms of interaction
Users expect four kinds of interactions: navigation, information acquisition, manipulation, and evaluation and explanation. These interactions run through all of the data mining processes mentioned above, and lead to the desired mining results.
Information acquisition is the form of interaction that obtains information. Information might be presented in various fashions and structures. Raw data is raw information; mined rules are extracted knowledge. Various measurements show the information of an object from different aspects. Each mining process contains and generates a lot of information. An object might change, and the information it holds might be erased, updated or manipulated by the user in question. Users need to retrieve the information in an interactive manner, namely, “show it correctly when I want to see it, and in an understandable format.”
Navigation is moving through the world from place to place. From the standpoint of granular computing and hierarchy theory, objects at different levels are linked by an order relation in a hierarchy. A granule, a set of objects, at a higher level can be decomposed into many granules at a lower level, and, conversely, many granules at a lower level can be combined into granules at a higher level. A granule at a lower level provides a more detailed description than its parent granule at the higher level, and a granule at a higher level has a more abstract description than a child granule at the lower level. Navigation is a necessary interaction that works along with information acquisition. It provides convenient and explicit paths for moving from one granule to another, and from one level to another.
Manipulation is the form of interaction that includes selecting, retrieving, combining and changing objects, and using the resulting objects to obtain new objects. Different data mining processes require different kinds of manipulations.
Interactive manipulations oblige the computer system to provide the necessary cognitive support, such as [4, 10–12]:
– Exhaustive approach: using a systematic search for all possible solutions, which could be beyond human capacity.
– Algorithmic approach: using a well-defined solution, especially an established, recursive computational procedure for solving a problem in a finite number of steps.
– Heuristic approach: a selective search of a portion of the solution space, a subproblem of the whole problem, or a possible solution according to the user’s special needs.
– Analogy approach: using previous solutions to solve the existing problem.
Evaluation and explanation is the fourth form of interaction, which results in the decision to continue or discontinue the information acquisition, navigation, and/or manipulation. If the evaluation and explanation of some patterns are not satisfactory, the user may repeat the data mining steps on these patterns, or on the whole dataset. Basically, these four interactive forms have a general order, namely, navigation → information acquisition → manipulation → evaluation and explanation, shown as a loop in Figure 1 on top of a typical data mining model. This order is not strictly required, however, especially in an interactive environment. Iterations and skips are possible at any stage. The whole procedure is need-driven.
The interaction should be directed towards constructing a reasonable and meaningful cognitive structure for each user. To a novice, the constructive operation is the psychological paradigm in which one constructs his/her own mental model of a given domain; to an expert, the constructive operation is an experienced practice that involves anticipation, estimation, understanding, management and taming of the domain. It might be true that a novice prefers a well-defined routine or an algorithm to creating a new heuristic or inferring a new analogy, since novices lack insight into the domain and the ability to predict its behaviour, and easily become frustrated if the mining results are not what they expected. However, novices who favour a direct manipulation style, and who have a strong desire to be in control and to gain mastery over the system and domain, can use an interactive system very well.
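The navigation between granules described in this section, moving down to more detailed granules and back up to more abstract ones, can be illustrated with a small sketch. The class `Granule` and its methods `decompose`, `zoom_in` and `zoom_out` are hypothetical names chosen for this illustration, not part of any system described in the paper.

```python
class Granule:
    """A set of objects at one level of a hierarchy (illustrative only)."""

    def __init__(self, label, objects, parent=None):
        self.label = label
        self.objects = frozenset(objects)
        self.parent = parent
        self.children = {}

    def decompose(self, key):
        """Split this granule into child granules, one per value of key(obj)."""
        groups = {}
        for obj in self.objects:
            groups.setdefault(key(obj), set()).add(obj)
        self.children = {
            value: Granule(f"{self.label}/{value}", members, parent=self)
            for value, members in groups.items()
        }
        return self.children

    def zoom_in(self, value):
        """Move down to the more detailed child granule labelled by value."""
        return self.children[value]

    def zoom_out(self):
        """Move up to the more abstract parent granule (the root stays put)."""
        return self.parent if self.parent is not None else self


# Toy universe: the numbers 1..6, decomposed by parity.
root = Granule("all", range(1, 7))
root.decompose(lambda n: "even" if n % 2 == 0 else "odd")

even = root.zoom_in("even")
print(sorted(even.objects))      # members of the lower-level granule
print(even.zoom_out().label)     # back at the higher level
```

Because the children are built by grouping the parent's objects, they cover the parent and do not overlap, which matches the decomposition of a higher-level granule into lower-level granules.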
3
Interactive Data Mining Systems
The design of interactive data mining systems is shaped by the nature and forms of interaction. In this section, instead of discussing detailed technologies, we discuss the fundamental functionalities and requirements of the computer systems, as well as the possible complexities they have, which must be considered before real implementations and applications.
3.1
Roles of interactive computer systems
Human-machine interaction puts more requirements on computer systems. To fulfill the interactions in the above-mentioned processes and forms, the systems have to take at least five basic roles: storing, calculation, composition, inference, and presentation. Storing is one of the roles the system has to take. It stores not only raw data, intermediate results and mined patterns, but also the inventory of the well-defined routines, algorithms, and common and regular manipulations. The algorithms
[Figure 1, diagram not reproduced. Recoverable labels: the interaction forms navigation, information acquisition, manipulation, and evaluation and explanation; the processes data preparation, data selection, data preprocessing, pattern discovery, pattern evaluation and pattern representation; the intermediate results data, selected data, preprocessed data, patterns, evaluated patterns and knowledge.]
Fig. 1. Interactive data mining
should be compatible and consistent in a unified fashion, and allow users to select any of them to process the knowledge discovery task. The system also needs to “remember” the regular process a user prefers, in order to reduce human labour. The stored processes also enable the system to roll back to previous stages if the current stage shows increasing dissatisfaction.
Calculation is another basic role the system has to take. It refers to the use of comprehensive and procedurally complex mathematical and logical methods to determine a result. Analysis and decision making are based on accurate calculations. Calculations can be carried out on demand, whenever a user asks for the corresponding analysis, or computed in the background, ready for the user to query at any time.
The purpose of composition is to construct powerful manipulations by combining basic operations. Algorithms are defined by many basic operations in order to achieve a specific, especially a heuristic, goal. They are normally characterized by high efficiency and good performance, though they may lead to results that are not entirely relevant to the user. The composition role of interactive systems allows users to construct their own mental models from standard building blocks. The blocks can be connected by functions like the pipe command in UNIX systems, where the standard output of the command to the left of the pipe is sent as the standard input of the command to the right of the pipe. The result of composition is that users can define their own heuristics and algorithms.
Inference shares part of human decision making in an explicit or implicit manner. It is especially useful when the domain is complex and the search space
is huge. To achieve inference, the system needs to store an extra rule base (which usually serves as a standard or a reference) and be able to observe the current state. The inference function helps users to pay attention to things that are easily ignored, considered as “boundary” issues, or important but not in the current focus. The inference function takes the role and responsibility of a consultant. It ensures that the process develops in a more balanced manner.
Presentation is associated with the interface and, normally, visualization. The interaction can be conducted through a text-based interface, a graphical user interface, or other kinds of interface, such as speech recognition and/or speech synthesis. When the interface is well designed, it is comprehensible, predictable and controllable, and makes users feel competent, satisfied and responsible for their actions.
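The composition role described in this section, where standard blocks are chained so that the output of one becomes the input of the next, can be sketched in the spirit of a UNIX pipe. The helper `pipe` and the three example blocks are hypothetical names introduced for this illustration; they are not taken from any particular system.

```python
from collections import Counter
from functools import reduce


def pipe(*steps):
    """Chain basic operations left to right, like `a | b | c` in a shell."""
    def composed(value):
        # The output of each step is passed as the input of the next step.
        return reduce(lambda acc, step: step(acc), steps, value)
    return composed


# Three standard "blocks" a user might combine (illustrative only).
def select_over_30(rows):
    """Keep only the records whose age exceeds 30."""
    return [r for r in rows if r["age"] > 30]


def project_class(rows):
    """Keep only the class label of each record."""
    return [r["class"] for r in rows]


def count_values(values):
    """Count how often each value occurs."""
    return Counter(values)


records = [
    {"age": 25, "class": "no"},
    {"age": 40, "class": "yes"},
    {"age": 52, "class": "yes"},
    {"age": 35, "class": "no"},
]

# A user-defined manipulation built purely by composition.
class_distribution_over_30 = pipe(select_over_30, project_class, count_values)
print(class_distribution_over_30(records))
```

The design choice is that each block has a single input and a single output, so any block can be replaced or reordered by the user without touching the others.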
3.2
Requirements of interactive computer systems
Computer systems can be divided into two categories: deterministic and nondeterministic systems. A deterministic computer system is a system in which the output can be predicted with one hundred percent certainty. In a deterministic system, every action produces a reaction or effect, and every reaction, in turn, becomes the cause of subsequent reactions. The cascading events show exactly how the system will perform at any moment. A nondeterministic computer system is an active system that implements context-dependent and adaptive behaviours, which depend on the user’s intentions as formed by historical events and current rational or emotional goals. According to Wang [16], a deterministic system can be composed of well-defined routines and algorithms, while a nondeterministic system can be adaptive and/or autonomous. Wang’s classification is shown in Table 1. In our understanding, adaptation is to change an existing device or mechanism so that it becomes suitable for a new or special application or situation. An adaptive system should therefore prepare and store many choices that are easily tailored or recast into another form in response to a new user setting. Autonomy means the effective management of internal affairs and the external environment. Autonomy helps users to construct new devices, measures and mechanisms that are not yet stored in the system, alerts them to significant indexes or changes in the system, and takes actions accordingly.
Type of system   Event      Behaviour   Type of behaviour
Routine          Constant   Constant    Deterministic
Algorithmic      Variable   Constant    Deterministic
Adaptive         Constant   Variable    Nondeterministic
Autonomous       Variable   Variable    Nondeterministic
Table 1. Categories of computer systems, cited from [16]
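As a small illustration of how Table 1 can be read, the classification can be written as a lookup from the pair (event, behaviour) to the type of system and the type of behaviour. The names `Kind`, `SYSTEM_TYPES` and `classify_system` are assumptions made for this sketch; the four table entries themselves come from Table 1.

```python
from enum import Enum


class Kind(Enum):
    CONSTANT = "constant"
    VARIABLE = "variable"


# (event, behaviour) -> (type of system, type of behaviour), as in Table 1.
SYSTEM_TYPES = {
    (Kind.CONSTANT, Kind.CONSTANT): ("routine", "deterministic"),
    (Kind.VARIABLE, Kind.CONSTANT): ("algorithmic", "deterministic"),
    (Kind.CONSTANT, Kind.VARIABLE): ("adaptive", "nondeterministic"),
    (Kind.VARIABLE, Kind.VARIABLE): ("autonomous", "nondeterministic"),
}


def classify_system(event, behaviour):
    """Return (type of system, type of behaviour) for the given pair."""
    return SYSTEM_TYPES[(event, behaviour)]


def is_deterministic(event, behaviour):
    """In Table 1, determinism depends only on whether behaviour is constant."""
    return classify_system(event, behaviour)[1] == "deterministic"


print(classify_system(Kind.CONSTANT, Kind.VARIABLE))
```

Read this way, the table says that a system is deterministic exactly when its behaviour is constant, regardless of whether its events are constant or variable.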
An interactive data mining system should have both deterministic and nondeterministic features. The deterministic features guarantee that computations run accurately and precisely. The nondeterministic features ensure that interactive communication goes smoothly.
3.3
Complexity of interactive data mining systems
Because of the special forms of interaction and the special roles and responsibilities of an interactive system, complexity issues often raise concerns during implementation. Weir identified three sources of complexity of interactive applications [17]:
– Complexity of the domain: The domain can be very complex because of the size and type of the data, and the high dimensionality and high degree of linkage that exist in the data. The domain may embody a large number of possible states. Knowledge may be determined not by a few discrete factors but by a compound of interrelated variables.
– Complexity of the control: The complexity of control can be understood as a function of the system on the domain complexity. It concerns how much time and memory space a computer algorithm may take. Different algorithms have different complexities of control.
– Complexity of the interaction: The interactive operations may include do and undo, rollback, goal reset, visualization, recommendation, and so on. Normally, the more a user demands, the more complex the overall system is.
Suppose we denote the domain as D. The complexity of the domain can be denoted as complexity(D). We also denote the set of all controls as C, where each control c ∈ C has the complexity complexity(c). Then the output of an interactive data mining system is a selected relation, r, between D and C, under the interaction I. This can be defined as:
$$r = f_I(D, C), \quad r \subseteq D \times C, \quad D \neq \emptyset, \quad C \neq \emptyset,$$
where × represents the Cartesian product of D and C, and I can be any of the forms of interaction discussed in Section 2.
3.4
An example: interactive classification
For interactive classification, a particular example of interactive data mining, Zhao and Yao analyze the complexity of domain using a granule network [20]. A granule network systematically organizes all the subsets of the universe and formulas that define the subsets. A consistent classification task can be understood as a search for the distribution of classes in a granule network defined by the descriptive attribute set. The analysis shows that the domain complexity of
the search space of a consistent classification task is not polynomially bounded. It can be extremely complex, especially when the number of possible values of each descriptive attribute is large, let alone continuous. This forces us to use heuristic algorithms to quickly find solutions in a constrained space. Each individual algorithm can be understood as a particular heuristic search within the granule network. Many different controls for interactive classification can be enumerated. For example, partition-based algorithms look for the most promising attribute to split the examined granules at each level, and each child granule is labelled by one of the possible values of the selected attribute. The child granules naturally cover their parent granule, and are pairwise disjoint. Covering-based algorithms look for the most promising attribute-value pairs at each level, namely those that are best classified by a particular class. It is possible that the granules being searched will overlap each other. Top-down algorithms start the search from the biggest granule, then heuristically search down for the most promising attributes to restrict the granule, until consistent classification solutions are found. Bottom-up algorithms start the search from the smallest granules. Pre-pruning methods are used by top-down algorithms to halt the search early once a predefined threshold is met. Post-pruning methods first grow a decision tree or a set of decision rules for the data, then prune from the tree/rules a sequence of subtrees/sub-rules, and finally try to select from this sequence the subtree/sub-rule that estimates the true regression function as well as possible. All these controls have different complexities of control. Many interactive functions have been implemented in our interactive classification project [20].
The classification accuracy of selected tree branches and of the complete constructed tree can be easily retrieved in the form of a tree view, pie chart, bar chart and/or pivot table. The measurements of attributes and attribute-values are listed, which helps the user to judge and select one for splitting. The measures can be chosen from the pre-defined measurement set, or composed by the users. Users can test the mined classification rules at any moment, continue or cease the training process according to the evaluation, split a tree node for higher accuracy, or remove an entire tree branch for simplicity. The interface of the project is illustrated in Figure 2. For more information, please refer to our paper [20]. We also plan to add the inference function to the project, using formal concepts to infer the “basic rules”. This can ensure that users keep the most important rules in mind while they are exploring freely.
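To make the partition-based, top-down control with pre-pruning and a user-selectable measure more concrete, the following is a minimal sketch. It is not the implementation of the system in [20]: information gain is assumed here as one possible measure, and the names `grow`, `choose` and `min_gain`, as well as the toy dataset, are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(rows, target):
    """Shannon entropy of the class distribution in a granule (list of rows)."""
    counts = Counter(r[target] for r in rows)
    total = len(rows)
    return -sum(c / total * log2(c / total) for c in counts.values())


def partition(rows, attribute):
    """Split a granule into disjoint child granules, one per attribute value."""
    groups = {}
    for r in rows:
        groups.setdefault(r[attribute], []).append(r)
    return groups


def information_gain(rows, attribute, target):
    """Assumed default measure; a user could supply a different one."""
    parts = partition(rows, attribute).values()
    remainder = sum(len(p) / len(rows) * entropy(p, target) for p in parts)
    return entropy(rows, target) - remainder


def grow(rows, attributes, target, measure=information_gain,
         choose=None, min_gain=0.0):
    """Top-down search: split the current granule until it is consistent."""
    classes = Counter(r[target] for r in rows)
    majority = classes.most_common(1)[0][0]
    if len(classes) == 1 or not attributes:
        return majority
    scores = {a: measure(rows, a, target) for a in attributes}
    # The user may pick the attribute after seeing the scores; else take the best.
    best = choose(scores) if choose else max(scores, key=scores.get)
    if scores[best] <= min_gain:  # pre-pruning: stop at a predefined threshold
        return majority
    rest = [a for a in attributes if a != best]
    return {best: {value: grow(part, rest, target, measure, choose, min_gain)
                   for value, part in partition(rows, best).items()}}


rows = [
    {"outlook": "sunny", "windy": "no", "play": "yes"},
    {"outlook": "sunny", "windy": "yes", "play": "yes"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
]
attributes = ["outlook", "windy"]

automatic = grow(rows, attributes, "play")
# A user who prefers to split on "windy" whenever it is still available.
user_driven = grow(
    rows, attributes, "play",
    choose=lambda s: "windy" if "windy" in s else max(s, key=s.get),
)
pruned = grow(rows, attributes, "play", min_gain=0.5)
print(automatic, user_driven, pruned, sep="\n")
```

The `choose` callback is where interaction enters: the system computes the measures, and the user decides which attribute to split on. Raising `min_gain` stops the search earlier and returns the majority class of the current granule.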
4
Conclusion
Implementing interactive computer systems is an emerging trend in the field of data mining. It aims to involve humans in the entire data mining process in order to bring about effective results. The interaction requires adaptive, autonomous systems and adaptive, active users. The performance of the interactions depends on the complexities of the domain, the control, and the available interactive approaches.
Fig. 2. The interface of the mentioned interactive classification system
As an illustration, we have implemented an interactive classification system. Future efforts will be made on other data mining tasks, such as association mining, sequential mining, clustering, and many more.
References
1. Ankerst, M., Human involvement and interactivity of the next generations’ data mining tools, ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, Santa Barbara, CA, 2001.
2. Berners-Lee, T., Weaving the Web - The Original Design and Ultimate Destiny of the World Wide Web by its Inventor, HarperSanFrancisco, HarperCollins Inc., 1999.
3. Brachman, R. and Anand, T., The process of knowledge discovery in databases: a human-centered approach, Advances in Knowledge Discovery and Data Mining, AAAI Press & MIT Press, Menlo Park, CA, 37-57, 1996.
4. Chiew, V. and Wang, Y., Formal description of the cognitive process of problem solving, Proceedings of ICCI’04, 74-83, 2004.
5. Elm, W.C., Cook, M.J., Greitzer, F.L., Hoffman, R.R., Moon, B. and Hutchins, S.G., Designing support for intelligence analysis, Proceedings of the Human Factors and Ergonomics Society, 20-24, 2004.
6. Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P. and Uthurusamy, R. (eds.), Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1996.
7. Han, J., Hu, X. and Cercone, N., A visualization model of interactive knowledge discovery systems and its implementations, Information Visualization, 2(2), 105-125, 2003.
8. Hancock, P.A. and Scallen, S.F., The future of function allocation, Ergonomics in Design, 4(4), 24-29, 1996.
9. Mannila, H., Methods and problems in data mining, Proceedings of International Conference on Database Theory’ 1997, 41-55, 1997.
10. Matlin, M.W., Cognition, fourth edition, Harcourt Brace and Company, 1998.
11. Mayer, R.E., Thinking, Problem Solving, Cognition, second edition, W.H. Freeman and Company, 1992.
12. Ormrod, J.E., Human Learning, third edition, Prentice-Hall, Inc., Simon and Schuster/A Viacom Company, 1999.
13. Shneiderman, B., Designing the User Interface: Strategies for Effective Human-Computer Interaction, third edition, Addison-Wesley, 1998.
14. Wang, Y.X., On cognitive informatics, Proceedings of ICCI’02, 34-42, 2002.
15. Wang, Y.X. and Liu, D., On information and knowledge representation in the brain, Proceedings of ICCI’03, 26-29, 2003.
16. Wang, Y.X., On autonomous computing and cognitive processes, Proceedings of ICCI’04, 3-4, 2004.
17. Weir, G.R., Living with complex interactive systems, in: Weir, G.R. and Alty, J.L. (Eds.), Human-Computer Interaction and Complex Systems, Academic Press Ltd., 1991.
18. Yao, Y.Y., Zhao, Y. and Maguire, R.B., Explanation-oriented association mining using rough set theory, Proceedings of Rough Sets, Fuzzy Sets and Granular Computing, 165-172, 2003.
19. Yao, Y.Y., Zhong, N. and Zhao, Y., A three-layered conceptual framework of data mining, Proceedings of ICDM Workshop of Foundation of Data Mining, 215-221, 2004.
20. Zhao, Y. and Yao, Y.Y., Interactive user-driven classification using a granule network, Proceedings of ICCI’05, to appear.