A Visual Language for Internet-based Data Mining and Data Visualisation

Jaturon Chattratichat, Yike Guo, Jameel Syed
Imperial College, University of London
180 Queen's Gate, London SW7 2BZ, United Kingdom
{jc8,yg,jas5}@doc.ic.ac.uk

Abstract

This paper describes a novel application of enhanced visual programming and visualisation techniques to support data mining processes on the Internet. While the idea of using visual languages to support data mining has been proven to be useful, the usability of existing implementations has been limited. Here, we consider the issue of usability of data mining via the Internet. We also present "interactive visual programming", a method which automates the construction of a visual program through a direct manipulation interface and visualisation. We also illustrate new techniques for data and model visualisation that can aid the understanding of data and models.

1. Introduction

1.1 What is Data Mining?

Data Mining is the search for valuable information in large volumes of scientific or business data. It combines the fields of Databases and Data Warehousing with machine learning algorithms and statistical methods to gain insight into hidden structures of the data. The challenge of extracting actionable knowledge from available data sources is achieved by addressing the following issues:
• The physical size of the distributed data sources
• The computational requirements of analytical algorithms when executed over very large data sets
• The usability of the data mining system
• The ability to support the data mining process
The characteristics of the problem imply that data mining within a client-based environment is not practical. Data mining needs to utilise high-performance architectures for the large-scale computational tasks involved. The web is therefore an ideal and crucial tool for co-ordinating data mining tasks and distributing workloads. At present, there is a growing interest in providing web-based support for such data-intensive activities.

1.2 Why a visual language for data mining?

Data mining is not a one-step task. It has been defined as an iterative process [2], where each sub-step can be repeated. Visual languages are therefore ideal for providing a friendly interface to a non-expert user. Systems such as Clementine (ISL) [4] and Enterprise Miner (SAS) [5] are the most promising attempts at integrating visual languages into the user interface for data mining. However, these client-based systems lack the capability of supporting web-based or enterprise-wide distributed execution. Moreover, the size of data they can handle is limited by the physical configuration of the client machine. Although these tools allow the user to interact with a visual abstraction of the data mining process, the user has little access to the underlying data they are manipulating. For example, in Clementine, the user is not able to view the effect of an individual data manipulation function until the whole procedure has been defined and executed; the user has to add a table-viewer node at the end of the procedure and execute it to view the resulting data. This tedious task of adding a table-viewer node has to be repeated each time the user wants to see the result of a manipulation function. We argue that interaction through direct manipulation [11] - performing operations and immediately seeing the results - on both data and models can enhance the effectiveness of visual programming in the context of data mining. In this paper, we introduce enhancements to visual programming in the context of data mining. The value-added features and capabilities also allow for integration with the Internet platform and thereby increase the usability of the system.

1.3 Kensington: An enterprise data mining system

The Kensington Enterprise Data Mining system is a research product built by the Data Mining Group at Imperial College [3]. It is a multi-user data mining system based on a three-tier architecture. The motivation behind this system is to deliver an easy-to-use, large-scale data mining system through the use of the Internet. Based on this motivation, Kensington employs visual programming and visualisation techniques for the client front end. The visual interface is supported by one or more application middleware (EJB) servers and mining servers. The components in each tier can reside in different locations and therefore reflect the true physical distribution of an organisation. The system employs the Internet as its main communication infrastructure, and thereby enables data mining to be performed anywhere. Kensington's middleware provides support for database access, file access, storage and mining task execution management. The third-tier mining servers provide support for the bulk of the data crunching and numerical operations.

The main focus of this paper is the visual language aspect of the client application. The goal is to build a simple yet powerful interface on a very "thin" client that requires minimal configuration. The challenge is to take advantage of the web infrastructure transparently. The main contributions of the system outlined in this paper are as follows:
• Allowing data mining to be performed across the Internet through the use of a visual language.
• Enhancing reusability and enabling rapid redeployment of visual programs through object serialisation and XML file output.
• Using a novel method for visually interacting with models.
• Introducing interactive visual programming.
• Building interactive Decision Trees through visualisation.

Figure 1: Data mining task flow and associated modules. The flow runs from Data Retrieval (database and table modules) through Manipulation (split, delete, transform and filter modules) and Data Mining (machine learning and statistical algorithms) to Model Analysis (model testing and model visualisation modules), with data flowing between the first three stages, a model passing to the analysis stage, and a report or reusable procedure as the final output.

In the next section, we introduce the visual programming features of Kensington. We have enhanced existing features that contribute to the effective use of visual programming as an interface for data mining. We also discuss how this visual interface deals with remote data on the Internet. In Section 3, we introduce novel techniques for integrating interactive data and model manipulation through appropriate visualisations. The last section outlines the main contributions of this paper.

2. Visual programming for data mining

In this section, we outline the design of an environment that supports the visual construction of a data mining procedure.

2.1 The process of data mining

The data mining process commonly consists of four main stages: data retrieval, data preparation, data mining and model analysis [1][2]. The data retrieval stage is the initial step of loading data sources. The data preparation stage transforms and prepares the data for the mining stage. The mining stage involves the application of machine learning or statistical algorithms to produce a model. This model can be further analysed for accuracy or utilised for decision support. In designing a visual programming language for data mining, we take into consideration the nature of each of these steps. Each step is supported by a set of tools or components which perform a task-specific function. The four categories are described in Figure 1. The left part of the figure describes the logical flow of information in a data mining task; the caption between each box describes the output of the previous step. The retrieval step initially loads data, which is fed into the manipulation step. The modified data then becomes the input to the mining step. After the mining operation, a model is produced and finally evaluated. As shown on the right part of the figure, component nodes are grouped and categorised into different context-sensitive steps.
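As a minimal illustration of this four-stage flow, the following sketch (in Java, the platform the system is built on) chains task components into a linear dataflow. The Component and Task names are ours, chosen for illustration, and do not reflect Kensington's internal API.

    import java.util.ArrayList;
    import java.util.List;

    // A data mining task as a linear dataflow of components, mirroring Figure 1.
    interface Component {
        Object apply(Object input);          // data in, data (or a model) out
    }

    class Task {
        private final List<Component> stages = new ArrayList<>();

        Task then(Component c) { stages.add(c); return this; }

        Object run(Object source) {
            Object current = source;
            for (Component stage : stages) {
                current = stage.apply(current);  // retrieval -> preparation -> mining -> analysis
            }
            return current;                      // final output: a model or an evaluation report
        }
    }

A configuration such as retrieval followed by a delete and a decision tree would then read new Task().then(retrieval).then(delete).then(decisionTree).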

2.2 Visual task configuration

The visual interface design of the Kensington system is based on the data mining process model. Its familiar icon-based visual programming support allows the user to create a data mining task configuration, which can be remotely executed via the Internet. The user is presented with a range of data manipulation and data mining tools, which can be dragged and dropped onto a project window, as shown in Figure 2. Similar to most visual programming systems, components are represented by unique icons that may be connected together to form a task configuration. Each component is configurable with the support of a floating property pane. Each modification to the task configuration causes it to be syntactically checked for errors and invalid states. Since each communication with the remote application server has a relatively high cost, it is important that as many errors in the graph as possible are eliminated prior to remote execution. Each component may be in one of four states:
• Error (red background) – some parameters associated with the component have not been specified correctly.
• Invalid (white foreground) – the component has no data or model input.
• Valid (black foreground) – the component is syntactically correct.
• Locked (black background) – the component has been locked by a running process and cannot be modified, removed or disconnected.
Colouring components to indicate their states allows the user to see the correctness of the whole mining procedure at a glance. If a node contains an error, the user is able to view a detailed error message and correct the problem by using the properties panel.
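The sketch below shows one way the four states could be assigned on the client; the class and field names are hypothetical simplifications, not Kensington's code.

    import java.util.List;

    // The four node states; rendered red, white, black and black-background in the UI.
    enum NodeState { ERROR, INVALID, VALID, LOCKED }

    class Node {
        boolean parametersValid;   // set by the floating property pane
        boolean locked;            // held by a running remote execution
        List<Node> inputs;         // upstream data or model connections

        NodeState state() {
            if (locked) return NodeState.LOCKED;
            if (!parametersValid) return NodeState.ERROR;
            if (inputs == null || inputs.isEmpty()) return NodeState.INVALID;
            return NodeState.VALID;
        }
    }

A whole task configuration would be considered syntactically correct, and therefore executable, only when every node reports VALID.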

Within this visual programming environment, the user is given the flexibility to switch between different projects and perform concurrent operations. Through multithreading support, it is possible to execute a remote task while defining the next task configuration.
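A minimal sketch of this multithreading arrangement, assuming the blocking call to the application server is wrapped in a Callable; the class name is illustrative.

    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    // Remote executions run on a worker thread so the project window stays responsive.
    class ConcurrentTaskRunner {
        private final ExecutorService pool = Executors.newCachedThreadPool();

        // remoteCall wraps the (blocking) request to the application server
        <M> Future<M> submit(Callable<M> remoteCall) {
            return pool.submit(remoteCall);   // the interface thread returns immediately
        }
    }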

Figure 2: A visual program and the configurable properties for each node

Figure 3: (Top) Selecting area to be cloned. (Bottom) Two new cloned nodes are added.

2.3 Web-based data handling support

The key issue in web-based data mining is the ability to handle remote data efficiently. In a real-world organisation, data is loaded from a (possibly) remote database server via the Internet. Many visual data mining systems require that the data reside on the local machine, but it is impractical to pre-load the entire data set onto a "thin" client. We propose a visual task configuration approach which eliminates this precondition. In data mining, it is not necessary for the user to view the entire data set in order to construct a visual program or task. The key idea is to keep only a pointer to the data on the remote server and a small sample of the data on the client. The term pointer refers to the reference given to a remote data object. Using this pointer, we request the server to calculate statistical functions (mean, standard deviation, minimum and maximum of each attribute) and to execute queries. The results of these operations are relatively small amounts of data and can be efficiently transmitted to the client, even over a low-bandwidth connection. These calculations provide sufficient information for the user to construct a data mining task without direct access to the whole data set.

Figure 2 illustrates the task configuration of a simple data mining task using the decision tree classification algorithm [8]. The four components involved are the "Table", "Delete", "Filter" and "Decision Tree" components. The left-most icon represents a table pointer, which propagates both metadata (i.e. the type and name of each attribute) and statistical information to the "Delete" node. The delete node simply removes the chosen column from the metadata and then propagates the amended information to the "Filter" node. In the filter node, data items that lie outside the specified range are eliminated. The last node, "Decision Tree", performs a data mining operation and produces a model. The outcome of the above task configuration is a visually defined data mining program that uses an efficient data handling mechanism. Only when the whole task is ready to be executed (i.e. syntactically correct) is the entire task configuration or program sent to the server to be executed over the whole population data set. The model is then returned to the client. Since the visual support has rid the client of the unnecessary burden of handling huge amounts of data, the task configuration scheme can be further extended to a parallel execution scheme on different remote sites. It is possible to perform a parallel execution of multiple remote tasks on data sets residing in many separate physical locations. This sort of configuration is useful for enterprises with distributed resources that want to eliminate the unnecessary mass transfer of data.
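The sketch below illustrates the pointer-plus-sample idea: the client keeps a lightweight reference and asks the server for per-attribute summaries rather than the data itself. The interfaces and names are assumptions made for illustration, not Kensington's actual remote API.

    import java.util.List;
    import java.util.Map;

    // Per-attribute summary computed on the server over the full data set.
    class AttributeSummary {
        String name;
        double mean, stdDev, min, max;
    }

    interface RemoteTable {
        List<String> attributeNames();                  // metadata only
        List<AttributeSummary> summaries();             // small, cheap to transmit
        List<Map<String, Object>> sample(int rows);     // a few rows for the client-side view
    }

    // A table pointer: a reference to data that never leaves the server.
    class TablePointer {
        private final String serverUrl;   // where the data actually lives
        private final String tableId;     // reference handed out by the middleware

        TablePointer(String serverUrl, String tableId) {
            this.serverUrl = serverUrl;
            this.tableId = tableId;
        }
        // a resolve() method would contact the middleware and return a RemoteTable stub
    }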

2.4 Enhanced visual support for rapid program construction

The visual interface provides enhanced features for supporting a truly iterative and interactive data mining process. This means that the same task, with slight changes to the configuration (say, some parameters), can be executed many times over different data sets. For comparative analysis, it is important for the user to compare how different data mining algorithms perform on the same set of data. Each component or group of components can be cloned and replicated within the same session. This means that repetitive programs or modules need not be redefined. Figure 3 illustrates the cloning operation. Initially, the desired nodes are selected (visually highlighted) by making sure that they fall within the selected grey area. A clone command from the popup menu will replicate the selected nodes and attach them to the original parent node.

Another extension we introduce is the notion of histories. In the data mining context, the model produced from a mining operation contains useful knowledge about the data. However, a model is meaningless without knowing its history (a description of the task configuration which produced the model). For example, a model that was learned from last month's data set would not be as interesting as one learned from today's updated data. In cases where the confidentiality of the data is not an issue, the history of each model is remembered so that its task configuration can be recreated and re-executed. Figure 4 illustrates the use of histories stored within a model. The top pane illustrates the production of a model, shown as a "diamond" shaped object, after the four-node task execution. This model can then be stored or transferred to a different user (in the bottom pane), who can see how the exact history of the model is revealed. The user then has the freedom to re-execute the task or change the parameters to produce a new model.

Figure 4: History of the tree model restored

Since the main focus of the data mining system revolves around data and databases, a faster method of accessing the databases is needed. In Kensington, each user is given an object reference to a directory of "database bookmarks", as shown in Figure 5. A database bookmark is a visual object which encapsulates information about connectivity to a remote database. At the user's command, new data sources can be loaded and stored instantaneously, without having to deal with the complicated issues of connection drivers, platforms or authentication. The properties of these bookmarks are predefined and can be distributed by a database administrator to potential users.

Figure 5: Database Manager manages a directory of bookmarks and starts a query
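A database bookmark might be modelled roughly as below; the fields shown are illustrative and a real bookmark may also carry driver- and platform-specific settings not listed here.

    import java.io.Serializable;

    // Packages everything needed to open a remote connection so the end user
    // never sees driver details. Field names are illustrative.
    class DatabaseBookmark implements Serializable {
        private final String label;       // name shown in the Database Manager
        private final String jdbcDriver;  // driver class pre-selected by the administrator
        private final String url;         // connection URL of the remote database
        private final String username;    // credentials pre-set by the administrator

        DatabaseBookmark(String label, String jdbcDriver, String url, String username) {
            this.label = label;
            this.jdbcDriver = jdbcDriver;
            this.url = url;
            this.username = username;
        }

        String label() { return label; }
    }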

These key visual features have been introduced to increase the user's efficiency and reduce the complexity of constructing visual programs.

2.5 Support for reusability and redeployment

For the purpose of re-deployment of a defined task configuration, the system allows the entire task, or a subset of it, to be stored persistently and retrieved at a later stage. At present we employ two methods of persistence: Java object serialisation [7] and an XML-based [10] Data Mining Markup Language (DMML) file output. Object serialisation benefits from being a compact representation and allows for faster loading and saving. To complement this, the DMML text output provides an open, user-editable representation of task configurations and their components. The DMML format has the advantage of being human-readable, and its well-structured format is neutral to any user interface. Storing a task is synonymous with saving the source code of a program, which can be redistributed to other users and re-executed at a later stage.

Besides a component, a mining model can also be saved and distributed to other users. In many cases, a group of users may want to share and transfer learned models instead of data. This stems from two reasons: (1) models are usually smaller in size than the data and (2) the data may be confidential and should not be readable by anyone else. Sharing models can therefore reduce the size of information transferred as well as preserve the anonymity of the data.

In addition, Kensington's visual support for abstraction provides a means of grouping component nodes into a single abstract interface. This object-oriented terminology refers to the action of hiding complex or long tasks within a single node. The "grouping" mechanism implements this abstraction idea by allowing any number of linked tasks to be grouped, provided that they belong to the same task configuration. This feature reduces the complexity of the program code and improves the visual presentation of the program.
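The sketch below illustrates the two persistence routes on a heavily simplified task object. The XML element names are invented for illustration only and do not show the actual DMML schema.

    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.ObjectOutputStream;
    import java.io.Serializable;
    import java.io.Writer;
    import java.util.List;

    // Simplified stand-in for a task configuration: node descriptions only.
    class TaskConfiguration implements Serializable {
        List<String> nodes;

        // Route 1: Java object serialisation, compact and fast to reload.
        void saveBinary(String path) throws IOException {
            try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(path))) {
                out.writeObject(this);
            }
        }

        // Route 2: a human-readable, user-editable XML text form.
        void saveXml(Writer out) throws IOException {
            out.write("<task>\n");
            for (String node : nodes) {
                out.write("  <node>" + node + "</node>\n");
            }
            out.write("</task>\n");
        }
    }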

3. Visual interaction environment

In this section, we discuss the support for the "interactive" nature of data mining. The conventional method of visual programming is to drag and drop icons and connect them into a visual program. Each icon is essentially the visual abstraction of a function module. Although this popular technique is widely accepted by novice programmers, the user is still subject to a learning curve in order to get used to the meaning of each icon or colour representation. In this section, we propose an interactive technique which is built on top of the visual programming interface described in the previous section. The logical view of the system is illustrated in Figure 6. Notice that the bottom two levels describe the existing logical view of current visual programming implementations. The visual program is defined as a directed acyclic graph, and this is translated into real program code which, with the support of a server, can be executed across the Internet. In our extension, we allow the user to manipulate a higher level of interface, which can produce the same kind of visual program code.

Figure 6: Hierarchy of logical views for visual interaction (interactive level: user interaction; drag-and-drop visual programming; execution level)

3.1 Interactive visual programming

In this section, we discuss the concept of "interactive visual programming". We introduce a high-level interface for producing a visual program without using the drag-and-drop method. This interface is able to give the user immediate feedback during the construction of the visual program. We implement a "Table View Editor" which provides a view of the (sample) data. By clicking on one of the action buttons, the view of the data is instantaneously modified. At the same time, an associated visual programming icon node is created in the project window. The user is able to "undo" or "redo" each operation. The novelty here is that the system immediately provides the user with an updated view of the data, as well as incrementally constructing the visual program, since operations over the sample data are executed on the client independently of the application server. Hence, there is minimal latency associated with each operation.
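A rough sketch of how such an editor might apply operations locally, keep an undo stack and accumulate the temporary nodes to be committed to the visual program; all class names here are illustrative, not the system's code.

    import java.util.ArrayDeque;
    import java.util.ArrayList;
    import java.util.Deque;
    import java.util.List;
    import java.util.Map;

    // An operation applied to the client-side sample; it also names the node
    // that will be appended to the visual program (e.g. "Delete", "Discretize").
    interface TableOperation {
        List<Map<String, Object>> apply(List<Map<String, Object>> sample);
        String nodeName();
    }

    class TableViewEditor {
        private List<Map<String, Object>> view;                       // current sample view
        private final Deque<List<Map<String, Object>>> undo = new ArrayDeque<>();
        private final List<String> pendingNodes = new ArrayList<>();

        TableViewEditor(List<Map<String, Object>> sample) { this.view = sample; }

        void perform(TableOperation op) {
            undo.push(view);                 // remember the previous view
            view = op.apply(view);           // immediate feedback on the sample
            pendingNodes.add(op.nodeName()); // temporary node for the visual program
        }

        void undoLast() {
            if (!undo.isEmpty()) {
                view = undo.pop();
                pendingNodes.remove(pendingNodes.size() - 1);
            }
        }

        List<String> commit() { return pendingNodes; }   // nodes added to the project window
    }

A delete-column operation, for instance, would remove the chosen key from every sample row and report "Delete" as its node name.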

Figure 7: Interactive visual program construction

Figure 7 shows an example of interactive visual programming. The initial display table provides a view of the unmodified data table object. The data consists of the following columns: "ID", "income", "age", "children", "sex" and "region". When the user clicks on the delete column button, the selected column "ID" is eliminated from view and, at the same time, a new "delete" node is added to the visual program (shown as a temporary node in the bottom pane). Next, a "discretize column" button is selected. This node puts the values of a numerical attribute into different bins. When the user enters the names and number of bins, the values of the "age" attribute are changed and a new node is inserted into the visual program. Notice in the figure that the column "ID" has been deleted from the view and the column "age" has been modified by the discretize operation. When all the operations are "committed", the resulting visual program containing the two new nodes is added to the project window.

Figure 8: Statistical summaries for each attribute

A separate tab pane showing a statistical summary for each attribute is also provided, as shown in Figure 8. For continuous attributes, a box plot showing the distribution statistics is drawn. If the data is categorical, then a histogram of the distinct values is shown. Initial calculations are based on the sample data (top row in the figure); however, the user is given the option of viewing statistical summaries of the population data set. The interactive presentation of information about each attribute allows an analyst to decide which attributes should be adjusted and which should be discarded.

3.2 Model visualisation and interaction

The result of a data mining task is a model. In data mining terms, this model contains a piece of actionable knowledge which could be used to help in decision making. The key difference between data and model visualisation is that while the structure of data is uniform, different models contain irregular structures and formats. The structure of the model is often dependent on the kind of algorithm being used. For example, a classification algorithm produces a tree or rule model, while a clustering algorithm produces a map of clusters. Model visualisation for data mining is still an active research area. The challenge is to clearly present the given model in a concise and useful form. The Kensington system offers various model visualisers to display the result of a mining execution. In this paper, we focus on the visualisation of classification trees.

Conventional 2D and 3D visualisers [6] are useful tools for displaying tree structures that focus on a certain branch. In the case of displaying deep or fat trees, scrollbars are used to show a view of a portion of the tree. In doing so, the structure of the entire tree is hidden from view. This "focus + context" issue is addressed by the hyperbolic tree browser [9]. The browser plots the tree on a hyperbolic disc, with the root at the centre and each successive branch spreading along the next ring of the disc. This approach is extremely useful for displaying a directory structure or trees with multiple deep branches. In our approach, we extend the hyperbolic idea to convey more information within each node. The 3D hyperbolic tree visualiser employs a novel technique of arranging the tree structure onto a 2D hyperbolic plane and projecting information about each node onto the third dimension. The idea is to convey as much information as possible without losing sight of the "big picture". Figure 9 shows a decision tree being displayed with each node consisting of a base supporting three 3D histograms.

Figure 9: 3D Hyperbolic visualiser

In our example, the height of each histogram represents the frequency of the three possible target class values. The node with the tallest histograms is the root and, in the case shown in the figure, it branches out into four sub-branches. Here, we can distinctly see that one of the four sub-branches has a deep and fat structure. Each path from the root to a leaf represents a unique rule. The end leaf node reveals the probability of each target class value, while the immediate parent displays the sub-total of all its children. This technique allows analysts to quickly study the structure of a tree and visually judge the "frequency" of each class value. The advantage of this technique compared to the 2D hyperbolic tree browser is that it enables additional information about each node of the tree to be displayed without cluttering up the canvas. Here, we can distinguish among the different nodes and their associated histograms.
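For illustration, the sketch below lays a tree out on a unit disc so that deeper levels are compressed towards the rim, which captures one property exploited by the hyperbolic browser. It is a much-simplified radial layout, not the Lamping-Rao construction [9] and not the visualiser's actual layout code.

    import java.util.ArrayList;
    import java.util.List;

    class TreeNode {
        List<TreeNode> children = new ArrayList<>();
        double x, y;                       // position on the unit disc

        int leafCount() {
            if (children.isEmpty()) return 1;
            int n = 0;
            for (TreeNode c : children) n += c.leafCount();
            return n;
        }
    }

    class DiscLayout {
        // Place 'node' inside the angular wedge [startAngle, endAngle) at the given depth.
        static void layout(TreeNode node, double startAngle, double endAngle, int depth) {
            double radius = Math.tanh(0.4 * depth);            // 0 at the root, approaching 1 near the rim
            double angle = (startAngle + endAngle) / 2.0;
            node.x = radius * Math.cos(angle);
            node.y = radius * Math.sin(angle);

            double span = endAngle - startAngle;
            double a = startAngle;
            for (TreeNode child : node.children) {             // wedge split in proportion to leaf count
                double w = span * child.leafCount() / (double) node.leafCount();
                layout(child, a, a + w, depth + 1);
                a += w;
            }
        }
    }

In the 3D visualiser, a histogram bar per class value would then be raised above each computed (x, y) position.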

3.3 Interactive Mining

The nature of visual programming implies that an analyst has to wait for the completion of each task execution to see the model. In the learning step, when the data is being analysed by the learning algorithm, the user has no way of interacting with the learning process. The only interaction is to either accept or reject the model produced. It is therefore more desirable if a learning algorithm allows the user to interact with it whilst building a model. To this end, we introduce the idea of using visualisation as an abstract form of a visual language for building models interactively.

In order to prove this concept, we implemented a classification algorithm [8] and added a visualisation interface on top. The algorithm builds a top-down decision tree by starting with a single node and then branching according to a condition on an attribute, and so on. The selection criterion for choosing the attribute to branch on is based on the "gain" of the attribute. The result is a decision tree with multiple branches. The motivation behind interactive model building is that an analyst with domain knowledge should be allowed to contribute to the learning process. The machine learning algorithms base their decision to branch on pre-defined heuristics. Although the decision is statistically sound, it is the analyst, after all, who judges the usefulness of the model produced. Therefore, we allow the machine to compute statistical information to support the user in making their decision.

The concept of interactive visualisation is illustrated in Figure 10. In our design, the user can incrementally construct models using graphical visualisation methods. During the learning process the visualisation module displays the structure of the partially constructed decision tree. The user can choose in which way the tree should split by selecting from the available attributes. The table presents the list of attributes with their associated "gain" values. The visualiser highlights the choice that the built-in heuristics would make and prompts the user for confirmation or alteration of this choice.

Figure 10: Interactive visual model builder

In the example, we build a decision tree for the "mailshot" data set. The goal of the learning task is to create a predictor for deciding whether a customer would respond to a mail advertisement, given information and characteristics about the customers such as age, income, etc. The probabilities of the two possible classes (Respond and Ignore) are represented by the height of each histogram on a node (the lighter shade represents Respond, while the darker shade represents Ignore). The first level of the decision tree is split on the car attribute. At the next stage the algorithm estimates the married (27%) and children (23%) attributes to be the most significant ones. Even though the information gain figure for the children attribute is not the highest, the analyst may find it to be a more important attribute and could select it. This interactive process continues until the analyst feels that the tree is sufficient for the application. Some learning algorithms applied to large data sets may require a substantial amount of time to complete the model building stage. An analyst can choose to construct part of, or the entire, model automatically.
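As an illustration of the "gain" figures offered to the analyst, the sketch below computes the information gain of a categorical attribute; the heuristic choice is simply the attribute with the highest gain, which the visualiser highlights while leaving the final selection to the user. This is a generic textbook computation, not Kensington's implementation.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    class GainCalculator {

        // Entropy (in bits) of a list of class labels.
        static double entropy(List<String> labels) {
            Map<String, Integer> counts = new HashMap<>();
            for (String l : labels) counts.merge(l, 1, Integer::sum);
            double h = 0.0, n = labels.size();
            for (int c : counts.values()) {
                double p = c / n;
                h -= p * (Math.log(p) / Math.log(2));
            }
            return h;
        }

        // attributeValues.get(i) is the attribute value of example i;
        // labels.get(i) is its class (e.g. Respond or Ignore).
        static double informationGain(List<String> attributeValues, List<String> labels) {
            double parent = entropy(labels);
            Map<String, List<String>> partitions = new HashMap<>();
            for (int i = 0; i < labels.size(); i++) {
                partitions.computeIfAbsent(attributeValues.get(i), k -> new ArrayList<>())
                          .add(labels.get(i));
            }
            double weighted = 0.0, n = labels.size();
            for (List<String> part : partitions.values()) {
                weighted += (part.size() / n) * entropy(part);
            }
            return parent - weighted;   // gain of splitting on this attribute
        }
    }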

The second example, shown in Figure 11, combines model tuning with model visualisation. Models generated in the learning stage are frequently subjected to refinement. The refinement process is based on heuristics, and usually there is no notion of "interestingness" in the pruning algorithm. In the worst case, the reduction in model complexity is achieved by summarising or deleting cases which the analyst would have considered interesting and worth a closer inspection. Close interaction should therefore be an option in this stage. Consider again the decision tree within a graphical tree visualiser. The pruning algorithm works as follows: (1) from the bottom up, calculate the error of a subtree as error1; (2) calculate the potential error error2 that would occur if the subtree were replaced by a leaf node labelled with the local majority class; and (3) replace that subtree with a leaf labelled with the majority class if error1 is greater than error2. We allow user intervention at decision step (3), so that the analyst can keep subtrees that are of potential interest. As shown, the user chooses the sub-tree over which the pruning should be attempted. In this case, the user has chosen the sub-tree with the label "married=yes". The pruning algorithm is now performed for this sub-tree only.

Figure 11: Interactive visual model pruning
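The sketch below mirrors the three pruning steps, with the analyst's decision injected at step (3) through a callback. The error counting, class bookkeeping and names are simplified assumptions rather than the system's actual code.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    class DecisionNode {
        List<DecisionNode> children = new ArrayList<>();
        Map<String, Integer> classCounts;    // training examples per class reaching this node
        String label;                        // set when the node becomes a leaf

        boolean isLeaf() { return children.isEmpty(); }

        int total()    { return classCounts.values().stream().mapToInt(Integer::intValue).sum(); }
        int majority() { return classCounts.values().stream().mapToInt(Integer::intValue).max().orElse(0); }

        int subtreeError() {                                   // error1: sum of leaf errors
            if (isLeaf()) return total() - majority();
            int e = 0;
            for (DecisionNode c : children) e += c.subtreeError();
            return e;
        }
    }

    interface Confirm { boolean prune(DecisionNode subtree); }  // the analyst's decision at step (3)

    class InteractivePruner {
        static void prune(DecisionNode node, Confirm user) {
            for (DecisionNode child : node.children) prune(child, user);   // bottom up
            if (node.isLeaf()) return;
            int error1 = node.subtreeError();                   // error of the subtree
            int error2 = node.total() - node.majority();        // error if collapsed to a leaf
            if (error1 > error2 && user.prune(node)) {          // step (3), subject to the analyst
                node.children.clear();
                node.label = majorityClass(node);
            }
        }

        static String majorityClass(DecisionNode node) {
            return node.classCounts.entrySet().stream()
                    .max(Map.Entry.comparingByValue()).map(Map.Entry::getKey).orElse(null);
        }
    }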

5. Conclusion

In this paper, we have provided a brief introduction to data mining and presented an extension to the conventional visual programming paradigm to support the structure of the data mining process. We have also presented an interactive visual programming technique which enables the user to build up visual programs conveniently and effectively. We have argued that the use of visualisation and visual programming are complementary and should be utilised together. We have introduced a novel model visualisation technique for displaying tree structures. The idea of using visualisation as a programming tool was also illustrated using the interactive tree builder. Information on the middleware and mining server can be obtained from our web site (http://ruby.doc.ic.ac.uk).

6. Acknowledgements

The authors would also like to thank J. Darlington, S. D. Hedvall, M. Köhler, and J. Forbes-Millott for their contribution to the system.

7. References

[1] R. J. Brachman and T. Anand. The process of knowledge discovery in databases, in Advances in Knowledge Discovery and Data Mining. MIT Press, 1996.
[2] U. M. Fayyad, G. Piatetsky-Shapiro, and P. Smyth. From data mining to knowledge discovery: An overview, in Advances in Knowledge Discovery and Data Mining. MIT Press, 1996.
[3] J. Chattratichat, J. Darlington, Y. Guo, S. Hedvall, M. Köhler, J. Syed. An Architecture for Distributed Enterprise Data Mining, High Performance Computing and Networking 1999, Amsterdam, Netherlands.
[4] Integral Solutions Ltd. Clementine User's Guide, 1995.
[5] SAS Institute. SAS data mining solution white paper. http://www.sas.com/software/datamining/whitepapers/
[6] Silicon Graphics. MineSet Data Mining Product. http://www.sgi.com/software/mineset/
[7] Java Object Serialization, http://java.sun.com/
[8] E. B. Hunt, J. Marin, P. J. Stone. Experiments in Induction, New York: Academic Press, 1966.
[9] J. Lamping, R. Rao, P. Pirolli. A focus + context technique based on hyperbolic geometry for visualizing large hierarchies, Proc. of CHI'95 Conference: Human Factors in Computing Systems, ACM, New York (1995), 401-408.
[10] XML, http://www.w3.org/TR/1998/REC-xml-19980210
[11] B. Shneiderman. Dynamic queries for visual information seeking, IEEE Software, 11(6) (1994), 70-77.