
Journal of Visual Languages and Computing 14 (2003) 543–589

Journal of Visual Languages & Computing www.elsevier.com/locate/jvlc

Visual data mining modeling techniques for the visualization of mining outcomes

Ioannis Kopanakis, Babis Theodoulidis*

CRIM - Center of Research in Information Management, Department of Computation, UMIST, PO Box 88, Sackville Street, Manchester M60 1QD, UK

Received 2 August 2002; received in revised form 7 March 2003; accepted 9 June 2003

Abstract

The visual senses for humans have a unique status, offering a very broadband channel for information flow. Visual approaches to analysis and mining attempt to take advantage of our abilities to perceive pattern and structure in visual form and to make sense of, or interpret, what we see. Visual Data Mining techniques have proven to be of high value in exploratory data analysis and they also have a high potential for mining large databases. In this work, we try to investigate and expand the area of visual data mining by proposing new visual data mining techniques for the visualization of mining outcomes. © 2003 Elsevier Ltd. All rights reserved.

Keywords: Visual data mining; Databases; Association rules; Classification

*Corresponding author. Tel.: +44-161-200-3309; fax: +44-161-200-3324. E-mail addresses: [email protected] (I. Kopanakis), [email protected] (B. Theodoulidis). URL: http://www.crim.co.umist.ac.uk. doi:10.1016/j.jvlc.2003.06.002

1. Data mining

The process of searching and analyzing large amounts of data is called "data mining". Large collections of data are potential lodes of valuable information but, as in real mining, the search and extraction can be a difficult and exhaustive process [1]. Data mining is a knowledge discovery process of extracting previously unknown, actionable information from very large databases. In detail, it is the non-trivial extraction of implicit, previously unknown and potentially useful information from data. In other words, it is the search for relationships and global patterns that exist in large databases but are "hidden" among the vast amounts of data. These relationships represent valuable knowledge about the database and objects in the world [2].

1.1. Data mining life cycle

We view the life cycle of the data mining operation as a three-stage process of preparing the data for mining, deriving the model, and using the knowledge obtained from the data [3] (Fig. 1).

Fig. 1. Information flow in data mining life cycle. [Diagram labels: Data Preparation Stage (scrub, verify, summarize data); Operational Database; Data Warehouse; Selection of Training Data Sample; Training Data for Knowledge Model Learning; Model Derivation Stage (model derivation algorithm plus user guidance of the learning process); Knowledge Engineer; Models Learned from Training Data; Validation Stage (verify and evaluate); Selection of Most Interesting Models; Interesting Models; Model Usage and Population Shift Monitoring and Incremental Learning.]

The data preparation stage deals with improving the data quality and summarizing the data to facilitate the analysis and discovery process. Data mining can be done either on operational databases or on a data warehouse, which is usually a summary database of the various businesses of an enterprise. The quality of the data in the data warehouse is constantly monitored by data analysts. Because the source databases are heterogeneous and enforce non-standard data-quality policies, the warehouse data is usually cleaned or standardized via data scrubbing.

The model derivation stage focuses on choosing learning samples, testing samples and learning algorithms. Due to the large volume of available data, data mining may be done on subsets of the data from the data warehouse. An appropriate data sample is selected from the data in the warehouse and is checked for descriptiveness. This process may have to iterate a few times before a suitable sample set can be selected. The selected sample dataset forms the training data for the data-mining algorithm. The data-mining process is viewed in our framework as the derivation of an appropriate knowledge model of the patterns in the data that are interesting to the user. The algorithm for model derivation, together with the guidance provided by the user, will generally produce several models of the information contained in the data. The data-mining algorithms use guidance from the analyst to decide various parameters of the model being learned from the data, such as its accuracy and prevalence, and to control the computational complexity of the learning process. Among all the models generated, the users may select only a few interesting models to be included in their applications.

The usage and maintenance phase is concerned with monitoring database updates and with the continued validation of patterns learned in the past. Even though the learning process may have user guidance, not all the knowledge models generated will have business applications. Only the interesting models are selected and applied when performing business tasks. Another important task in this stage of the life cycle is to continuously monitor the validity of the knowledge models in the context of changes to data in the warehouse. When the population in the warehouse shifts significantly, the previously learned models will no longer be applicable, and new models will have to be derived. We may also be able to learn new models incrementally from the new data.

1.2. Visual data mining

Visual data mining can be related to all the previously described sub-modules into which we have partitioned the overall data-mining task. The goal should be to provide a synthesis of visualization and data mining that enhances the effectiveness of the overall data mining process. Since this synthesis is rather new, there is very little work that covers both aspects. Visual data mining involves the invention of visual representations that could be applied in all three stages of the data-mining life cycle: data preparation, model derivation and validation. That concept also suggests partitioning visual data mining into three fields, each one targeted at producing visual representations that enhance the flow of information and knowledge through the corresponding data mining module.
Visual data mining in the field of data preparation could be defined as the attempt to enhance or carry out some of the pre-processing module's tasks in a visual manner. In general, that involves the visual manipulation of raw data according to the requirements posed by the following data mining stage of model derivation. By visual manipulation we mean the ability to handle, through visualization techniques, problems usually met in this stage, such as missing data fields, data transformations, sampling and pruning, and data discrepancies and inconsistencies. Such capabilities enable us to formulate accurate hypotheses and objectives in KDD, and to select carefully only the relevant and useful data to be sampled and extracted during data pre-processing.

Visual data mining in the field of model derivation implies that the model construction performed at this stage is specified by visual means. Selection of the training data set and model, definition of its parameters, specification of the training process and storage of the outcomes are the general tasks of this stage. Beyond that, in our view, a visual overview of the whole model derivation module is also to be recommended. That implies evaluation, monitoring and guidance of this data-mining module. Evaluation includes the validation of training samples, test samples and learned models against the data in the database, plus the appropriateness of data and learning algorithms for specific data-mining situations. Monitoring includes activities such as tracking the progress of the data-mining algorithms and evaluating the continued relevance of learned patterns in the context of database updates. Guidance includes activities such as user-initiated biasing or altering of inputs, learned patterns and other system decisions.

Visualizing a model should allow a user to discuss and explain the logic behind the model with colleagues, customers and other users. Getting buy-in on the logic or rationale is part of building users' trust in the modelling results [4]. If the user can understand what has been discovered, he/she will trust it and put it into use. A model that can be understood is a model that can be trusted. Unfortunately, users are often forced to trade off the accuracy of a model against its understandability. Advanced visualization techniques can greatly expand the range of models that can be understood by domain experts, thereby easing the accuracy/understandability trade-off [5].

Visual data mining in the validation stage could be defined as the graphical presentation of data, whether the data is base data, summary data, or mined outcomes extracted from data. This is a type of visual data analysis in which the analytic component is offloaded to human perception [6]. The basic objective of visual data mining in the validation stage is therefore to map as much as possible of the information hidden in the Data Space onto our Visualization Space, in a way that lets the user acquire as much information and knowledge as possible from that representation. That attempt involves a mapping from the amount of information available to the amount of information that can be visualized by our visual data mining techniques [7]. These notions define the aim of any proposed visual data mining model: to produce information-rich visualization outcomes that are easily perceived by humans.
Two main factors are emphasized by that statement. On the one hand, the visualization model should present as much information as possible; on the other hand, this representation should be done in such a way that the knowledge engineer can easily acquire that knowledge. The difficulty in producing new visualization models lies in balancing those two factors while also aiming to increase the amount and the quality of the knowledge extracted.

1.2.1. Importance of visual data mining

In 1854, while searching for ideas to bring to an end a cholera epidemic raging in London, Dr. J. Snow drew dots on a map of the neighbourhood at the locations of the recorded deaths. The map also showed the positions of the drinking water wells. The concentration of deaths near just one of the wells was visually striking. He had the handle of the suspect well changed and the epidemic stopped! Apparently, the disease was being transmitted by contact with the handle. This true story is widely considered an early success of visualization [8].

Nowadays visualization could be the link between the two most powerful information-processing systems: humans and the modern computer. Humans, unfortunately, have many limitations. In particular, we are quite limited in our ability to handle scale and are easily overwhelmed by the volumes of data that are now routinely collected. Data mining, an automated process, is a natural reduction technique that could complement human capabilities. Combining these two approaches for knowledge discovery is clearly a great idea [9]. Visualization could combine our tremendous pattern-recognition ability with the problem-solving processes of data mining. Transforming and presenting problems visually could provide new insights and pave the way to their solutions [8].

The idea is to bridge the differences in approaches and encourage research to help produce new methodologies. Data mining is primarily centred on computational number-crunching techniques, with minimal user involvement as the machine attempts to extract various features of the data. Data visualization, at the other extreme, has emphasized user interaction with, and manipulation of, graphical data representations for visual feature recognition and understanding. Each approach has its advantages and its major weaknesses. Whereas algorithms working in isolation can miss out on the "wisdom" that is readily available from human knowledge of the problem and the data, strictly manually guided approaches can easily cause users to lose their way in high-dimensional spaces. Between data mining and data visualization, the statistical camp employs user-applied numerical methods with standard graphical displays for visual interpretation. It is our belief that the joint efforts of those disciplines can provide breakthroughs in the most difficult analysis problems, along with helping to overcome hurdles within individual fields [9–11].

For all those reasons, we strongly believe that the contribution of visual data mining could be of essential importance in making the knowledge engineer part of the data mining process and in taking advantage of the human perceptual system. Our definition of visual data mining gives us the flexibility to apply that concept to all three stages of the data-mining life cycle.

1.3. Summary

In this work we are mostly interested in producing visual representations applied in the validation stage.
Our aim is to assist the knowledge engineer in acquiring enhanced knowledge and extracting valuable inferences through the visualization of outcomes produced by data mining processes. Section 2 investigates the problematic issues encountered during the exploitation of data mining outcomes; we then introduce our modelling techniques. Three models are presented for the visualization of association rules, each one providing a different perspective and level of detail over the visualized set of rules. Those models form a suite of visualization techniques that interactively combine their powers and lead to the derivation of inferences, as presented in the two evaluation scenarios (Sections 3.6.1 and 3.6.2). Following our research track, we introduce in Section 4 two visualization techniques for the representation of relevance analysis outcomes. As presented in the corresponding case studies, these techniques allow the visualization of large relevance analysis outcomes, enhancing the derivation of temporal inferences. Finally, for the visualization of classification outcomes, we introduce the 3D class-preserving projection technique and its application in Section 5.1.1. This technique enables us to produce 3D class-preserving projections that best discriminate among four class centroids.


2. Research focus

The most commonly used medium for visualization is the relatively small computer screen, where the total number of data items that can be mapped at one time is limited [7,12]. The same restriction holds for other representational media such as printed views or virtual worlds, where for additional reasons we face considerable representational limitations. Those restrictions become even tighter if we consider the pace of growth that characterizes today's datasets. Taking those facts into consideration, we need to find allies in our attempt to gain insight into our data. Those allies are the data mining algorithms and their knowledge-extracting powers. We suggest performing visual data mining not just on the raw data but also on the outcomes produced by the data mining algorithms. In our approach, the visual mining of the results produced by the algorithms acts as a secondary "fine-tuning" step or information abstraction step [13,14]. This seems useful because, in most cases, the results of data mining algorithms (association rules, relevance analysis outcomes, classifications, etc.) are in a form that is difficult to understand for humans, who are accustomed to perceiving information through their visual senses.

Currently, it is a challenging task for designers of visual data mining environments to find the strategies, methods and corresponding tools to visualize a particular type of information. The graphical presentation should be simple enough to be easily understood, but complete enough to reveal all the information present in the model. This is a difficult balance because simplicity usually trades off against completeness [4]. It is not obvious how to effectively visualize the results of mining large amounts of data in N dimensions, where N can be as large as 1200 [9].
Visualization has a number of dimensions to be measured and is highly dependent on the user, the task and the structure of the data. It is difficult to disentangle these factors in order to identify an optimal method [15]. Each type of data mining outcome has its own characteristics regarding its comprehension. Investigating the specific problems that the knowledge engineer encounters in his/her attempt to exploit the mining outcomes will help us to be more precise in our research focus and in the resulting visualization suggestions, as well as to justify our interest in their application.

2.1. Association rules

Mining for association rules, as a central task of data mining, has been studied extensively by many researchers. Much of the existing research, however, is focused on how to generate rules efficiently. Limited work has been done on how to help the user understand and use the discovered rules. In real-life applications, though, the knowledge engineer first wants to have a good understanding of a set of rules before trusting them and using the mining outcomes. Investigation and comprehension of rules is a critical prerequisite for their application.

Those issues become even more pressing if we consider several other outstanding problems in the field of mining for association rules. In brief, the first difficulty is the "large resulting rule set" problem. A rule-mining algorithm can easily generate a large number of rules that cannot be handled by human users. Some methods, such as rule pruning and interesting rule selection, have been proposed to deal with this problem. The second issue is the "hard to understand" problem. Rules produced by rule generation algorithms are often hard for human users to understand. The third problem is the "rule behaviour" problem. In real-life environments, data change over time. Rules mined in the past may not be valid in the future. As a consequence, each rule has a behaviour history over time [14].

In general, the mining of associations in a database creates rules that syntactically could be expressed as

IF List⟨Condition⟩ THEN List⟨Result⟩    Support (%)    Confidence (%)

⟨Condition⟩ ⇒
    lower limit < numerical attribute <= upper limit
    numerical attribute <= upper limit
    lower limit < numerical attribute
    categorical attribute IN {sub-set list}

⟨Result⟩ ⇒
    lower limit < numerical attribute <= upper limit
    categorical attribute = categorical value

To clarify the syntactic formalization: List⟨Condition⟩ is a set of conditions upon the values of several attributes. A condition can be either a numerical attribute in a range with the upper limit, the lower limit or both limits specified, or a categorical attribute taking values from a finite set. In the same context, but less flexibly, List⟨Result⟩ can again be any combination of specific conditions. Those conditions, though, can only be either an equality condition on one categorical attribute or a numerical attribute in a range with both upper and lower limits specified.

Simplifying also the mathematical definitions of support and confidence, we can say that the support of each rule is the percentage of tuples in the whole database that satisfy the rule's left-hand side clause (IF expression), and the confidence is the percentage of tuples from that set which also satisfy the rule's right-hand side clause (THEN expression). Aiming at a general and concise mathematical formalization, the notation that we suggest for the purposes of our study is

\[ \text{IF } [\ldots (nVal_{i1} \le nAttr_i \le nVal_{i2}) \ldots (cAttr_j \text{ IN } \{cVal_{j1}, \ldots, cVal_{jN}\})] \ \text{THEN } [\ldots (nVal_{k1} \le nAttr_k \le nVal_{k2}) \ldots (cAttr_m = cVal_m)] \quad s\%,\ c\% \]

which can be further generalized to

\[ \text{IF } \{\ldots\ Attr_i \text{ IN } [Val_{i1}, Val_{i2} \ldots], \ldots\} \ \text{THEN } \{\ldots\ Attr_k \text{ IN } [Val_{k1}, Val_{k2} \ldots], \ldots\} \quad s\%,\ c\%. \]
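The rule notation and the simplified support and confidence definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names `NumCond` and `CatCond`, the helper `support_confidence` and the sample rows are hypothetical and do not come from the paper, and both numerical limits are treated as inclusive, as in the formal notation.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class NumCond:
    """Numerical sub-expression nVal1 <= nAttr <= nVal2.

    A missing limit defaults to -inf or +inf (hypothetical helper class).
    """
    attr: str
    low: float = -math.inf
    high: float = math.inf

    def holds(self, row):
        return self.low <= row[self.attr] <= self.high


@dataclass(frozen=True)
class CatCond:
    """Categorical sub-expression cAttr IN {cVal1, ..., cValN}."""
    attr: str
    values: frozenset

    def holds(self, row):
        return row[self.attr] in self.values


def satisfies(row, conditions):
    """A clause is satisfied when every one of its sub-expressions holds."""
    return all(cond.holds(row) for cond in conditions)


def support_confidence(rows, lhs, rhs):
    """Support: % of all tuples satisfying the IF clause.
    Confidence: % of those tuples that also satisfy the THEN clause."""
    if not rows:
        return 0.0, 0.0
    matched = [row for row in rows if satisfies(row, lhs)]
    support = 100.0 * len(matched) / len(rows)
    hits = sum(1 for row in matched if satisfies(row, rhs))
    confidence = 100.0 * hits / len(matched) if matched else 0.0
    return support, confidence


# Made-up sample data, for illustration only.
rows = [
    {"age": 25, "region": "north", "buys": "yes"},
    {"age": 35, "region": "north", "buys": "no"},
    {"age": 45, "region": "south", "buys": "yes"},
    {"age": 28, "region": "east", "buys": "yes"},
]
# IF (20 <= age <= 40) AND (region IN {north, east}) THEN (buys = yes)
lhs = [NumCond("age", 20, 40), CatCond("region", frozenset({"north", "east"}))]
rhs = [CatCond("buys", frozenset({"yes"}))]
support, confidence = support_confidence(rows, lhs, rhs)
print(f"s = {support:.1f}%, c = {confidence:.1f}%")  # s = 75.0%, c = 66.7%
```

Representing an equality result such as buys = yes as a one-element set keeps a single condition type for both clauses, which mirrors the generalized notation.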

The prefixes "n" and "c" denote numerical and categorical attributes and values. Both the IF and THEN clauses are finite sets of conditions, as is any set of categorical values in a categorical sub-expression. Numerical ranges with no lower or upper limit specified can be produced by setting the corresponding $nVal_{i1}$ or $nVal_{i2}$ value equal to $-\infty$ or $+\infty$. Similarly, sub-expressions of equality arise when $nVal_{i1}$ and $nVal_{i2}$ are equal:

\[ \{nVal_{i1} = nVal_{i2},\ (nVal_{i1} \le nAttr_i \le nVal_{i2})\} \Rightarrow nAttr_i = nVal_{i1}. \]

The generalized notation can also produce all the possible syntactic forms of an association rule by assigning the appropriate numerical or categorical values to $Val_{i1}$ or $Val_{i2}$.

2.2. Relevance analysis

Relevance analysis knowledge is qualitative, and it is quite useful when mining large databases that hold information about many objects. The content of the output produced by relevance analysis, though, is not as complex as in the case of association rules. That makes the knowledge engineer's effort easier. Several techniques have been used for the visualization of relevance analysis outcomes, each with its advantages, drawbacks and limitations; the most important limitation is that the quality of the representation decreases as the number of relevant attributes increases. When the number of attributes is large, the resulting visualization loses its basic characteristics and becomes a fuzzy mapping that confuses the analyst. Most of those techniques have also been targeted at the visualization of relevance analysis outcomes at a specific time point.

Our aim, therefore, was to produce a representation as simple as the content of relevance analysis itself, one that behaves robustly as the number of examined relevant attributes increases. It should be capable of dealing with a large number of attributes without being overwhelmed by fuzzy characteristics. Additionally, we paid particular attention to the temporal aspect of the relevance analysis outcomes. On the one hand, that is because of the importance of the time factor; on the other hand, it is due to the little research and applied work that has been done in this field.

When we examine the relevance of a specific attribute to a set of other fields, the output that is produced by relevance analysis techniques is, or can be transformed into, the form:

Examined Attribute    Relevant to Target Attribute    Uncertainty Coefficient
ExmAttr               List⟨TrgAttr⟩                   List⟨C%⟩

In other words, the uncertainty coefficient for each target attribute is obtained with respect to the examined attribute, producing a list of relevance factors, one for each target attribute. That syntactic representation can be transformed into the following mathematical formalization:

\[ ExmAttr : (TrgAttr_1, c_1\%) \ldots (TrgAttr_N, c_N\%). \]
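As an illustration of how such a list could be produced, the sketch below computes an entropy-based uncertainty coefficient for categorical columns. It is a minimal sketch under assumptions: the text does not state which formula or direction is used, so the function takes the common definition U(target | examined) = (H(target) - H(target | examined)) / H(target); the names `uncertainty_coefficient` and `relevance_list` and the toy table are hypothetical.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (natural log) of a sequence of categorical values."""
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in Counter(values).values())


def uncertainty_coefficient(target, examined):
    """Assumed definition: share of the target's entropy that is removed
    once the examined attribute is known (0 = irrelevant, 1 = determined)."""
    h_target = entropy(target)
    if h_target == 0.0:
        return 0.0  # convention chosen here: a constant target has nothing to explain
    groups = {}
    for t, e in zip(target, examined):
        groups.setdefault(e, []).append(t)
    n = len(target)
    h_conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return (h_target - h_conditional) / h_target


def relevance_list(table, examined, targets):
    """Return [(TrgAttr_1, c_1%), ..., (TrgAttr_N, c_N%)] for one examined attribute."""
    return [
        (t, 100.0 * uncertainty_coefficient(table[t], table[examined]))
        for t in targets
    ]


# Toy column-oriented table, for illustration only.
table = {
    "A": [0, 0, 1, 1],
    "B": [0, 0, 1, 1],  # fully determined by A
    "C": [0, 1, 0, 1],  # independent of A
}
outcome = relevance_list(table, "A", ["B", "C"])
print("A :", outcome)
```

The returned list has the same shape as the formalization above, so it could feed a visualization directly; numerical attributes would first need to be discretized.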

2.3. Classification

Classification is a primary method for machine learning and data mining. It is used either as a stand-alone tool to get insight into the distribution of a data set, e.g. to focus further analysis and data processing, or as a pre-processing step for other algorithms operating on the detected clusters. The main questions that the knowledge engineer usually has in his/her attempt to understand the classification outcomes are:

* How well separated are the different classes?
* What classes are similar or dissimilar to each other?
* What kind of surface separates various classes (i.e. are the classes linearly separable)?
* How coherent or well formed is a given class?

Those questions are difficult to answer by applying conventional statistical methods to the raw data produced by the classification algorithm. Unless the user is supported by a visual representation that acts as his/her navigational tool in the n-dimensional classified world, drawing inferences will be a tedious task. Our main aim therefore should be to visually represent and understand the spatial relationships between various classes in order to answer questions such as those mentioned above. Answers to these questions can enable the data analyst to infer inter-class relationships that may not be part of the given classification and, additionally, to gauge the quality of the classification and the quality of the feature space. Discovery of interesting class relationships in such a visual examination can help in the design of a better classifier and also lead to enhanced feature selection. Such an analysis would be useful while training a classifier in a "pre-classification" phase, or in evaluating the quality of clusters in a "post-clustering" phase.

In general, a simple mathematical formalization that we could use to represent that a tuple $t_i$, with attributes $t_{i1}, t_{i2}, \ldots, t_{iN}$, belongs to class $c_i$ is as follows:

\[ t_{i1} = v_{i1},\ t_{i2} = v_{i2},\ \ldots,\ t_{iN} = v_{iN},\ t_{iN+1} = c_i. \]

The $t_{iN+1}$ field is the classifying attribute, and each $v_{ij}$, $j = 1, \ldots, N$, is the numerical or categorical value of the corresponding field.
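To make the tuple formalization concrete, the following sketch stores each classified tuple as (v_1, ..., v_N, c) and computes class centroids and their pairwise distances. It is only a crude numerical counterpart to the first two questions above, assuming purely numerical attributes; it is not the projection technique proposed in this paper, and the function names and sample tuples are hypothetical.

```python
import math
from collections import defaultdict


def class_centroids(tuples):
    """Group tuples (v_1, ..., v_N, c) by class c and average each attribute."""
    groups = defaultdict(list)
    for *values, label in tuples:
        groups[label].append(values)
    return {
        label: [sum(column) / len(rows) for column in zip(*rows)]
        for label, rows in groups.items()
    }


def centroid_distances(centroids):
    """Euclidean distance between every pair of class centroids."""
    labels = sorted(centroids)
    return {
        (a, b): math.dist(centroids[a], centroids[b])
        for i, a in enumerate(labels)
        for b in labels[i + 1:]
    }


# Made-up classified tuples with N = 2 numerical attributes.
tuples = [(0, 0, "a"), (2, 0, "a"), (1, 4, "b"), (1, 6, "b")]
centroids = class_centroids(tuples)
distances = centroid_distances(centroids)
print(centroids)
print(distances)
```

Large centroid distances hint at well-separated classes, but they say nothing about the shape of the separating surface or the coherence of a class, which is where a visual representation is needed.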

3. Visual representation of association rules

In this section, we propose three visual data mining models for the visualization of outcomes produced when mining for association rules. The proposed visualization techniques are based on abstract modelling concepts applied in the field of data mining, which justifies referring to them as visual data mining models. Addressing the problems of this field, we aim to help the knowledge extractor visually analyze and understand a single rule or a set of rules, along with their changing behaviours. We believe that using representations such as charts, scatter-plots and simple graphs is one promising way to enhance the visual discovery of knowledge, as long as our representations remain simple. Extensive work has been done on those types of representation, which can constitute our basic modelling infrastructure.


3.1. Bar chart form visual data mining model for association rules

Bar charts have been used in numerous scientific and commercial applications, and their usefulness is well established. Because they readily reveal the notions represented, they are easily perceived by analysts accustomed to such graphs and are a good starting point for our research effort. Based on the principal ideas of bar chart representations, we came up with the bar chart form visual data-mining model for association rules.

The core idea of our modelling proposal is that each sub-expression of a rule should be visualized by a bar in the chart. By sub-expression we mean any condition that is part of the overall expression of a rule's left- or right-hand side clause. Based on the previously defined mathematical formalization, a sub-expression can have one of the following syntactic forms, depending on the type of its attribute (numerical or categorical):

\[ (nVal_{i1} \le nAttr_i \le nVal_{i2}) \quad \text{or} \quad (cAttr_j \text{ IN } \{cVal_{j1} \ldots cVal_{jN_j}\}) \]

The items composing each sub-expression are represented by the graphical characteristics of the bar. The bar's length, depth, colour and position are features that are used to assist and enhance our representation. An interesting property of our modelling proposal is that the visual representation is constructed in the same way as the rule itself: just as the assembly of sub-expressions constitutes a rule, the set of bars representing those sub-expressions forms our bar chart model, visualizing the overall rule. To define our model in detail, the following sub-sections examine specific cases depending on the type of the rule to be visualized.

3.1.1. Bar chart form model and numerical sub-expressions

Introducing our model gradually, we first examine the case in which a rule's left- and right-hand side clauses are composed of numerical sub-expressions.
For this type of rule, our modelling scheme implies that each sub-expression's characteristics are mapped to the length and position of the corresponding bar in the chart. Following that idea, on the horizontal axis (X-axis) of the chart we map the names of the numerical attributes participating in the rule. An empty place separates the bars of the rule's left-hand side clause from those of its right-hand side clause. The ordering of sub-expressions, and consequently of bars, may be specified by the user or by any automatic method (e.g. ordering by how many tuples in the dataset satisfy the sub-expression's condition). The vertical axis is enumerated and scaled to map the condition limits of all the sub-expressions in the rule. According to our modelling suggestion, a sub-expression's numerical range is represented by the range that the bar occupies on the vertical axis. In detail, the bar's starting point is the value of the lower limit in the sub-expression's condition and, analogously, the bar's end point is the value of the upper limit in the sub-expression's condition. At the boundaries of the vertical axis we have also assigned the infinite values $-\infty$ and $+\infty$. Such an approach gives us the flexibility of visualizing in a uniform manner sub-expression ranges with either their upper or lower limit unspecified. In sub-expressions of equality $(nAttr_i = nVal_i)$, the bar degenerates to a line segment. An example of our visual data-mining modelling proposal for association rules with numerical sub-expressions is presented in Fig. 2.

Fig. 2. Bar chart form model and numerical sub-expressions.

Expanding our model further, we suggest the use of colour. Instead of having single-coloured bars whose position represents the characteristics of the related sub-expression, we can enhance perception by colouring each bar with a smooth gradient that indicates the extent of the corresponding sub-expression's range. That can be achieved by mapping, through normalization, the range of each sub-expression's condition to the colour range of that specific bar. Colour interpolation filters are considered appropriate for that aim. From our research, we suggest the PBC colour scale (colour scale for Perception-Based Classification) [7], based on the HSI colour model. An example representation according to the proposed visual data-mining model, using the features described, is presented in Fig. 23.

3.1.2. Including categorical sub-expressions in the bar chart form model

To make our modelling suggestion more flexible, we now pose no restrictions on the rule's type. To include categorical attributes in our model, we suggest that each categorical sub-expression's bar be divided into smaller, differently coloured segments, each one representing a categorical value. Assigning a single colour to each categorical value helps to discriminate between values, as well as to identify identical ones independently of the sub-expression to which they belong. As a categorical value might participate in different sub-expressions, the single-colour-per-value method enhances our ability to identify its presence wherever we see segments of the same colour.
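The numerical mapping of Section 3.1.1 can be sketched as a small geometry helper. This is a hedged sketch, not the paper's implementation: `numeric_bar` and `colour_position` are hypothetical names, the axis bounds stand in for the $-\infty$ and $+\infty$ marks at the boundaries of the vertical axis, and the lookup into an actual colour scale such as PBC is left out.

```python
import math


def numeric_bar(low, high, axis_min, axis_max):
    """Return (bottom, height) of the bar for low <= attribute <= high.

    An unspecified limit (-inf or +inf) extends the bar to the axis boundary;
    an equality condition (low == high) yields a zero-height line segment.
    """
    bottom = axis_min if low == -math.inf else low
    top = axis_max if high == math.inf else high
    return bottom, top - bottom


def colour_position(value, bottom, height):
    """Normalise a value inside a bar to [0, 1], e.g. for a colour-scale lookup."""
    if height == 0:
        return 0.0
    return (value - bottom) / height


# Hypothetical rule on a vertical axis scaled from 0 to 100:
# IF (20 <= age <= 40) AND (score <= 30) THEN (level = 2)
bars = {
    "age": numeric_bar(20, 40, 0, 100),
    "score": numeric_bar(-math.inf, 30, 0, 100),
    "level": numeric_bar(2, 2, 0, 100),
}
print(bars)
print(colour_position(30, *bars["age"]))  # midpoint of the age bar
```

A plotting library could draw each entry as a floating bar (a bar with a non-zero bottom), leaving one empty slot between the left- and right-hand side bars.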
This type of modelling and coding option was preferred because it extends uniformly to include categorical sub-expressions in our representation. The representation chosen in our visual data-mining model, as well as the colouring scheme, produces a bar chart form in which numerical and categorical bars are easily distinguishable. Based on the same underlying representational notions we can easily distinguish between numerical and categorical sub-expressions,


Fig. 3. Bar chart form model and categorical sub-expressions.

as shown in the example of Fig. 3. The size of the categorical segment areas was chosen to be the same for all categorical values, as there was no indication for choosing differently. Options for case-specific representation, such as mapping high–medium–low categorical values to bright–normal–dark colours or to big–medium–small segment sizes, are also supported in our model, although they depend highly on each case study. The length of the bar corresponding to a categorical sub-expression is proportional to the number of categorical values in that sub-expression. This harmonized notion of proportional mapping is preserved in the representation of both numerical and categorical sub-expressions. Thus, we can argue that in all cases the size of the bar is proportional to the sub-expression's “norm”, indicating the condition's range or the number of categorical values.

3.1.3. Support and confidence in the bar chart form model

A rule's support and confidence factors are of vital importance, as they indicate the strength of the rule in the mined data set. According to our suggestions, we have two options for the representation of this information. In the first option, those two factors are presented in the form of bars. Two additional bars at the rightmost place of the chart, separated by an empty column from the bars representing sub-expressions, indicate the rule's support and confidence, respectively. In the second option, aiming at a compact representation of a rule, we map the information regarding the rule's support and confidence to the background of the corresponding visualization view. A coloured background that follows a filling pattern provides the flexibility of a different mapping. Under this modelling option the background colour indicates the level of support and the chosen filling pattern the confidence.
Again, colour and pattern mapping conventions have been investigated and chosen depending on what would best reveal the underlying information. As in Fig. 24, we have chosen that a rule's support should be mapped proportionally to the brightness of the background's grey scale, and the confidence to the density (thickness) of the grid-line pattern. Under this modelling option, a rule with high strength factors has a bright background with an intense overlapping filling pattern. Such an


approach would have a strong influence on the overall representation as well as on the way the user perceives the coded information. Another advantage of this alternative representation is that it decreases the screen area required for the visualization. This modelling option can be particularly desirable when multiple rules are visualized on a single screen, as the compact form is considered more adequate. Specific examples and case studies are provided in the evaluation, Section 3.6.

3.1.4. Bar chart form model and evolution over time

We are particularly interested in taking the time aspect into consideration in the construction of the proposed models. That aim stems from the importance of the time factor in the mining procedure as well as in the knowledge extracted. When the data set is time-stamped, it is in the interest of the knowledge extractor to take that additional information into consideration and to extract time-stamped inferences as well. Understanding of those outcomes can be enhanced by modelling techniques that deal adequately with the temporal factor. As the importance of temporal data mining is increasing, visual data mining techniques have to conform to that trend and, to be complete, should be able to represent the evolution over time of the temporal outcomes produced. Based on those notions we follow two approaches for the visualization of time-stamped association rules: either animation over the existing static model, or the construction of a 3D world with multiple replicas of the static model. Each of those static models corresponds to a specific time point, depending on its position on the time axis (usually the depth axis). In the first case, as time passes, the animated growth or shrinkage of the bars and their changing position represent the change and evolution over time of that specific rule.
Additionally, all other features utilized in our representation (colour, depth, segment areas, etc.) evolve accordingly, following the changes over time of the corresponding rule. That animated representation, performed at a speed that the user considers appropriate for his/her knowledge extraction task, provides insight into the time factor of the outcomes visualized, revealing the underlying temporal knowledge. In the second approach, as we navigate in the 3D space, future or past states of the rule are revealed with respect to our current time point. In detail, the plane defined by the X and Z axes of our coordinate system, as well as any other plane parallel to it, provides the basic environment for a single rule's representation at a specific time point. The plane parallel to the X and Z axes that intersects the time axis at a specific time point is our projection framework for visualizing the corresponding rule at that time point. Both models are considered adequate for the mining of time-stamped association rules. The animated bar chart form model provides a simple single-screen representation with minor requirements on the user's behalf. The knowledge extractor only has to adjust the animation speed according to the depth of insight required and to the learning rate that he/she is capable of following. In an analogous way, the 3D bar chart form model provides the same potential on


mining a rule's history. By zooming in and out of the 3D representation constructed, we can acquire either an abstract overview or a detailed inspection of the rule's evolution.

3.2. Visualizing a set of association rules

Our research objective is to produce visual representations of a set of rules in a single view according to the model defined. Up to this point, we have aimed at providing insight into a single rule through its visualization, in order to make the underlying knowledge understood, trusted and usable. In real-life problems, though, we usually deal with a set of association rules, and this new perspective poses additional requirements. The knowledge extractor in such cases seeks inferences regarding either the whole set of rules or a subset of it. That information should be extractable by focusing our attention either on the overall representation or on a sub-area. Those concepts led us to the construction of a compact view in which the concise representation of each rule is integrated. Each chart representation, considerably reduced to its basic characteristics, reveals the abstract form of the rule. With that guideline in mind, our attempt was to represent as many rules as possible in a single view while maintaining the visualization principles that we already had. Two main issues arise with such an attempt. On the one hand, we should define the reduced representation of a rule; on the other hand, we should define how those abstract representations are placed. Our intention is to produce a smart placement of concise rule representations that will enhance human perception abilities and the mining effort.

3.3. Similarity arrangement of association rules

The problem of defining a smart placement of rules in a compact visualization is not a simple one.
Relevant issues have been investigated in the field regarding the ordering and placement of attributes in multidimensional data visualizations, as in the case of attribute ordering in the parallel coordinates model [8], but no general solution exists, as the answers proposed are quite case dependent. In our attempt to provide a solid suggestion for the problem specified and to produce an advanced placement of association rules, we first define the similarity factor between two rules. Based on that measure we find the similarity factor among all rules to be visualized, and then we indicate a smart pattern for the placement of those rules in order to meet the newly posed requirements. Moving one step at a time, we first define the attributes participation table. That is an $n \times m$ matrix, where $n$ is the number of rules to be visualized and $m$ the number of distinct attributes participating in any sub-expression of that set. Each rule is assigned to a row and each distinct attribute name to a column. The table element $p_{ij}$, if equal to one, designates the participation of the corresponding attribute $j$ in the


$i$th rule ($p_{ij}$ equal to 0 elsewhere):

$$
\begin{bmatrix} p_{11} & \cdots & p_{1m} \\ \vdots & \ddots & \vdots \\ p_{n1} & \cdots & p_{nm} \end{bmatrix},
\quad \text{where } p_{ij} =
\begin{cases} 1, & attr_j \in rule_i, \\ 0, & attr_j \notin rule_i. \end{cases}
$$

Before the definition of the similarity factor, for clarification we first specify the term dimension ($\|\ldots\|$) of a rule. As each attribute participating in a rule adds a dimension to our data space, we define the dimension of an association rule as the number of attributes participating in that rule, in either the left- or the right-hand clause (i.e. $\|\text{if } (a\,\#\,\ldots)(b\,\#\,\ldots)(c\,\#\,\ldots) \text{ then } (d\,\#\,\ldots)\| = \|\text{if } (k\,\#\,\ldots)(l\,\#\,\ldots)(m\,\#\,\ldots)(n\,\#\,\ldots)\| = 4$). Based on the attributes participation table we may now define the rules similarity table as

$$
\begin{bmatrix} s_{11} & \cdots & s_{1n} \\ \vdots & \ddots & \vdots \\ s_{n1} & \cdots & s_{nn} \end{bmatrix},
\quad \text{where } s_{ij} = s_{ji} = \frac{2 \sum_{k=1}^{m} (p_{ik}\, p_{jk})}{\|r_i\| + \|r_j\|}.
$$

In this table, all rules to be visualized are assigned to both the columns and the rows in the same ordering, which is random and irrelevant to the following steps. Table element $s_{ij}$ specifies the similarity factor between rules $i$ and $j$. Dividing by the sum of the two rules' dimensions makes the measure invariant to the sizes of the rules compared, providing an accurate indication of similarity. Values close to 1 indicate high similarity among rules, and small values, close to zero, indicate low similarity. The resulting table is square and symmetric. Having constructed the similarity table we may proceed to defining the rules placement. Our intention is to define a concise view in which each rule is surrounded by its most similar rules of the set to be visualized, in order to produce sub-areas in the representation that might be of interest to the knowledge engineer. The strategy that we follow is based on the similarity table. We first suggest a linear ordering of the rules, which is later transformed for our purposes into a 2D rules placement by the utilization of a smart filling pattern. For our case of representation, that produces neighbourhoods of similar rules.
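The two tables defined above can be computed directly from the attribute sets of the rules. The following is a minimal sketch under the assumption that each rule is given simply as the set of attribute names appearing in its sub-expressions; the function names are hypothetical and chosen for this illustration.

```python
from typing import List, Sequence, Set


def participation_table(rules: Sequence[Set[str]],
                        attributes: Sequence[str]) -> List[List[int]]:
    """Build the n x m table: entry [i][j] is 1 if attribute j appears in rule i."""
    return [[1 if attr in rule else 0 for attr in attributes] for rule in rules]


def similarity_table(p: Sequence[Sequence[int]]) -> List[List[float]]:
    """Build the n x n table s[i][j] = 2 * sum_k(p[i][k] * p[j][k]) / (||r_i|| + ||r_j||).

    The dimension ||r_i|| of a rule is the number of attributes it contains,
    i.e. the sum of row i. Every rule is assumed to have at least one attribute.
    """
    dims = [sum(row) for row in p]
    n = len(p)
    s = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            shared = sum(a * b for a, b in zip(p[i], p[j]))
            s[i][j] = 2 * shared / (dims[i] + dims[j])
    return s


# Three illustrative rules over five attributes.
rules = [{"a", "b", "c", "d"}, {"a", "b"}, {"x"}]
attributes = sorted(set().union(*rules))
sim = similarity_table(participation_table(rules, attributes))
```

In this example the first two rules share two attributes and have dimensions 4 and 2, so their similarity is 2·2/(4+2) = 2/3, while each rule has similarity 1 with itself and the third rule has similarity 0 with the other two.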
That step-by-step approach maintains the flexibility to propose analogous suggestions for similar problems, as the same notions under a different filling pattern could produce a 1D or 3D rules placement, providing a concise view even for different modelling suggestions (Fig. 4). The attempt starts by finding a linear sequence of rules. Our suggestion is that the ordering procedure should define a sequence in which the two most similar rules come first, then the rule most similar to the second one follows, and so on. In other words, in that linear placement each rule should have on its right its most similar rule of the set according to the similarity measure defined. The starting point of the sequence is the two rules that have the highest similarity factor. Having found the highest value in the table, we continue searching in the similarity table for the rule most similar to the second one, again comparing the factors among them. In case of equal similarity values, resulting in an ordering conflict, we also take into


Fig. 4. Smart 2D placement.

consideration the support and confidence of each rule, giving priority to the highest factor. Continuing with the arrangement of rules in the visualization area, the placement that we suggest is a recursive pattern based on a generic scheme. It is a simple back-and-forth arrangement, proceeding perpendicular to the diagonal axis of a grid form. First, the elements are placed from down-left to up-right until we meet the borders of the grid form, then below backwards from up-right to down-left, then again forward as before, and so on. As we proceed towards the centre of the grid, the length of the route increases, and it decreases again after passing it. The visualization area is a square grid form of dimension $d$ ($d \in \mathbb{N}$). The size of the grid is the smallest one capable of encapsulating all rule representations. In other words, the condition $(d-1)^2 < n \le d^2$ should hold, where $d$ is the grid's dimension and $n$ the size of the set to be visualized. This placement, as will be seen in our case studies, provides a semantically meaningful arrangement of closely related rules according to our similarity criterion, with nice clustering properties. That kind of trace walks through our visualization area in a frugal way, producing clusters of similar rules, accordingly mapped to the visualization area. With such an arrangement, the visual mining attempt can be targeted at sections of the compact representation that we have arranged to consist of similar rules. Instead of trying to visually pick out patterns of interest in the whole view, which would be like searching for a needle in a haystack, the knowledge extractor searches for patterns in the pre-constructed clusters. Following this approach we have managed to construct neighbourhoods of mining interest, where the user is able to query for inferences. A detailed illustration of the characteristics of this approach, applicable to real-life case studies, is given in Section 5. In Fig. 5 we give an indicative example, in order to present an early view of the method's behaviour. In this example, the visualization of rules derived from the echocardiogram [16] data set, we can easily distinguish the three clusters of similar rules at the upper-left, middle and down-right sub-areas of the view. The reduced representation of the rules in this example was chosen to be just the core of the modelling suggestion, with no legend or axis labelling. By customizing the properties of the view, we may reveal or conceal that information.
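The ordering and placement procedure described in this section can be sketched as below. This is one possible reading, not the authors' implementation: the greedy ordering appends at each step the not-yet-placed rule most similar to the last placed one, ties are broken by comparing hypothetical `(support, confidence)` pairs, and the back-and-forth pattern is interpreted as a zigzag over the anti-diagonals of the grid starting at the upper-left cell (row 0 is the top row).

```python
import math
from typing import List, Sequence, Tuple


def linear_order(sim: Sequence[Sequence[float]],
                 strength: Sequence[Tuple[float, float]]) -> List[int]:
    """Greedy linear ordering of rule indices based on the similarity table."""
    n = len(sim)
    if n < 2:
        return list(range(n))
    # Start with the pair of rules that has the highest similarity factor.
    first, second = max(((a, b) for a in range(n) for b in range(a + 1, n)),
                        key=lambda ab: sim[ab[0]][ab[1]])
    order = [first, second]
    remaining = set(range(n)) - {first, second}
    while remaining:
        last = order[-1]
        # Most similar remaining rule; ties go to the higher (support, confidence).
        nxt = max(sorted(remaining), key=lambda r: (sim[last][r], strength[r]))
        order.append(nxt)
        remaining.remove(nxt)
    return order


def grid_dimension(n: int) -> int:
    """Smallest d such that (d - 1)**2 < n <= d**2, for n >= 1."""
    return math.isqrt(n - 1) + 1


def zigzag_cells(d: int) -> List[Tuple[int, int]]:
    """Visit all (row, col) cells of a d x d grid along alternating anti-diagonals."""
    cells = []
    for s in range(2 * d - 1):
        rows = range(max(0, s - d + 1), min(s, d - 1) + 1)
        diagonal = [(r, s - r) for r in rows]      # up-right to down-left
        if s % 2 == 1:
            diagonal.reverse()                     # down-left to up-right
        cells.extend(diagonal)
    return cells


def place(order: Sequence[int]) -> dict:
    """Map each grid cell on the zigzag route to the rule placed there."""
    d = grid_dimension(len(order))
    return dict(zip(zigzag_cells(d), order))
```

Because consecutive rules in the linear order end up on the same or adjacent anti-diagonals, rules that are similar tend to fall into neighbouring cells, which is the clustering behaviour the text attributes to the placement.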


Fig. 5. Bar chart form model and rules placement.

3.4. Grid form visual data mining model for association rules

In our attempt to provide a different perspective on the visualization of a set of rules, we came up with the grid form visual data-mining model. Our intention was to provide a view of a set of rules under the same denominator, in order to make their comparison easier. For that reason this model was oriented from the very beginning towards visualizing a collection of rules and providing an overall comparable view of that set. That of course has several advantages and disadvantages in matters such as the level of detail or the number of rules visualized. The approach that we have followed in order to create an abstract representation of a set of association rules is based on a cluster of cells forming a grid. Named after its basic framework, the grid form visual data-mining model has each column corresponding to an attribute and each row representing a rule. The crossing area of an attribute's column and a rule's row, which forms a cell of the grid, might correspond to a sub-expression. That is specified by the existence of a bar in the cell as well as by its colouring scheme. In detail, the existence of a bar in a cell indicates the participation of the corresponding field in a sub-expression of the rule that has been related


to that row. The condition's characteristics are coded in the properties of the bar. A coding approach similar to that of the bar chart form model has been adopted for each bar in order to represent the detailed information of the condition, as can be seen in Fig. 6. Similarly, in order to be able to represent infinite sub-expressions, we have assigned the infinite values $-\infty$ and $+\infty$ to the left and right boundaries of each cell. Additionally, close to the boundaries of each cell we have assigned the normalized minimum and maximum values of that attribute. Again, the length and position of the bar, as well as colour interpolation filters, are utilized for coding the ranges in numerical sub-expressions. In the case of categorical sub-expressions, coloured bar segments corresponding to categorical values code the specific details. In the example of Fig. 6, the bars within the corresponding grid cells indicate a numerical sub-expression with both upper and lower limits specified and a categorical sub-expression with three categorical values. As we have already mentioned, the concept of ordering rules according to our similarity criterion may also be applied in the case of the grid form model. Following the same procedure, we construct the rules similarity table, from which we extract the sequence of rules. In the final step, though, as the placement requirements are one-dimensional, the constructed ordering also defines the placement of the rules. In other words, the rules ordered according to the similarity criterion, defining the corresponding linear sequence, are mapped to the vertical axis of the grid form. An example of a set of rules ordered according to our similarity criterion and visualized based on the bar chart form model is presented in Fig. 25. To address matters such as evolution over time, we suggest either the time-oriented approach of animation or the construction of 3D worlds.
In the first case, as we have already seen, the animated shrinkage and growth of the length of the bars, as well as their changing position and colour, reveal the evolution over time of the rules visualized. In the latter case, as an alternative tactic is considered more adequate, we construct the 3D world differently from the approach followed for the previous model. According to the alternative suggestion, each grid form is placed in such a way that it forms a side of a perspective wall in 3D. Each side of the wall represents the set of rules visualized at a specific time point. The n sides of the

Fig. 6. Colour mapping on the grid form model.


perspective wall reveal the overall evolution over time of the set of rules in the time period defined by the chosen n time points. By rotating the perspective wall clockwise or anti-clockwise, the past and future behaviour of a set of rules is revealed with respect to our current time point. An indicative example is presented in Fig. 7. By reducing the dimensions of the perspective wall, the flexible and adaptive properties of this model make it applicable to the visualization of the evolution over time of a single rule. In a 2D grid form all rows can represent a single rule at several time points. In this approach, the vertical axis of our coordinate system is time-stamped, defining the time point related to each row of the grid. The representation of the rule at each time point is visualized in the corresponding row of the grid. Such an alternative perspective provides the time-oriented context of a rule in a single view. Knowledge regarding the time-oriented behaviour of the rule with respect to each specific attribute is revealed when we focus on the columns of the grid, moving our inspection vertically. The grid form visualization model is considered adequate for producing compact, concise and abstract representations of a set of association rules. Abstract knowledge can be extracted by seeking patterns in the colour and sequence of cubes, or any other distinguishable features, in the view constructed. Furthermore, the overall

Fig. 7. Grid form model—perspective wall.


construction along with the placement makes the grouping of attributes easier. As a consequence, their enhanced examination results in extracting detailed inferences over either static or time-oriented sets of rules. We believe that this model provides an alternative perspective on a set of rules, bringing us one step closer to creating a suite of models for the visualization of association rules, as desired for most visual data mining tasks.

3.5. Parallel coordinates visual data mining model for association rules

Inventing visual data mining models amounts to conceiving new mapping techniques from a multidimensional space to a lower dimensional space, even in the case of association rules. As each attribute participating in a rule adds a dimension to our data space, we try to map each association rule existing in $\mathbb{R}^n$ to a lower dimensional space that can be represented on a display. The main issue that arises with such an attempt is not only how to display data having many more than two variables but also how to represent the underlying information and knowledge of a rule in our visualization. That makes our task even more demanding. For the visualization of multivariate data sets, many ingenious methodologies that visually encode multivariate point sets have been developed. Many of them, though, are laborious, with a high representational complexity that limits the number of variables that can be handled, and they lose valuable information. Our approach is based on the parallel coordinates system. In geometry, parallelism, which does not require the notion of angle, rather than orthogonality is the more fundamental concept. This, coupled with the fact that orthogonality “uses up” the plane very fast, was the inspiration for parallel coordinates [8]. Following those notions, as well as the fundamental concepts of the parallel coordinates system, we came up with the parallel coordinates visual data mining model for association rules.
In the original idea of the parallel coordinates system the goal was the visualization of multidimensional geometry and multivariate problems without loss of information. In our case, our attempt is additionally to find a mapping capable of transforming association rules into a form in which they can be visualized according to the principal ideas of parallel coordinates. Analogously to the original parallel coordinates system, we map the attributes participating in any rule's sub-expression to equidistant axes, which are parallel to one of the display axes. Each rule, viewed as a point in the n-dimensional space, is presented as a polygonal line intersecting each of the axes at the point that corresponds to the characteristics of the considered sub-expression. Every time we follow the trace of a rule in this representation, inspecting a crossing point between the corresponding line segment and a parallel axis is like gradually examining an additional sub-expression of the rule. Several issues arise from those short statements. We should first clarify what we mean by the term sub-expression's corresponding point and how the crossing point of the line segment and the parallel coordinate reveals the sub-expression's characteristics. In other words, we should define the mapping procedure


that we follow in the representation of each sub-expression for both categorical and numerical attributes. Furthermore, when visualizing a set of rules, we should also examine how to treat the cases where an attribute does not participate in a rule, yet has a corresponding axis in our visualization. That occurs when the attribute appears in at least one sub-expression of another rule that is also included in the set of rules to be visualized. In the original parallel coordinates model we assume that all n-dimensional points have specific values along all dimensions. That does not hold in our case, as our attempt is to define a model flexible enough to represent a set of rules that are not necessarily composed of exactly the same attributes. Last but not least, one of the difficult problems in using parallel coordinates is to define the axes ordering and find an axes permutation that is “good” for our representation. Addressing such an issue should result in a model that is invariant to the ordering of the rules in the data set as well as to the ordering of the sub-expressions in a rule. Furthermore, we want a robust model that produces the same representation for the same set of rules, and a smart ordering that enhances knowledge acquisition. Having posed those specific requirements, and starting with the numerical sub-expressions, we propose that the crossing point in those cases be the mean value of the numerical range. That applies, of course, when the range has both the lower and upper limits specified. In the case of infinite ranges, with either the upper or the lower limit unspecified, we define the crossing point to be the finite limit of the range. The type of the sub-expression is discriminated by the marking type of the crossing point. In detail, when we are visualizing a sub-expression with both limits specified, a small horizontal line segment marks that case.
The length of that marking line is proportional to the range, indicating its width. In the other cases, with infinite sub-expressions, the marker is an arrow pointing from the starting point of the range towards the infinite value (either $-\infty$ or $+\infty$, depending on the case). That formalization gives additional hints regarding the detailed characteristics of the numerical sub-expression. Regarding categorical sub-expressions, we present two different approaches depending on the characteristics of the categorical attribute. Categorical attributes have a finite set of values that are either self-characterized, designating their ordering (e.g. high, medium, low), or provide general clustering information (e.g. married, single, divorced, widowed), usually with a larger number of distinct values. Self-characterized categorical attributes usually have an obvious assignment on an axis. In the latter case, though, the representation depends on an arbitrary assignment of numerical values to the distinct categorical values. Moreover, there is no straightforward way to indicate the participation of more than one categorical value in a single sub-expression (e.g. attr IN {A,B}) according to our modelling approach, as we cannot have more than one crossing point between a line segment and a parallel axis. For cases where the categorical attribute does not have an obvious or predefined ordering, or where a sub-expression assigns more than one categorical value to an attribute, we propose the utilization of the “dimensional expansion”


technique (flattening) [17]. According to this technique each categorical attribute is expanded to as many new dimensions as the number of categorical values that it can take. In other words, we have as many parallel axes for a categorical attribute as the number of distinct categorical values that it can take. For each of those axes, a crossing point with a line segment at the highest point indicates the participation of the corresponding categorical value in that rule's sub-expression. In the opposite case, when the crossing point is at the minimum of an axis, we represent the non-use of that value. The number of dimensions of the data space, and as a consequence the number of parallel axes, is increased by the application of this method (Fig. 8). However, the parallel coordinates visualization technique can better illustrate the categorical nature of the sub-expressions expanded in that way. For the representation of general categorical attributes with many distinct values, and particularly for sub-expressions with many categorical values, the utilization of this technique is considered essential. Mining for clusters and patterns of rules can be significantly enhanced by the dimensional expansion method, as we obtain a detailed view refined even to the level of a single attribute. Regarding attributes that do not participate in a rule yet have a corresponding axis, our modelling suggestion is that the rule's polygonal line should have a faint colour in that area and that there should be no connection marker between the line segment and the parallel axis. That approach results in a clear representation, indicating that there is no relevance between that rule and the corresponding numerical or categorical attribute.
As far as the ordering of the attributes, and as a consequence the sequence of parallel coordinates, is concerned, we suggest that the attributes first be divided into two categories depending on their type (numerical and categorical) (Fig. 9). Each category, ordered according to the frequency of each attribute in the set of rules, produces the overall ordering. This simplified choice is justified by the fact that we do not want to add more computational complexity, as this solution was found to be acceptable for our case. The brightness of the colour and the thickness of the line segment corresponding to a specific rule are utilized for encoding the support and confidence factors. In Fig. 26 we present an indicative example of rules derived from the echocardiogram [16] data set, visualized with the parallel coordinates model. The common behaviour that some rules share is quite evident.
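The crossing-point rule for numerical sub-expressions and the frequency-based axis ordering of this section can be illustrated with a short sketch. The marker labels, the function names, the placement of the numerical group before the categorical one, the descending frequency order and the alphabetical tie-break are assumptions introduced here for illustration; the text only states that attributes are grouped by type and ordered by frequency.

```python
from collections import Counter
from typing import Dict, Iterable, List, Optional, Sequence, Tuple


def crossing_point(lo: Optional[float],
                   hi: Optional[float]) -> Tuple[float, str]:
    """Return (position on the axis, marker type) for a numerical sub-expression.

    Both limits given: mean of the range, marked by a short line segment.
    One limit missing (None): the finite limit, marked by an arrow towards infinity.
    """
    if lo is not None and hi is not None:
        return (lo + hi) / 2, "segment"
    if lo is not None:
        return lo, "arrow_to_plus_inf"
    if hi is not None:
        return hi, "arrow_to_minus_inf"
    raise ValueError("a sub-expression needs at least one finite limit")


def order_axes(rules: Sequence[Iterable[str]],
               attr_type: Dict[str, str]) -> List[str]:
    """Order axes by type, then by how many rules each attribute appears in.

    The attribute name is the final sort key, so the same set of rules always
    yields the same ordering regardless of rule or sub-expression order.
    """
    freq = Counter(attr for rule in rules for attr in set(rule))
    return sorted(freq, key=lambda a: (attr_type[a] != "numerical", -freq[a], a))


rules = [{"age", "income"}, {"age", "status"}, {"income", "age", "sex"}]
types = {"age": "numerical", "income": "numerical",
         "status": "categorical", "sex": "categorical"}
axes = order_axes(rules, types)
```

For the three illustrative rules, `age` appears in all of them and `income` in two, so the numerical axes come out as `age`, `income`; the two categorical attributes appear once each and are ordered alphabetically.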

Name    Income
Name1   High
Name2   Medium
Name3   Low
Name4   High
Name5   Low

Name    IncomeHigh  IncomeMedium  IncomeLow
Name1   1           0             0
Name2   0           1             0
Name3   0           0             1
Name4   1           0             0
Name5   0           0             1

Fig. 8. Dimensional expansion of a data set.
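The dimensional expansion shown in Fig. 8 corresponds to replacing one categorical column by one indicator column per categorical value. A minimal sketch, assuming records are plain dictionaries and using a hypothetical helper name, reproduces the figure's example:

```python
from typing import Dict, List, Sequence


def expand_categorical(rows: Sequence[Dict[str, object]],
                       attribute: str,
                       values: Sequence[str]) -> List[Dict[str, object]]:
    """Replace `attribute` by one 0/1 column per categorical value."""
    expanded = []
    for row in rows:
        # Keep every other column unchanged.
        new_row = {k: v for k, v in row.items() if k != attribute}
        for value in values:
            new_row[f"{attribute}{value}"] = 1 if row[attribute] == value else 0
        expanded.append(new_row)
    return expanded


data = [
    {"Name": "Name1", "Income": "High"},
    {"Name": "Name2", "Income": "Medium"},
    {"Name": "Name3", "Income": "Low"},
    {"Name": "Name4", "Income": "High"},
    {"Name": "Name5", "Income": "Low"},
]
flat = expand_categorical(data, "Income", ["High", "Medium", "Low"])
```

For a rule sub-expression that lists several values of the same attribute, the same idea applies with more than one indicator set to 1, which is what allows a single polygonal line to express such a condition across the expanded axes.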


Fig. 9. Parallel coordinates model.

3.6. Evaluation of the visual data mining models for association rules

The three techniques presented form a suite of visual data mining models for association rules. By providing different perspectives and points of view over a set of association rules, they support the visual data mining effort by making possible the visual extraction of knowledge in a drill-down manner. By utilizing the adequate model we may extract detailed or general inferences from the refined or abstract representations. We consider that, going from the concise bar chart form model to the parallel coordinates model and finally to the grid form model, we construct abstract, middle-level and detailed representations, respectively. We believe that the contribution of their combined application is greater than the sum of their contributions as individual techniques.

3.6.1. Dermatology case study

For our case study we have chosen the field of medicine, and more specifically the mining for association rules in the dermatology data set of the UCI Repository of machine learning databases [16]. The differential diagnosis of erythemato-squamous diseases is a real problem in dermatology. They all share the clinical features of erythema and scaling, with very few differences. The diseases in this group are psoriasis, seborrheic dermatitis, lichen planus, pityriasis rosea, chronic dermatitis and pityriasis rubra pilaris. Usually a biopsy is necessary for the diagnosis but


unfortunately these diseases share many histopathological features as well. Another difficulty for the differential diagnosis is that a disease may show the features of another disease at the beginning stage and may have the characteristic features at the following stages. For the construction of this data set patients were first evaluated clinically with 12 features. Afterwards, skin samples were taken for the evaluation of 22 histopathological features. The values of the histopathological features are determined by an analysis of the samples under a microscope. The dataset contains 34 attributes, 33 of which are linear valued and one of them is nominal. The number of instances is 366. Over this dataset we have performed mining for association rules by the utilization of the data mining tool Envisioner [18]. The resulting rules were then provided as input for our models. To begin with, after the mining for association rules we selected those rules that had a middle or low level of support with high confidence factor. That decision was made as rules with high support and confidence were obvious for the experts of the field. That fact led us to the direction of trying to mine for new knowledge in the stack of rules that is commonly not analyzed, usually left out of the mining process. That is due to the laborious effort that those rules require in order to be comprehended, as their large number and variations are quite confusing. That is the main reason why it is difficult for the knowledge engineer to perceive their knowledge and derive combined inferences. For the set of rules derived from the dermatology data set and the sub-set of rules selected, we applied the bar chart form model along with the utilization of the similarity criterion and the smart-2D placement. The resulting representation is shown in Fig. 10. 
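The selection step described above (keep rules with middle or low support but high confidence) amounts to a simple filter. The sketch below is illustrative: the `Rule` class, the function name and both thresholds are assumptions, since the text does not state the cut-off values used, and the example rules are invented placeholders rather than mined results.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    antecedent: str
    consequent: str
    support: float
    confidence: float


def select_overlooked(rules, max_support=0.3, min_confidence=0.9):
    """Keep rules whose support is at most `max_support` and whose
    confidence is at least `min_confidence` (both thresholds assumed)."""
    return [
        r for r in rules
        if r.support <= max_support and r.confidence >= min_confidence
    ]


# Placeholder rules for illustration only.
candidates = [
    Rule("a", "x", support=0.60, confidence=0.95),  # high support: obvious to experts
    Rule("b", "y", support=0.10, confidence=0.97),  # low support, high confidence: kept
    Rule("c", "z", support=0.05, confidence=0.40),  # low confidence: dropped
]
selected = select_overlooked(candidates)
```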
As can be seen from that view, if we omit several rules at the lower-right part of the visualization area, there are distinctive similarities among the remaining rules. Our placement algorithm positioned the less relevant rules of the subset at the lower-right part of the visualization area, allowing four clusters of rules with distinctive similarities to form in the remaining area. In the figure we have marked the four similar sub-sets of rules. With the interactive functionality provided we can select those rules and proceed to the following step of our quest for interesting inferences. Having inspected the neighbourhoods of interesting rules, we finally selected the A, B and C clusters as marked in Fig. 10. Those rules had sub-expressions of the “elongation of the rete ridges” attribute with ranges of small values. In Fig. 11 the selected sub-set of rules has been visualized according to the parallel coordinates model. For this representation we considered it necessary to modify the values slightly in order to avoid intense overlapping, and to assign zero values to the non-participating attributes. Through interaction we would have been able to distinguish among the rules, but for their printed presentation we considered a slight rearrangement more appropriate, so that we could comment clearly upon the constructed view. This method of slightly disturbing the position of overlapped glyphs in a representation is commonly used in the field of visualization. By a simple investigation of their representation we can easily distinguish the common patterns that those three clusters of rules share. Following the trace of those


Fig. 10. Evaluation of the bar chart form model—dermatology set of rules.

rules, in order to observe their behaviour, we notice that the first five fields can only partition those rules into two clusters. Once the “munro microabcess” attribute enters the scene, the rules are partitioned into three clusters. That conforms to our knowledge that those diseases share common clinical and histopathological features. Interactively investigating this representation, we noticed that the two classes (“psoriasis” and “seboreic dermatitis”) can be distinguished based on three fields. The combined overview of all the rules indicates that middle values of “band-like infiltrate” (approximately 1.5) with low values of “munro microabcess” (approximately 0) and middle values of “PNL infiltrate” (approximately 1.5) point to the classification of those cases as “seboreic dermatitis”. Examining the remaining sub-set of rules, we find no contradiction to the previous conclusion, as cases with low values of “band-like infiltrate” (approximately 0), middle values of “munro microabcess” (approximately 1.5) and low values of “PNL infiltrate” (approximately 0) are classified as “psoriasis”. Although those inferences seem interesting, we cannot yet assess their strength, as we extracted them from a representation with a middle level of detail. What we can say with confidence is that there are distinctive patterns of similarity among those rules, and that interesting inferences will probably emerge when we manage to combine their knowledge. Additionally,


Fig. 11. Evaluation of the parallel coordinates model—dermatology selected set of rules.

attributes “band-like infiltrate”, “munro microabcess” and “PNL infiltrate” can probably play a pivotal role in the discrimination and classification of the cases. In order to verify our conclusions and derive detailed inferences, we use the grid form model. At this step of our visual data mining attempt, this model is more suitable, as it can represent a rule at the refined detail of a sub-expression. The flattening technique has been applied to the categorical attribute “Class”, as we wanted to highlight the distinctive classification of the rules visualized. The investigation of Fig. 12 indicates that our inferences should be enriched. The selection of the “munro microabcess” attribute is not the best choice. A more accurate and stronger inference can be made if, instead of the “munro microabcess” factor, we consider the “elongation of the rete ridges” attribute. The core idea of our inference can be stated as follows: cases with a higher than middle indication of “PNL infiltrate” (higher than 1.5) are classified as “Seboreic dermatitis”. “Psoriasis” is indicated in the cases where we have higher than middle values of “elongation of the rete ridges” and a small indication of “Spongiosis”. Both inferences hold true for cases without high values of “elongation of the rete ridges”, as that was our initial criterion for the selection of the clustered rules.


Fig. 12. Evaluation of the grid form model—dermatology selected set of rules.

As a final step, the knowledge extractor has the option to examine the rules thoroughly with the detailed bar chart form model (Fig. 27). In that view all the components constituting a rule are precisely mapped in their representation. By scrolling down the view we can examine the ordered sequence of rules as produced according to their similarity criterion. That detailed view lets the knowledge extractor apply his or her expertise to verify the derived inferences once more and make the final conclusive remarks. The steps that we have followed throughout the evaluation scenario presented the core notions of our work. The detailed advantages of each model could be illustrated in even more demanding case studies, where back-and-forth visual data mining steps might be necessary. That would result in a higher level of collaboration among the visualization models and between the models and the expert.

3.6.2. Visual mining of association rules: adult case study

In this section we present a new case study where the proposed models were applied in order to investigate their capabilities to a greater extent. We will try to demonstrate their adaptability and their ability to cope with different scenarios. This indicative case study, along with the previously presented scenario, will provide a solid view of the perspective and potential of those models.


To begin with, we selected the adult data set [16]. This data set has previously been used in several classification experiments. The task is to predict whether a given adult makes less or more than $50,000 a year based on attributes such as education, hours of work per week, etc. The 48,842 instances have six continuous and eight categorical attributes. We selected this data set for its interesting properties and in order to demonstrate the behaviour of our models with categorical attributes. The large set of rules generated by Envisioner [18] was a mixture of numerical and categorical values. Many of the rules seemed obvious or of no interest if analysed separately. Those factors, along with the size of the rule set, made it difficult to analyse the rules and investigate them in combination in order to reach interesting and general inferences that would hold true with confidence. We would like to find out how our models cope with such cases, and whether they can enhance the combined examination of association rules of any kind (numerical or categorical), allowing the extraction of interesting conclusions. As in the previous scenario, we start our visual mining attempt with the concise bar chart form model. The sub-set of rules selected was derived by investigating associations between the class of the adult (income greater or less than $50,000) and the capital gain, the education and the age of that adult. We restricted the mining for association rules to those attributes as we were mostly interested in a clear demonstration of our ideas. Following an analogous approach in our case studies, we concluded that the models are capable of expanding to deal with larger cases and sets of rules. That expansion, though, requires an analogous increase of involvement on the user's behalf, which was actually one of our main goals. In Fig. 28 we present the constructed representation of the set according to the abstract bar chart form model. Again, the smart 2D placement of ordered rules based on their similarity has been used. The categorical attribute education has been expanded, with the bars in the 2nd, 3rd and 4th columns of each chart designating education of “Bachelors”, “High School graduate” and “College” respectively. The 1st and 5th columns have been assigned to “capital gain” and “age”. The bright blue and the black bars of the two rightmost columns (7th and 8th columns of the chart, as presented in the corresponding colour illustration) indicate income higher and lower than $50K respectively. Because those rules had many common sub-expressions, the clusters of similar rules are not distinguishable at first examination. Actually, they seem to form a single cluster of similar rules. Patterns of similar sub-expressions can nevertheless be noticed. Distinguishable rules with bright background colour, indicating high levels of support and confidence, attract our interest. Investigating the constructed representation, we notice that large levels of “capital gain” (1st column) are commonly associated with education at the bachelor level (2nd blue bar) and result in the greater than 50K (G50K) class (7th bright blue bar). Moreover, the 3rd rule of the first row and the 3rd of the first column indicate with high support and confidence that adults with no or low capital gain


and education of either high school or college are more likely to be classified into the lower than 50K (L50K) class (8th black bar). When the selected set of rules is visualized according to the parallel coordinates model, the representation of Fig. 13 is constructed. This view offers a more detailed perspective, in which patterns of common sub-expressions among the rules are easily distinguishable. Again, the categorical attributes have been expanded in order to have a detailed view of each rule's composing factors. The dimensional expansion technique, which was first introduced for the purposes of this model, highlights interesting patterns of common sub-expressions through the detailed expansion of determinant categorical attributes. To begin with, the assumptions made based on the bar chart form model can be further verified in this view. If we look at the corresponding representations of the rules investigated in the previous model, we can conclude that the hypotheses made are valid. Furthermore, the examination of the group of segment lines with high capital gain refines the previous inference with respect to the age of the adults. We can argue that older adults with high capital gain and a bachelor education are the group most likely to earn more than $50K. That additional conclusion was based on the bright colours of the thick segment lines with high marked values on the capital gain parallel axis.

Fig. 13. Parallel coordinates model—adult data set of rules.


Finally, following the direction of our mining attempt, we examine the view of the selected set of rules constructed with the grid form model, as shown in Fig. 14. Several new and complementary observations arose from the investigation of this visual data mining model. In rules (R5–R6; R11–R12; R21–R22) high levels of capital gain are associated with education at the bachelor level. Moreover, in some of these cases, with high support and confidence factors, the same assumption holds true for adults of greater than average age. These cases though, as expected, are more likely to fall in the class of lower than $50K, as it is not that easy to earn more than $50K per year. On the other hand, those adults are the only group with significant support and confidence for earning more than $50K. The investigation of this model led us first to a new assumption, which later became a conclusion through its verification. Noticing the alternating pattern formed in our view by the classification bars (ClG50K, ClL50K) and the analogous pattern followed by the confidence bars, we focused our interest on those rules. Examining rules (R1, R2), we can assume that an adult with an education of some college is more likely to belong to the class of L50K. The same assumption led us to the inspection of rules (R3, R4), (R7, R8), (R9, R10), (R13, R14), (R15, R16), (R118, R117), (R120, R119), which

Fig. 14. Grid form model—adult data set of rules.


are associations of adults of various ages, education of college or high school and small capital gain. The combined inference drawn from all those rules is complementary to the one already derived and suggests, with high confidence factors, that adults without a bachelor education and with small capital gain are more likely to belong to the L50K class, irrespective of their age. The two derived conclusions support each other's validity. Expressed in a different way, they both lead to the same notion, although they were derived by the visual mining of two different sets of association rules. The inference, although expected, illustrates our approach and presents the abilities and behaviour of our models. The ability to interact among the models and between the models and the user, together with leading the knowledge engineer towards a drill-down visual mining approach, makes possible the combined derivation of conclusions regarding the association rules represented.

4. Visualizing relevance analysis outcomes

Following the track of our research interest, we proceed in this section to the definition of visual data mining models for the representation of outcomes produced by relevance analysis tasks.

4.1. Solar plexus visual data mining model for relevance analysis

Our aim was to suggest a flexible, information-rich and robust visualization that would give, in a pleasant and informative way, a clear indication of the relevancies among the attributes under examination. In that context, we suggest a model where each attribute is represented by a circle. In the centre of our model we have the attribute whose relevance we inspect. The target attributes are placed in a circular form around the examined attribute in arcs of equal size. The closer a circle is to the centre, the more relevant the corresponding target attribute is to the main attribute. Additional indications of the relevance factor are the thickness of the connection arrows as well as the radius and the brightness of the sphere. A thick connection arrow between the examined attribute and a target attribute, together with a large sphere of bright colour, indicates a strong relevance between the corresponding attributes. An indicative example is presented in Fig. 15. If the number of relevant attributes under examination is larger than a specific threshold, which was found to be ten in our testing procedures, then we should consider the expansion of our model into three dimensions. Otherwise, the previous model cannot clearly present the underlying relevancies, due to overlapping or dense placement of the spheres. The leading idea for this attempt is that the most relevant attributes should be placed in the foreground of our world.
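The 2D placement rules described above (equal arcs around the centre, distance shrinking with relevance, sphere size, brightness and arrow thickness growing with it) can be sketched as follows. This is only a sketch under stated assumptions: relevance factors are taken to be normalized to [0, 1], all mappings are linear, and the function name, parameter names and numeric ranges are hypothetical.

```python
import math


def solar_plexus_layout(relevance, r_min=1.0, r_max=5.0):
    """Place each target attribute around the examined attribute at (0, 0).

    `relevance` maps attribute name -> relevance factor in [0, 1] (assumed
    normalization). Higher relevance gives a smaller distance from the centre.
    """
    names = list(relevance)
    if not names:
        return {}
    step = 2 * math.pi / len(names)  # equal-size arcs
    layout = {}
    for k, name in enumerate(names):
        rel = relevance[name]
        distance = r_max - rel * (r_max - r_min)  # closer means more relevant
        angle = k * step
        layout[name] = {
            "x": distance * math.cos(angle),
            "y": distance * math.sin(angle),
            "radius": 0.2 + 0.8 * rel,        # larger sphere for higher relevance
            "brightness": rel,                # brighter colour for higher relevance
            "arrow_width": 1.0 + 4.0 * rel,   # thicker arrow for higher relevance
        }
    return layout


layout = solar_plexus_layout({"strong": 1.0, "weak": 0.0})
```

Keeping every visual channel a monotone function of the same relevance factor means the channels reinforce rather than contradict one another, which is the redundancy the model relies on.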
These notions imply the ordering of the attributes according to their relevance factor along with their smoothed placement in the dimension of depth in order to produce an overall smart representation of the spheres in the


Fig. 15. Solar plexus model.

3D world constructed. Preserving our original idea that the distance between the examined attribute and each of the target attributes should reflect the relevance factor, the resulting placement is the 3D-snail placement of the spheres, as shown in Fig. 16. Taking into consideration the temporal character that the relevance outcomes might have, we have included the time aspect in our model. For the representation of the relevancies among the examined attributes at each specific time point, we follow the previously defined model in either its 2D or 3D form. The series of time-stamped sub-sets of relevance analysis outcomes will produce an animated representation, where the evolution of each sphere's position, radius and colour reveals its temporal behaviour.

4.2. Time-oriented parallel coordinates visual data mining model for relevance analysis

When particular attention needs to be paid to the time aspect of the relevance analysis task, we suggest the time-oriented parallel coordinates visualization model. The aim is to represent the evolution of the relevance factor between the examined and the target attributes as time passes. In such cases, our modelling is based on the underlying ideas of the parallel coordinates technique and the segment line-plot. Our perspective suggests that the trace of the segment line in the plot, which corresponds to a target attribute, should reveal the evolution over time of its relevance factor. That implies that the crossing points over the parallel axes should represent the relevance factor at that specific time point. That is achieved by


Fig. 16. Solar plexus model 3D-snail placement.

adopting several representational assumptions. The x-axis has been time-stamped with a parallel axis at each time point. On each parallel axis we indicate the relevance factor by a marking point, which is the crossing point for the segment line. Extending those notions to a set of target attributes, we may construct the representation shown in Fig. 17. Such a representation gives us an overall notion of the history of the relevance analysis task in a single view. Additionally, the simplicity of the overall model poses no computational or representational complexities, and since this manner of representation is familiar to analysts, it results in an easily perceived flow of information and knowledge. Detected patterns in the representation indicate similarities in the evolution of the corresponding relevant attributes and support the derivation of temporal inferences.

4.3. Evaluation of the visual data mining models for relevance analysis

For the evaluation of the visual data mining model for relevance analysis outcomes we selected the Spambase data set [16]. This database was constructed by collecting spam and non-spam e-mails and processing them in order to derive factors such as the frequency of specific words and characters. Performing relevance


Fig. 17. Time-oriented parallel coordinates model for relevance analysis.

analysis with Envisioner [18], we visualized the derived outcomes with the proposed model. The resulting representation is shown in Fig. 29. Although the number of examined and visualized attributes is relatively large, there is a clear indication of the most relevant attributes. By navigating in the constructed 3D world we can focus on the attributes of high, medium or low relevance and examine their exact relevance factor. A hypothetical time-oriented scenario of the case study presented was constructed for the evaluation of the time-oriented parallel coordinates model. We formed a set of time-stamped relevance analysis outcomes indicating the evolution of the relevance factors of the Spambase attributes over a period of twelve months. As can be seen from the representation of Fig. 18, factors such as the average length of uninterrupted sequences of capital letters, the length of the longest uninterrupted sequence of capital letters, the total number of capital letters in the e-mail and the frequency of the exclamation mark character show an increasing trend of relevance in determining whether an e-mail is spam as time passes. On the other hand, factors such as the frequency of the words “free” and “money” and the frequency of the dollar character show a periodic behaviour, strongly influencing the categorization of an e-mail during the summer months.
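The increasing trends read off the time-oriented view could also be flagged numerically. The sketch below assumes each attribute's polyline is stored as a list of relevance factors, one per time point, and uses the least-squares slope as a trend measure. The function names, the threshold and the sample values are all hypothetical; the values are placeholders, not the outcomes shown in Fig. 18.

```python
def trend_slope(series):
    """Least-squares slope of relevance against the time index 0, 1, 2, ..."""
    n = len(series)
    if n < 2:
        raise ValueError("need at least two time points")
    t_mean = (n - 1) / 2
    v_mean = sum(series) / n
    num = sum((t - t_mean) * (v - v_mean) for t, v in enumerate(series))
    den = sum((t - t_mean) ** 2 for t in range(n))
    return num / den


def rising_attributes(history, min_slope=0.01):
    """Names of attributes whose relevance grows by at least `min_slope` per step."""
    return sorted(name for name, s in history.items() if trend_slope(s) >= min_slope)


# Placeholder polylines: one steadily rising, one oscillating.
history = {
    "capital_run_length_total": [0.10, 0.15, 0.20, 0.25, 0.30],
    "word_freq_free": [0.20, 0.60, 0.20, 0.60, 0.20],
}
rising = rising_attributes(history)
```

A slope only captures monotone trends; the periodic behaviour mentioned above would need a different measure, and is left to visual inspection here.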


Fig. 18. Evaluation of the time-oriented parallel Coordinates model.

5. Visualizing classification outcomes

In our attempt to graphically reveal the knowledge extracted by a classifier, we have mainly based our research effort on the underlying ideas of the geometric projection techniques [19]. From their study we concluded that those methods are the most promising framework on which to base our attempt to meet the requirements posed in the area of visualizing classification outcomes. According to the mathematical formalization that we have adopted, a tuple $t_i$ categorized in the class $c_i$ can be expressed as

\[ t_{i1} = v_{i1},\; t_{i2} = v_{i2},\; \ldots,\; t_{iN} = v_{iN},\; t_{i,N+1} = c_i \]

with the last field being the classification attribute. The data may naturally occur in this form, or may be constructed by a data mining classification algorithm. In our modelling we consider numerical values as field values. If we want to include all types of attributes, a mapping of categorical to numerical values is essential, either evident (high = 1, medium = 0, low = −1) or synthetic (single = 0, married = 1, widowed = 3). Among the several geometric projection techniques that we have studied, the most interesting methodology was the Class-Preserving Projection Algorithm [20], due to its robust behaviour and its moderate computational complexity. Those issues are essential to consider when applying techniques in the


field of data mining, where the limitations posed by the commonly large sets of classified data and the computational power provided by modern computers are of critical importance. The main characteristic of classified data embedded in high-dimensional Euclidean space is that proximity in $\mathbb{R}^n$ implies similarity. During the mapping procedure, class-preserving projection techniques carry the properties of the classified data in $\mathbb{R}^n$ over to the projection plane, in order to construct corresponding representations from which accurate inferences can be extracted. Our study of those techniques led to a new geometric projection technique that extends the existing methods in the area of visualizing classified data. That new technique, named the 3D Class-Preserving Projection technique, projects from $\mathbb{R}^n$ to $\mathbb{R}^3$ and is capable of preserving the class distances (discriminating) among a larger number of classes.

5.1. 3D class-preserving projection technique

In this section we introduce class-preserving projections of multidimensional data. The main advantage of those projections is that they maintain the high-dimensional class structure by means of linear projections, which can be displayed on a computer screen. The challenge lies in the choice of the projection planes and the associated projections. Considering the problem of visualizing high-dimensional data that have been categorized into various classes, our goal is to choose those projections that best preserve inter-class and intra-class distances, in order to extract inferences regarding their relationships. In our attempt to extend the existing projection techniques, we worked on the definition of a projection scheme that would result in the construction of a 3D world.
In detail, our attempt was to define a projection scheme that would map from $\mathbb{R}^n$ to $\mathbb{R}^3$ while preserving the properties of the data in the high-dimensional space as well as the discrimination among the classes. In short, it should be a 3D class-preserving projection technique. That attempt stemmed from our belief that the freedom provided by the additional dimension in the projection world would result in an information-rich representation. Less information is lost when mapping from $\mathbb{R}^n$ to $\mathbb{R}^3$ than when mapping to $\mathbb{R}^2$. Moreover, the freedom provided in 3D representations would enhance the knowledge engineer's explanatory attempts. Allowing navigation in the constructed 3D world brings the knowledge engineer into direct contact with the projection of the classified data. These beliefs guided our research effort towards the definition of the 3D Class-Preserving Projection Technique, which is presented in this sub-section. The main idea of this method is that, in order to project onto the 3D space, we should define our orthonormal projection vectors based on four points. If we choose those four points to be the class means of the classes of interest, we maximize the inter-class distances among those four classes in our projection. Such an approach provides the flexibility of distinguishing among four classes instead of three, as well as moving to the 3D projection space.


We consider the case where the data is divided into four classes. Let $x_1, x_2, \ldots, x_N$ be all the $n$-dimensional data points, and let $m_1, m_2, m_3, m_4$ denote the corresponding class-centroids. Let $w_1$, $w_2$ and $w_3$ be an orthonormal basis of the candidate 3D world of projection. Analogously to the previous cases of the 2D projections, the point $x_i$ gets projected to $(w_1^T x_i,\, w_2^T x_i,\, w_3^T x_i)$ and consequently the means $m_j$ get mapped to $(w_1^T m_j,\, w_2^T m_j,\, w_3^T m_j)$, $j = 1, 2, 3, 4$. Again, one way to obtain good separation of the projected classes is to maximize the difference between the projected means. This may be achieved by choosing vectors $w_1, w_2, w_3 \in \mathbb{R}^n$ such that the objective function

\[
C(w_1, w_2, w_3) = \sum_{i=1}^{3} \Big\{ |w_i^T (m_2 - m_1)|^2 + |w_i^T (m_3 - m_1)|^2 + |w_i^T (m_4 - m_1)|^2 + |w_i^T (m_3 - m_2)|^2 + |w_i^T (m_4 - m_2)|^2 + |w_i^T (m_4 - m_3)|^2 \Big\}
\]

is maximized. The above may be rewritten as

\[
C(w_1, w_2, w_3) = \sum_{i=1}^{3} w_i^T \Big\{ (m_2 - m_1)(m_2 - m_1)^T + \cdots + (m_4 - m_3)(m_4 - m_3)^T \Big\} w_i
= w_1^T S_B w_1 + w_2^T S_B w_2 + w_3^T S_B w_3 = \operatorname{trace}(W^T S_B W),
\]

where

\[
W = [w_1, w_2, w_3], \quad w_i^T w_i = 1, \quad w_i^T w_j = 0, \quad i \neq j, \quad i, j = 1, 2, 3
\]

and

\[
S_B = (m_2 - m_1)(m_2 - m_1)^T + \cdots + (m_4 - m_3)(m_4 - m_3)^T .
\]

The positive semi-definite matrix $S_B$ can be interpreted as the inter-class or between-class scatter matrix. Note that $S_B$ has rank $\le 3$, since $(m_3 - m_2) \in \operatorname{span}\{(m_2 - m_1), (m_3 - m_1)\}$, $(m_4 - m_2) \in \operatorname{span}\{(m_4 - m_1), (m_2 - m_1)\}$ and $(m_4 - m_3) \in \operatorname{span}\{(m_4 - m_1), (m_3 - m_1)\}$. It is clear that the search for the maximizing $w_1$, $w_2$ and $w_3$ can be restricted to the column (or row) space of $S_B$. But as we noted above, this space is at most of dimension 3. Thus, in general, the optimal $w_1$, $w_2$ and $w_3$ must form an orthonormal basis spanning the space determined by the vectors $(m_2 - m_1)$, $(m_3 - m_1)$ and $(m_4 - m_1)$. In the degenerate case when $S_B$ is of rank two (i.e. when the four class means are coplanar, with $m_1$, $m_2$ and $m_3$ not themselves collinear), $w_1$ should be in the direction of $m_2 - m_1$ and $w_2$ in the direction of the component of $m_3 - m_1$ orthogonal to $w_1$, while $w_3$ can be chosen to be any unit vector orthogonal to the plane defined by the other two vectors.

5.1.1. Evaluation of the 3D class-preserving projection technique

In order to provide evidence of the advantages of extending the existing 2D projection techniques to three dimensions, we present in this section several evaluation case studies of the newly introduced 3D class-preserving projection technique.


5.1.1.1. Letter image recognition case study

To begin with, we demonstrate the ability of this technique to discriminate among a larger number of classes. In Fig. 30 we have visualized the classes A, B, C and D of the letter image recognition data set, which are represented in the colour figure by the red, green, blue and mauve spheres respectively. In the constructed 3D representation, class A is clearly distinct from the adjacent arrangement of classes B, C and D. As expected, the similar curves of those characters resulted in the neighbouring placement of the corresponding classes in the n-dimensional space, which was preserved in the projection to our 3D world. Furthermore, because letters B and D share a similar straight line, the representations of classes B and D lie even closer together. Those constituent basic shapes controlled the placement of the corresponding classes. Our first impression is that we have managed to define a technique that projects to three dimensions while preserving the properties of the classified data in the high-dimensional space. In other words, we have constructed, theoretically and practically, a 3D class-preserving projection technique. For the practical demonstration of our arguments we continue with supplementary case studies, where we apply our technique in the medical field.

Fig. 19. 3D class-preserving projection technique—dermatology (CL 3rd, 4th, 5th, 6th).


5.1.1.2. Dermatology case study. In this case study we again attempt to gain insight into the dermatology data set, this time from a different perspective: that of visually mining the classified data in the high-dimensional space. Instead of analysing the association rules mined, we investigate the properties of the classes in the 34-dimensional dermatology data space. Our aim is to form accurate assumptions regarding the properties of those classes and their relationships. Those assumptions will be translated into corresponding conclusions regarding the dermatological diseases examined, which will then be suggested to the experts of the field for their final evaluation.

As we have already stated, the differential diagnosis of erythemato-squamous diseases is a real problem in dermatology. They all share the clinical features of erythema and scaling, with very few differences. The diseases, and hence the classes, in this group are cronic dermatitis, lichen planus, pityriasis rosea, pityriasis rubra pilaris, psoriasis and seboreic dermatitis.

The tool for our navigation in the high-dimensional classified world will be the 3D class-preserving projection technique. With this technique we can focus on the discrimination among four classes. In Fig. 19, we selected the pityriasis rosea, pityriasis rubra pilaris, psoriasis and seboreic dermatitis classes, which are represented by the red, green, blue and mauve spheres in the colour version and by the medium grey, bright grey, black and dark grey spheres in the grey version respectively. The remaining cases, which belong to the diseases not selected, are represented by the white spheres. From the constructed representation we can see that the selected classes are quite coherent and clearly distinguishable. Each of them, though, has several cases where the distinction is unclear. Those cases are mixed up in the centre of the view with instances of the other classes. In other words, each of the selected classes contains a number of cases with a clear indication of the disease to which they belong, and some others which are quite confusing. This conclusion affirms something that we already knew from the domain information provided for the data set: the diagnosis of erythemato-squamous diseases is difficult due to the clinical features they have in common. The conclusion derived is therefore accurate, which also suggests the accurate behaviour of our 3D class-preserving projection technique. That was achieved by the preservation of the properties of the classified high-dimensional data during the mapping procedure of our technique.

With the 3D class-preserving projection technique we have the option to select any combination of four among the six classes of our data set. The resulting representation will then preserve the best discrimination properties among the selected classes of interest. Selecting the classes of cronic dermatitis, lichen planus, pityriasis rubra pilaris and psoriasis, the representation of Fig. 20 is constructed according to our model. The selection of those diseases was motivated by our attempt to find distinctive classes, which would provide directions for enhancing the diagnosis procedure. We kept the pityriasis rubra pilaris and psoriasis classes in our investigation, as they already showed indications of a clear discrimination between them. From the diseases not yet investigated we selected the other two. In the resulting representation we can see the improved discrimination among the classes visualized and the complete separation of the pityriasis rubra pilaris class. These observations make it possible to offer the experts of the field directions that could enhance the diagnosis procedure. Our observations, in the context of the medical field, can be summarized as follows: if seboreic dermatitis and pityriasis rosea have already been excluded for a patient, the diagnosis among the remaining erythemato-squamous diseases is not that complex, as the discrimination among these diseases is quite clear. Furthermore, for the patients belonging to the pityriasis rubra pilaris class the diagnosis is expected to be derived with confidence (if all the necessary clinical examinations have been performed), as there is a clear discrimination of this disease.

Fig. 20. 3D class-preserving projection technique—dermatology (CL 1st, 2nd, 4th, 5th).

In Figs. 21 and 22 we continue to investigate the relationships among the diseases by selecting in each case the corresponding classes of interest. In Fig. 21 cronic dermatitis, pityriasis rubra pilaris, psoriasis and seboreic dermatitis have been selected, and in Fig. 22 lichen planus, pityriasis rosea, psoriasis and seboreic dermatitis. In both examples the classes are represented by the red, green, blue and mauve spheres in the colour version and by medium grey, bright grey, black and dark grey in the grey version respectively. Visually mining the constructed representations leads to inferences analogous to the previous ones, which support our arguments and strengthen our belief in the usefulness of the technique introduced. As our navigational tool, the 3D class-preserving projection technique helped us gain insight into the high-dimensional data space, accurately highlighting properties of the classified data (Figs. 23–30).

Fig. 21. 3D class-preserving projection technique—dermatology (CL 1st, 4th, 5th, 6th).

Fig. 22. 3D class-preserving projection technique—dermatology (CL 2nd, 3rd, 5th, 6th).
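Since any four of the six dermatology classes can be selected, there are fifteen possible views, of which Figs. 19–22 show four. The short Python sketch below enumerates them. It assumes that the class numbering used in the figure captions (CL 1st to 6th) follows the order in which the text lists the diseases, an assumption that agrees with the selections described for Figs. 19–22; the helper name is hypothetical.

```python
from itertools import combinations

# Class names as spelled in the text, in the order in which it lists them.
# Assumption: this order matches the "CL 1st ... 6th" numbering of the captions.
CLASSES = [
    "cronic dermatitis",
    "lichen planus",
    "pityriasis rosea",
    "pityriasis rubra pilaris",
    "psoriasis",
    "seboreic dermatitis",
]


def four_class_selections(classes):
    """Return every selection of four classes as (1-based indices, names)."""
    indexed = list(enumerate(classes, start=1))
    return [
        (tuple(i for i, _ in combo), tuple(name for _, name in combo))
        for combo in combinations(indexed, 4)
    ]


selections = four_class_selections(CLASSES)
for indices, names in selections:
    print(indices, names)
```

Each index tuple identifies one candidate view; for example, `(3, 4, 5, 6)` corresponds to the selection shown in Fig. 19.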

Fig. 23. Bar chart form model—enhancing colour.

Fig. 24. Bar chart form model—support & confidence.


Fig. 25. Grid form model.

Fig. 26. Parallel coordinates model—example.

Fig. 27. Evaluation of the detailed bar chart form model—dermatology selected set of rules.

Fig. 28. Bar chart form model—adult data set of rules.

Fig. 29. Evaluation of the solar plexus model.

Fig. 30. 3D class-preserving projection technique—letter image recognition.

6. Conclusions and future work

With the proposed visual data mining models, our effort focused on devising visual representations of the outcomes produced by common data mining processes. In order to equip the knowledge engineer with a tool for gaining insight into the mined knowledge, we tried to present as much of the extracted information as possible in a humanly perceivable way. Additionally, our basic guideline was that the construction of each visual representation, as well as the definition of the underlying visual data mining model, should harmoniously map the constituent elements and notions of the corresponding type of mining outcome.

The models proposed have distinctive advantages, addressing the commonly tedious issues that the knowledge engineer handles during the exploitation of the mining outcomes. Furthermore, their ability to combine forces and enhance the information flow between them and the user brings us one step closer to making the human part of the data mining process, in order to exploit the human's unmatched abilities of perception.

Our future work is mainly targeted at the extensive evaluation of those models, along with the development of the advanced visualization features not yet implemented in VizMiner [21]. Furthermore, extending those techniques to address issues such as attribute ordering, the mapping of categorical attributes and robust model behaviour for large sets of outcomes is among our future plans.

Acknowledgements

Thanks to Dr. Yannis Zorgios for his thoughts and comments on this work and to Dr. Areti Sfrintzeri for her expertise in the medical evaluation scenarios.

References

[1] D.A. Keim, H.-P. Kriegel, Using visualization to support data mining of large existing databases, Proceedings of the IEEE Visualization ’93 Workshop, San Jose, CA, in: Lecture Notes in Computer Science, Vol. 871, Springer, Berlin, 1994, pp. 210–229.
[2] W. Frawley, G. Piatetsky-Shapiro, C. Matheus, Knowledge discovery in databases: an overview, AI Magazine 13 (1992) 213–228.
[3] M. Ganesh, E.-H. Han, V. Kumar, Visual data mining: framework and algorithmic development, Department of Computer and Information Sciences, University of Minnesota, Minneapolis, 1996.
[4] K.H. Thearling, B.G. Becker, D. Decoste, W. Mawby, M. Pilote, D. Sommerfield, Visualizing data mining models, in: Information Visualization in Data Mining and Knowledge Discovery, Morgan Kaufmann, Los Altos, CA, 2001, pp. 205–222.
[5] W.L. Johnston, Model visualization, in: Information Visualization in Data Mining and Knowledge Discovery, Morgan Kaufmann, Los Altos, CA, 2001, pp. 223–227.
[6] D.A. Keim, J.P. Lee, B.M. Thuraisingham, C.M. Wittenbrink, Database issues for data visualization: supporting interactive database exploration, Proceedings of the Workshop on Database Issues for Data Visualization, Atlanta, GA, 1995, in: Lecture Notes in Computer Science, Springer, Berlin, 1996, pp. 12–25.
[7] D.A. Keim, H.-P. Kriegel, Issues in visualizing large databases, Proceedings of the Third IFIP 2.6 Working Conference on Visual Database Systems, Lausanne, Switzerland, in: Visual Database Systems 3, Chapman & Hall, London, 1995, pp. 203–214.
[8] A. Inselberg, Data mining, visualization of high dimensional data, ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2001), Proceedings of the Workshop on Visual Data Mining, San Francisco, USA, 2001, pp. 65–81.
[9] U.M. Fayyad, G.G. Grinstein, Introduction, in: Information Visualization in Data Mining and Knowledge Discovery, Morgan Kaufmann, Los Altos, CA, 2001, pp. 1–17.
[10] D.A. Keim, Visual data mining, tutorial, International Conference on Very Large Databases (VLDB ’97), Athens, Greece, 1997.
[11] D. Law, Y. Foong, A visualization-driven approach for strategic knowledge discovery, in: Information Visualization in Data Mining and Knowledge Discovery, Morgan Kaufmann, Los Altos, CA, 2001, pp. 182–190.
[12] D.A. Keim, H.-P. Kriegel, Possibilities and limits in visualizing large amounts of multidimensional data, in: Perceptual Issues in Visualization, Springer, Berlin, 1995, pp. 203–214.
[13] P. Docherty, A. Beck, A visual metaphor for knowledge discovery. An integrated approach to visualizing the task, data and results, in: Information Visualization in Data Mining and Knowledge Discovery, Morgan Kaufmann, Los Altos, CA, 2001, pp. 191–203.
[14] K. Zhao, B. Liu, Visual analysis of the behavior of discovered rules, in: ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2001), Proceedings of the Workshop on Visual Data Mining, San Francisco, USA, pp. 59–64.
[15] G.G. Grinstein, P. Hoffman, R.M. Pickett, Benchmark development for the evaluation of visualization for data mining, in: Information Visualization in Data Mining and Knowledge Discovery, Morgan Kaufmann, Los Altos, CA, 2001, pp. 129–176.
[16] C.L. Blake, C.J. Merz, UCI Repository of machine learning databases [http://www.ics.uci.edu/~mlearn/MLRepository.html], Department of Information and Computer Science, University of California, Irvine, CA.
[17] P.E. Hoffman, G.G. Grinstein, A survey of visualizations for high-dimensional data mining, in: Information Visualization in Data Mining and Knowledge Discovery, Morgan Kaufmann, Los Altos, CA, 2001, pp. 47–82.
[18] G. Koundourakis, EnVisioner: a data mining framework based on decision trees, Ph.D. Thesis, Department of Computation, University of Manchester Institute of Science and Technology (UMIST), Manchester, UK.
[19] I.S. Dhillon, D.S. Modha, W.S. Spangler, Visualizing class structure of multidimensional data, in: Proceedings of the 30th Symposium on the Interface: Computing Science and Statistics, Interface Foundation of North America, Vol. 30, Minneapolis, May, 1998, pp. 488–493.
[20] I.S. Dhillon, D.S. Modha, W.S. Spangler, Class Visualization of High-Dimensional Data with Application, IBM Almaden Research Center, San Jose, 1999.
[21] I. Kopanakis, B. Theodoulidis, Visual Data Mining and Modelling Techniques, ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2001), Proceedings of the Workshop on Visual Data Mining, San Francisco, USA, 2001, pp. 114–128.