FULL PAPER International Journal of Recent Trends in Engineering, Vol 2, No. 1, November 2009

EVISTA – Interactive Visual Clustering System

K. Thangavel¹, P. Alagambigai²

¹ Department of Computer Science, Periyar University, Salem, Tamilnadu, India. Email: [email protected]
² Department of Computer Applications, Easwari Engineering College, Chennai, Tamilnadu, India. Email: [email protected]

Abstract—Due to the enormous increase in data, exploring and analyzing it is increasingly important but difficult to achieve. Information visualization and visual data mining can help to deal with this. Visual data exploration has high potential, and many applications such as fraud detection and data mining will use information visualization technology for improved data analysis. The advantage of visual data exploration is that the user is directly involved in the data mining process. A large number of information visualization techniques have been developed over the last decade to support the exploration of large data sets. VISTA is an interactive visual cluster rendering system that invites the human into the clustering process, but it has some limitations in identifying the cluster distribution and in human-computer interaction. In this paper, we propose an Enhanced VISTA (EVISTA) that addresses these drawbacks. EVISTA improves the visualization in two ways. First, it uses weighted vector normalization instead of max-min normalization, which improves the data visualization such that the user can understand the underlying pattern without human intervention. Second, it completely eliminates the use of α tuning, which reduces the complexity of visual distance computation and eases human-computer interaction. The experimental results show that EVISTA explores the underlying pattern of the dataset effectively and greatly reduces the user's operational burden.

involvement in the interactive process. More recently there has been a lot of discussion on visualization for data mining. Visual data mining can be viewed as an integration of data visualization and data mining [5, 15]. Considering visualization as a supporting technology in data mining, four possible approaches are stated in [1]. The first approach is the use of visualization techniques to present the results obtained from mining the data in the database. The second approach applies data mining techniques to visualization by capturing essential semantics visually. The third approach uses visualization techniques to complement the data mining techniques. The fourth approach uses visualization techniques to steer the mining process. In general, visualization can be used to explore data, to confirm a hypothesis, or to manipulate a view. Exploratory visualization creates a dynamic scenario in which interaction is critical. The user does not necessarily know what he/she is looking for, can search for structures or trends, and is attempting to arrive at some hypothesis. In confirmatory visualization, the system parameters are often predetermined and the visualization tools are used to confirm or refute the hypothesis. Manipulative visualization focuses on refining the visualization to optimize the presentation. Visualization has been categorized into two major areas: i) scientific visualization, which focuses primarily on physical data such as the human body, and ii) information visualization, which focuses on abstract nonphysical data such as text, hierarchies and statistical data. Data mining techniques are primarily oriented toward information visualization [4]. Both scientific visualization and information visualization create graphical models and visual representations from data that support direct user interaction for exploring and acquiring insight into useful information embedded in the underlying data [10, 15].
Even though visualization techniques have advantages over automatic methods, they bring up some specific problems such as limitations in visibility, visual bias due to the mapping of the dataset to a 2D/3D representation, the need for easy-to-use visual interface operations, and the need for reliable human-computer interaction. In most visualization methods the human-computer interaction costs more than in automated methods [9]. In general, visual data mining is different from scientific visualization and has the following characteristics: a wide range of users, a wide choice of visualization techniques, and an important dialog function. The users of scientific visualization are scientists and engineers who can tolerate some difficulty in using the system, whereas a visual data mining system must have the

Index Terms— Clustering, EVISTA, Human-computer interaction, Information visualization, Visual data mining.

I. INTRODUCTION
Data visualization is essential for understanding the concept of multidimensional spaces [5]. It allows the user to explore the data in different ways at different levels of abstraction to find the right level of detail. Therefore, techniques are most useful if they are highly interactive, permit direct manipulation and offer a rapid response time. Visualization is defined by Ware as “a graphical representation of data or concepts” which is either an “internal construct of the mind” or an “external artifact supporting decision making”. Visualization provides valuable assistance to the human by representing information visually. This assistance may be called cognitive support. Visualization can provide cognitive support through a number of mechanisms, such as grouping related information for easy search and access, representing large volumes of data in a small space, imposing structure on data and tasks to reduce time complexity, and allowing interactive exploration through manipulation of parameter values [11]. Visualization techniques could enhance the current knowledge and data discovery methods by increasing the user
© 2009 ACADEMY PUBLISHER

circumference of the circle C, where the unit vectors are obtained by

possibility that general users can use it widely and easily [16]. Considering this issue, this paper proposes a novel information visualization technique called the enhanced visual clustering system (EVISTA), an extended version of VISTA [8]. VISTA is a dynamic data visualization model that invites the human into the clustering process. Even though VISTA has proved to be an efficient interactive visual cluster rendering system, it requires complete user interaction throughout the clustering process. When the number of dimensions increases, the human-computer interaction becomes tedious. EVISTA is designed to provide efficient data visualization such that the user is able to understand the underlying pattern of the given data set without human intervention. The rest of the paper is organized as follows: Section 2 reviews related work in the domain of information visualization. Section 3 deals with EVISTA. Section 4 discusses the experimental analysis. Section 5 concludes the paper.

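$$ \vec{S}_i = \left( \cos\frac{2\pi i}{k},\ \sin\frac{2\pi i}{k} \right), \quad i = 1, 2, \ldots, k \tag{1} $$

and the 2D point $Q(x, y)$ is obtained by

$$ \{Q_x, Q_y\} = \left\{ \frac{c}{k} \sum_{i=1}^{k} x_i' \cos\frac{2\pi i}{k} - x_0,\ \ \frac{c}{k} \sum_{i=1}^{k} x_i' \sin\frac{2\pi i}{k} - y_0 \right\} \tag{2} $$

$$ x_i' = \frac{wt_i - x_i}{wt_i} \tag{3} $$

where $x_i$ represents the given data object, $x_i'$ represents the normalized data value based on the weighted vector $wt_i$ [14], and

$$ wt_i = \sum_{j = 1, 2, \ldots, n} x_{ij} \tag{4} $$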
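As a rough illustration of Equations (1) to (4), the following Python sketch normalizes each data object by its weighted vector and projects it onto 2D star coordinates. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the default scaling constant c = 1, the default origin (0, 0), and the reading of wt as the sum of the attribute values of one data object (one row) are all assumptions made here for illustration.

```python
import math


def weighted_vector_normalize(row):
    """Eq. (3)-(4): x_i' = (wt - x_i) / wt for every value in one row."""
    # Assumption: wt is the sum of the attribute values of this data object.
    wt = sum(row)
    if wt == 0:
        raise ValueError("weighted vector is zero; Eq. (3) is undefined")
    return [(wt - x) / wt for x in row]


def star_coordinates(values, c=1.0, origin=(0.0, 0.0)):
    """Eq. (1)-(2): project k values onto k axes spread evenly on a circle."""
    k = len(values)
    x0, y0 = origin
    # Axis i points in direction (cos(2*pi*i/k), sin(2*pi*i/k)), i = 1..k.
    qx = (c / k) * sum(v * math.cos(2 * math.pi * i / k)
                       for i, v in enumerate(values, start=1)) - x0
    qy = (c / k) * sum(v * math.sin(2 * math.pi * i / k)
                       for i, v in enumerate(values, start=1)) - y0
    return qx, qy


def evista_project(data, c=1.0, origin=(0.0, 0.0)):
    """Normalize every row, then map it to a 2D point (hypothetical helper)."""
    return [star_coordinates(weighted_vector_normalize(row), c, origin)
            for row in data]


if __name__ == "__main__":
    # Illustrative four-attribute rows, not taken from the paper's results.
    sample = [[5.1, 3.5, 1.4, 0.2], [7.0, 3.2, 4.7, 1.4]]
    for point in evista_project(sample):
        print(point)
```

Note that no per-dimension tuning parameter appears anywhere in this sketch: once a row is normalized, its 2D position is fixed by c and the origin alone.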

II. RELATED WORKS
Various efforts have been made to visualize multidimensional datasets [2, 10, 11, 13]. The early research on general plot-based data visualization is the Grand Tour and Projection Pursuit [2]. The purpose of the Grand Tour and Projection Pursuit is to guide the user to find interesting projections. L. Yang [2] utilizes the Grand Tour technique to show projections of datasets in an animation, projecting the dimensions to coordinates in a 3D space. However, when the 3D space is shown on a 2D screen, some axes may be overlapped by other axes, which makes it hard to perform direct interactions on dimensions. Star coordinates [7] is an interactive visualization model that treats dimensions uniformly; data are represented coarsely by simple, space-efficient points, which results in less cluttered visualization for large data sets. Interactive visual clustering (IVC) [10] combines spring-embedded graph layout techniques with user interaction and constrained clustering. VISTA [8, 9] is a recent visualization model that utilizes the star coordinate system and provides a similar mapping function. There are two types of cluster rendering in the VISTA model: unguided rendering and guided rendering.

EVISTA employs the design of the VISTA visual cluster rendering system proposed by Keke Chen and L. Liu [8], which provides an intuitive way to visualize clusters with interactive feedback to encourage domain experts to participate in the cluster revision and cluster validation process. It allows the user to interactively observe potential clusters in a series of continuously changing visualizations through α. More importantly, it can include algorithmic clustering results and serve as an effective validation and refinement tool for irregularly shaped clusters [9]. The VISTA system has two unique features. First, it implements a linear and reliable visualization model to interactively visualize multidimensional datasets in a 2D star-coordinate space. Second, it provides a rich set of user-friendly interactive rendering operations, allowing users to validate and refine the cluster structure based on their visual experience as well as their domain knowledge. The VISTA visualization model consists of two linear mappings: max-min normalization followed by α-mapping. Max-min normalization, given in Equation (5), is used to normalize the columns in the datasets so as to eliminate the dominating effect of large-valued columns:

$$ v' = \frac{2\,(v - \min_i)}{\max_i - \min_i} - 1 \tag{5} $$

where $v$ is the original value, $v'$ is the normalized value, and $\min_i$ and $\max_i$ are the minimum and maximum values of column $i$. The α-mapping maps k-dimensional points onto a two-dimensional visual space with the convenience of visual parameter tuning. The proposed visualization model EVISTA uses weighted vector normalization, which is performed on rows instead of columns, such that the visualization model defines a reliable position of $Q(x, y)$. EVISTA completely eliminates the use of α tuning, since α-mapping is tedious when the number of dimensions is high, and each change in the α values requires a fresh visual distance computation. As the number of dimensions increases, the visual distance computation may increase the time complexity. Similar effects may occur when the number of data objects increases. This makes

III. ENHANCED VISUAL CLUSTERING SYSTEM
Enhanced VISTA (EVISTA) is an information visualization framework that employs improved data visualization to reveal the hidden patterns in complex high-dimensional data sets without human intervention. The EVISTA model is designed based on star coordinates. The star coordinate system is a traditional multivariate data visualization technique in which the k axes are defined by an origin $\vec{O} = (x_0, y_0)$ and k coordinates $\vec{S}_1, \vec{S}_2, \vec{S}_3, \ldots, \vec{S}_k$ that represent the k dimensions in 2D space. The k coordinates are equidistantly distributed on the

the human computer interaction ineffective and affects the applicability of VISTA.
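For comparison, the VISTA-style pipeline of column-wise max-min normalization (Equation (5)) followed by an α-mapping can be sketched in Python as follows. The paper does not state the α-mapping formula, so the form used here, in which each normalized dimension is scaled by its own α value before the star-coordinate projection, is an assumption based on the description of VISTA in [8]. The function names, the constant c and the origin are likewise hypothetical.

```python
import math


def max_min_normalize(column):
    """Eq. (5): rescale one column to the range [-1, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Constant column: Eq. (5) is undefined, so map it to 0 (assumption).
        return [0.0 for _ in column]
    return [2 * (v - lo) / (hi - lo) - 1 for v in column]


def normalize_columns(data):
    """Apply max-min normalization to every column of a row-major dataset."""
    columns = [max_min_normalize(list(col)) for col in zip(*data)]
    return [list(row) for row in zip(*columns)]


def alpha_mapping(row, alphas, c=1.0, origin=(0.0, 0.0)):
    """Assumed alpha-mapping: weight dimension i by alphas[i], then project."""
    k = len(row)
    x0, y0 = origin
    qx = (c / k) * sum(a * v * math.cos(2 * math.pi * i / k)
                       for i, (a, v) in enumerate(zip(alphas, row), start=1)) - x0
    qy = (c / k) * sum(a * v * math.sin(2 * math.pi * i / k)
                       for i, (a, v) in enumerate(zip(alphas, row), start=1)) - y0
    return qx, qy


if __name__ == "__main__":
    data = [[0.0, 10.0, 3.0], [5.0, 20.0, 1.0], [10.0, 15.0, 2.0]]  # illustrative
    alphas = [1.0, 1.0, 1.0]  # alpha = 1 on every dimension, as in Section IV
    for row in normalize_columns(data):
        print(alpha_mapping(row, alphas))
```

Because every projected point depends on all α values, changing a single α means re-projecting the whole dataset, which is the recurring cost that EVISTA avoids by dropping α tuning.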

IV. EXPERIMENTAL ANALYSIS

To illustrate the efficiency of the proposed visualization, empirical analyses are conducted on a number of benchmark data sets available in the UCI machine learning data repository. The performance of EVISTA is compared against the VISTA system and the automatic clustering algorithm K-Means. The experiments in VISTA are conducted by setting the α value to 1. Detailed information on the data sets is shown in Table I.

TABLE I. DETAILS OF DATASETS

S. No   Data Set        No. of Attributes   No. of Classes   No. of Instances
1       Iris            4                   3                150
2       Breast Cancer   10                  2                699
3       Hepatitis       19                  2                155
4       Bupa            6                   2                345
5       Pima            8                   2                768
6       Australian      14                  2                690

A. Cluster validation
Validation of clusters is very important in cluster analysis, because clustering methods tend to generate clusterings even for fairly homogeneous datasets. The quality of clusters obtained through visual clustering is measured in terms of three classical methods proposed in [3]:
• The Rand index and Jaccard coefficient validations are based on the agreement between clustering results and the “ground truth”.
The classical validity measures are heavily related to the geometry or density nature of clusters, and they do not work well for arbitrarily shaped clusters [8]. In such cases, visual perception plays an important role in deciding the right clusters.

B. Results and Discussion
Iris Data: The Iris dataset is a benchmark dataset widely used in pattern recognition and clustering. It consists of 150 four-dimensional instances from three classes of plants, described by sepal length and width and petal length and width. The Iris dataset consists of three clusters with equal distribution. One cluster is linearly separable from the other two; the latter two are not exactly linearly separable from each other. Figure 1 shows the initial visualization of the Iris dataset in the VISTA model, where we observe the possibility of three clusters. It is observed from the figure that one cluster is completely separated from the other two, whereas the remaining two overlap. After performing interactive visual clustering with suitable α tuning, the visual boundaries between the clusters become clearer. Figure 2 shows the visualization of the Iris dataset after α tuning. As the literature on the Iris dataset specifies, the two clusters are not linearly separable. In VISTA this can be observed after fine tuning of α. The small region consisting of the overlapping data points is also observed. More importantly, separating the two clusters is found to be difficult for users.

Figure 1. Visualization of Iris Dataset using VISTA system

Figure 2. Visualization of Iris Dataset after α tuning using VISTA system

Figure 3. Visualization of Iris Dataset using EVISTA system

In VISTA, domain knowledge plays a vital role in finding the optimum number of clusters. In general, domain knowledge in the form of labeled items obtained by traditional automatic clustering algorithms such as K-Means can be incorporated into the visual clustering process. A user without domain knowledge may fail to find the optimum clusters, since α tuning changes the data point distribution. Most automated clustering algorithms require the number of clusters to be specified in advance, which may not coincide with the real cluster distribution of the dataset. This increases the complexity of the clustering process. EVISTA reduces the complexity of clustering by eliminating the use of α. Figure 3 shows the Iris dataset visualization based on the EVISTA model. From the results, it is observed that one cluster is completely separated from the others and the visual boundaries between the other two clusters are clearly identified. It is also noticed that only two data points overlap. Since EVISTA does not use α tuning, the visual distance computation process is completely eliminated, which reduces the time complexity. EVISTA does not require domain knowledge in any form, which eases human-computer interaction, and it visualizes the exact pattern of the given dataset without human intervention.
Australian Data: The Australian dataset concerns credit card applications. This dataset is interesting because there is a good mix of attributes: continuous, nominal with small numbers of values, and nominal with larger numbers of values. This data set also has missing values. A suitable statistics-based computation is applied to fill in the missing values. It has


data visualization. EVISTA is designed with weighted vector normalization, which improves the data exploration. The elimination of α tuning in the visualization process reduces the complexity of human-computer interaction. More importantly, EVISTA does not require domain knowledge in any form, which improves its applicability. The experimental results show that EVISTA efficiently identifies the cluster distribution and reduces the complexity of the visual distance computation. Specifically, it eases human-computer interaction.

two classes. The class distribution is 44.5% for class A and 55.5% for class B. Figure 4 shows the visualization of the Australian data set in VISTA, where possibly one single cluster is observed. During α tuning, the user is able to identify the two clusters. If the α tuning is not performed carefully, the user may get a different pattern, which may lead to confusion. Figure 5 shows the process of α tuning, where a four-cluster distribution is observed. This leads to poor cluster quality. In such a case, domain knowledge is the only aid to identify the optimum number of clusters. Figure 6 shows the cluster distribution using EVISTA, where two potential clusters are observed. Since α tuning is not included in the EVISTA model, the cluster distribution can be clearly visualized. Even when the user does not have domain knowledge in any form, such as the number of clusters or the cluster distribution, the EVISTA visualization model suitably identifies the optimum number of clusters.

Figure 4. Visualization of Australian Dataset using VISTA system

Pima Data: The Pima dataset is an Indian diabetes database with 768 data objects. It has two classes with a class distribution of 500 and 268. It consists of attributes such as number of times pregnant, plasma glucose concentration, diastolic blood pressure (mm Hg), triceps skin fold thickness (mm), diabetes pedigree function, etc. Figure 7 shows the VISTA visualization of the Pima Indian dataset. When the Pima dataset is visualized using VISTA, one possible cluster is observed. Even suitable α tuning does not distinguish the clusters, and the boundary regions of the two clusters are not clearly identified. In contrast, the EVISTA visualization of the Pima dataset clearly shows two potential clusters. From Fig. 8 it is observed that the Pima dataset contains two potential clusters, with a few data objects scattered around the potential area. Since EVISTA does not require α tuning, the user may find it easy to discover the underlying pattern of the dataset without human intervention. With suitable geometric transformations such as scaling and rotation, the user may be able to observe the cluster distribution according to their visual perception.
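The scaling and rotation mentioned above can be illustrated with a small Python sketch that transforms already projected 2D points about a chosen center. This is only an assumed, generic form of the two transformations; the paper does not describe how EVISTA implements them, and the function name and parameters are hypothetical.

```python
import math


def scale_and_rotate(points, scale=1.0, angle_deg=0.0, center=(0.0, 0.0)):
    """Scale 2D points about `center`, then rotate them counter-clockwise."""
    theta = math.radians(angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    cx, cy = center
    transformed = []
    for x, y in points:
        # Move to the center, scale, rotate, then move back.
        dx, dy = scale * (x - cx), scale * (y - cy)
        transformed.append((cx + dx * cos_t - dy * sin_t,
                            cy + dx * sin_t + dy * cos_t))
    return transformed


if __name__ == "__main__":
    projected = [(1.0, 0.0), (0.5, 0.5), (-1.0, 0.25)]  # illustrative 2D points
    for point in scale_and_rotate(projected, scale=2.0, angle_deg=45.0):
        print(point)
```

Both operations preserve the relative arrangement of the points, so they change how clearly a cluster boundary can be seen on screen without changing which points lie close to each other.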

Figure 5. Visualization of Australian Dataset using VISTA system with α tuning

Figure 6. Visualization of Australian Dataset using EVISTA

C. Comparative Analysis
This part of the section compares the results of EVISTA with VISTA and the centroid-based automatic clustering algorithm K-Means. In EVISTA, cluster labeling is performed using free-hand drawing. The area containing potential data points is covered by a convex hull, and the data points inside the convex hull are labeled as one single cluster. The clustering results, evaluated using the Rand index and the Jaccard coefficient, are shown in Table II and Table III. The results of VISTA are obtained by conducting the experiments over several runs and taking their average.
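The two validity measures used in this comparison can be sketched in Python with the standard pair-counting definitions: every pair of data objects is classified by whether the two objects share a class in the ground truth and whether they share a cluster in the result. The paper does not spell out these formulas, so the standard definitions are assumed here, and the function names are hypothetical. The functions return fractions in [0, 1]; Tables II and III appear to report percentages, which would correspond to multiplying by 100.

```python
from itertools import combinations


def pair_counts(truth, predicted):
    """Count object pairs by agreement between truth and predicted labels."""
    if len(truth) != len(predicted):
        raise ValueError("label lists must have the same length")
    if len(truth) < 2:
        raise ValueError("at least two data objects are required")
    ss = sd = ds = dd = 0
    for i, j in combinations(range(len(truth)), 2):
        same_truth = truth[i] == truth[j]
        same_pred = predicted[i] == predicted[j]
        if same_truth and same_pred:
            ss += 1  # same class, same cluster
        elif same_truth:
            sd += 1  # same class, different clusters
        elif same_pred:
            ds += 1  # different classes, same cluster
        else:
            dd += 1  # different classes, different clusters
    return ss, sd, ds, dd


def rand_index(truth, predicted):
    """Fraction of pairs on which the two labelings agree."""
    ss, sd, ds, dd = pair_counts(truth, predicted)
    return (ss + dd) / (ss + sd + ds + dd)


def jaccard_coefficient(truth, predicted):
    """Like the Rand index, but ignoring pairs separated in both labelings."""
    ss, sd, ds, _ = pair_counts(truth, predicted)
    if ss + sd + ds == 0:
        # Only singletons in both labelings: treated as full agreement here,
        # a convention chosen for this sketch.
        return 1.0
    return ss / (ss + sd + ds)


if __name__ == "__main__":
    truth = [0, 0, 1, 1]      # illustrative ground-truth classes
    predicted = [0, 0, 1, 2]  # illustrative cluster labels
    print(rand_index(truth, predicted), jaccard_coefficient(truth, predicted))
```

Both measures depend only on which objects are grouped together, not on the label values themselves, so cluster labels assigned by free-hand drawing can be compared with class labels directly.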

Figure 7. Visualization of Pima Dataset using VISTA system

Figure 8. Visualization of Pima Dataset using EVISTA system

V. CONCLUSION
With the development of data collection technology, effective data visualization models are required to understand the pattern of multidimensional and multivariate data. In this paper, Enhanced VISTA is proposed to gain improvement in

ACKNOWLEDGMENT
The first author expresses his thanks to the University Grants Commission for financial support (F-No. 34-105/2008, SR).

TABLE II. COMPARISON OF EVISTA WITH VISTA AND K-MEANS BASED ON RAND INDEX

                 Visual Clustering                                    Traditional Clustering
Dataset          VISTA without α tuning   VISTA with α tuning   EVISTA   K-Means
Iris             94.24                    96.72                 97.03    88.20
Breast Cancer    62.33                    66.60                 64.12    58.31
Hepatitis        64.00                    68.00                 65.45    61.71
Bupa             49.00                    50.44                 51.51    50.58
Pima             50.01                    50.01                 54.56    55.13
Australian       63.46                    68.00                 50.61    50.71

REFERENCES
[1] Bhavani Thuraisingham, "Data Mining: Technologies, Techniques, Tools and Trends", CRC Press, London, New York, Washington, 1999.
[2] Cook, D.R., Buja, A., Cabrea, J., and Harley, H., "Grand Tour and Projection Pursuit", J. Computational and Graphical Statistics, v23, 1995.
[3] Daxin Jiang, Chun Tang, Aidong Zhang, "Cluster analysis for gene expression data: a survey", IEEE Transactions on Knowledge and Data Engineering, Vol. 16, No. 11, 2004.
[4] Daniel A. Keim and Hans-Peter Kriegel, "Visualization Techniques for Mining Large Databases: A Comparison", IEEE Transactions on Knowledge and Data Engineering, Vol. 8, No. 6, pp. 923-937, 1996.
[5] J. Han and M. Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, August 2000, ISBN 1-55860-489-8.
[6] A. K. Jain, M. N. Murty and P. J. Flynn, "Data clustering: A Review", ACM Computing Surveys, 1999.
[7] E. Kandogan, "Visualizing Multi-dimensional Clusters, Trends, and Outliers using Star Coordinates", Proc. of ACM KDD, 2001.
[8] Keke Chen and L. Liu, "VISTA: Validating and Refining Clusters via Visualization", Information Visualization, Vol. 3, No. 4, pp. 257-270, 2004.
[9] Keke Chen and L. Liu, "iVIBRATE: Interactive Visualization-Based Framework for Clustering Large Datasets", ACM Transactions on Information Systems, Vol. 24, April 2006, pp. 245-294.
[10] Marie desJardins, James MacGlashan, Julia Ferraioli, "Interactive visual clustering", Intelligent User Interfaces 2007, pp. 361-364.
[11] Melanie Tory and Torsten Moller, "Human Factors in Visualization Research", IEEE Transactions on Visualization and Computer Graphics, 10(1), 2004.
[12] Pang-Ning Tan, Michael Steinbach and Vipin Kumar, "Introduction to Data Mining", Pearson Addison Wesley, Boston, 2006.
[13] O. Sourina and D. Liu, "Visual interactive 3-dimensional clustering with implicit functions", Proceedings of the IEEE Conference on Cybernetics and Intelligent Systems, Vol. 1, 1-3 Dec 2004, pp. 382-386.
[14] K. Thangavel and D. Ashok Kumar, "Optimization of code book in Vector Quantization", International Journal Annals of Operations Research, Vol. 143, No. 1, pp. 317-325, 2006.
[15] Ye N., "The Handbook of Data Mining", Lawrence Erlbaum Associates, Publishers, Mahwah, New Jersey, 2003.
[16] Zhen Liu, Shinichi Kamohara, Minyi Guo, "A Scheme of interactive Data Mining Support System in Parallel and Distributed Environment", ISPA 2003, LNCS 2745, Springer-Verlag, pp. 263-272, 2003.

TABLE III. COMPARISON OF EVISTA WITH VISTA AND K-MEANS BASED ON JACCARD COEFFICIENT

                 Visual Clustering                                    Traditional Clustering
Dataset          VISTA without α tuning   VISTA with α tuning   EVISTA   K-Means
Iris             84.24                    92.72                 92.03    70.09
Breast Cancer    52.33                    59.60                 55.12    52.31
Hepatitis        57.35                    58.00                 64.31    59.05
Bupa             39.00                    40.44                 51.27    45.55
Pima             40.01                    40.01                 54.56    45.84
Australian       48.82                    52.22                 50.61    50.41

Figure 9. Comparison based on Rand Index

Figure 10. Comparison based on Jaccard coefficients
