PROVEDA: A scheme for Progressive Visualization and Exploratory Data Analysis of clusters. Work in Progress Report Aaron J. Quigley & Peter Eades Department of Computer Science & Software Engineering, University of Newcastle, Callaghan NSW 2308,
[email protected] Abstract This paper presents a scale-oriented scheme for data visualization. The aim is to explore and validate the hypothesis, that a high quality visual layout exhibits a good quality hierarchical data clustering. In this scheme, the information to be visualized and clustered is represented as a graph, where the nodes represent pieces of information and the edges represent relationships between those pieces. This scheme supports three related models of information, the underlying graph structure, the graph clustered according to some geometric attributes and, the graph represented according to some drawing mechanism. Also introduced is a method for reducing the computational complexity of a graph drawing algorithm called, force directed placement, from O(n2) to O(n log n). This method is adapted from an n-body hierarchical force calculation, that allows larger data sets to be draw and visualized on various levels of abstraction. Finally, this method provides the framework for the integration and testing of numerous concepts about how the quality of layout and clustering are related. Keywords: Visualization, Clustering, Space Decomposition, N-body, Quality Measures. 1. Introduction Exploratory data analysis or EDA is a method used by researchers and statisticians from various fields of science to investigate and explore the structure of large data sets[1]. The size and complexity of these data sets has resulted in the development of techniques which can ‘average’ parts of the data set so a better insight into the overall structure of the data can be obtained. One such method of averaging or classification of the data set is called Cluster Analysis. Cluster analysis is based on the fact that in numerous fields of study it is possible that sets of objects can be partitioned into subgroups that differ in a quantifiable or distinguishable manner. 
In abstract terms, a cluster set is a rooted tree with n leaves, together with a function F from the leaf set to a set of clusters C, where

C = C₁ ∪ C₂ ∪ … ∪ Cₖ,  Cᵢ ∩ Cⱼ = ∅ for i ≠ j.

Clustering procedures attempt to generate average objects, so that the objects within a cluster are close, based on some similarity measure. The application of such a procedure to a data set, which outputs a cluster set, is referred to as clustering. The recursive application of such a procedure, where higher order average objects are formed from lower order average objects, is referred to as hierarchical clustering. The use of hierarchical clusters allows very large data sets to be classified and then reasoned about on and across various levels of abstraction.

Visualization systems that deal with large data sets employ various techniques, such as a large virtual canvas (scrolling and zooming), the information mural [24], fisheye views [14] and logical frames with trails [11], to aid the user in viewing and exploring the information presented. However, the complexity and size of these data sets mean that any visualization tool is limited in its ability to present an overall picture of the information. As in clustering, a visualization system must balance the loss of precision in presenting more abstract representations against the realization that a clearer, more abstract representation will allow greater insight into the structure of the information. This trade-off in visualization is best summed up as, "if a picture isn't worth a thousand words, the hell with it" [2].
This paper extends the notion of EDA with a visualization scheme, VEDA, that can handle the scale of large data sets. The scheme also introduces the concept of progressive refinement, where both the quality of the visual representation and the clusters formed improve over time. This scheme is called Progressive Visual Exploratory Data Analysis, or ProVEDA. The rest of this paper is organized as follows: Section 2 introduces the clustering and layout concepts that are the foundation of the scheme. Section 3 describes quality measures for clustering and layout. Section 4 presents several hierarchical space decomposition methods. Section 5 outlines force directed methods and describes a hierarchical force calculation. Section 6 outlines the architecture of the scheme. Finally, Section 7 describes future work and the further development of ProVEDA.

2. Clustering and Drawing

2.1 Graph Theoretic Clustering

Entities, or pieces of information, and their relationships can be viewed in terms of a graph structure, with nodes and edges. Graphs allow various abstract or theoretic measures of similarity to be used. Graph theoretic clustering refers to the process of forming clusters from the graph structure using a measure of similarity based on the edge distance between two nodes. For example, in the clustering literature "single-linkage clusters" are formed by successively eliminating links from a minimum spanning tree of the graph, starting with the longest link first, until some threshold is reached [5,1]. Figure 1 shows clusters formed from a single-linkage cluster analysis.
A B C D E F G H I J K L M N
A 0 1 1 0 0 0 0 0 0 0 0 0 0 0
B 1 0 1 1 0 0 0 0 0 0 0 0 0 0
C 1 1 0 1 0 0 0 0 0 0 0 0 0 0
D 0 1 1 0 0 1 1 0 0 0 0 0 0 0
E 0 0 0 0 0 0 1 0 0 0 0 0 0 0
F 0 0 0 1 0 0 1 0 0 0 0 0 0 0
G 0 0 0 1 1 1 0 1 0 0 0 0 0 0
H 0 0 0 0 0 0 1 0 1 1 0 0 0 0
I 0 0 0 0 0 0 0 1 0 0 0 0 0 0
J 0 0 0 0 0 0 0 1 0 0 1 1 0 0
K 0 0 0 0 0 0 0 0 0 1 0 0 0 1
L 0 0 0 0 0 0 0 0 0 1 0 0 1 1
M 0 0 0 0 0 0 0 0 0 0 1 1 0 1
N 0 0 0 0 0 0 0 0 0 0 1 1 1 0
Figure 1: An example of a Graph Theoretic Clustering.
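The single-linkage procedure described above can be sketched in a few lines. The following Python sketch is illustrative only (the point coordinates, the Euclidean distance measure and the threshold value are assumptions for the example, not part of the scheme); it builds the minimum spanning tree with Kruskal's algorithm and stops eliminating links once edges exceed the threshold, which yields the same components as deleting the long links afterwards:

```python
from itertools import combinations

def single_linkage_clusters(points, threshold):
    """Form single-linkage clusters: build a minimum spanning tree over
    the points, dropping every edge longer than `threshold`; the surviving
    connected components are the clusters."""
    n = len(points)
    dist = lambda a, b: ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
    edges = sorted((dist(points[i], points[j]), i, j)
                   for i, j in combinations(range(n), 2))
    # Kruskal's algorithm with union-find; skipping an edge above the
    # threshold is equivalent to deleting it from the MST afterwards.
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for w, i, j in edges:
        if w > threshold:
            break  # edges are sorted, so all remaining edges are too long
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())

# Two well-separated groups of points collapse into two clusters.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
print(single_linkage_clusters(pts, threshold=2.0))  # [[0, 1, 2], [3, 4]]
```

As the next section notes, the outcome is sensitive to the threshold: with a threshold below 1 every point here would be its own cluster, and with a very large threshold all five would merge.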
The main problem is selecting this threshold without some a priori knowledge of the data. A low threshold will result in too many clusters with few elements. A high threshold will result in large tree-like clusters (dendrograms), with the degenerate case being all the nodes collapsing into a single cluster. As an example application, graph theoretic clustering has been used in the classification of serum concentrations of alkaline phosphatase and iron in medical subjects [1].

2.2 Geometric Clustering

The various attributes of an object can be considered as the coordinates of a vector in k-dimensional space, ℜk. Vectors have both magnitude and direction and can be represented as directed line segments. A simple similarity measure based on these k-dimensional vectors can be evaluated in terms of the distance between the vector endpoints. Similarity or dissimilarity measures based on k-dimensional vector distance comparisons are referred to as geometric clustering [1,5,8]. In general, any distance metric which obeys the following conditions can be used in similarity measures for geometric clustering:

• D(x,y) ≥ 0 (distances cannot be negative)
• D(x,y) = 0 if and only if x = y
• D(x,y) = D(y,x) (distance is symmetric)
• D(x,y) + D(y,z) ≥ D(x,z) (triangle inequality)
Figure 2: An example of a Geometric Clustering.

Numerous distance measures and their specializations obey these constraints, such as the Minkowski, Manhattan, Mahalanobis and Chebyshev distances [1,5,8]. For example, the Manhattan distance is the city-block distance (x-separation + y-separation); an example of this measure is shown in Figure 3. The Chebyshev distance is the maximum of the x-separation and y-separation.
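Both of these measures are straightforward to compute. The following is a minimal Python sketch (the function names and example coordinates are illustrative); both functions satisfy the four metric conditions listed above:

```python
def manhattan(x, y):
    # City-block distance: sum of the per-coordinate separations.
    return sum(abs(a - b) for a, b in zip(x, y))

def chebyshev(x, y):
    # Maximum of the per-coordinate separations.
    return max(abs(a - b) for a, b in zip(x, y))

print(manhattan((1, 2), (4, 6)))  # 3 + 4 = 7
print(chebyshev((1, 2), (4, 6)))  # max(3, 4) = 4
```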
Figure 4: A Prototype output from ProVEDA using a graph drawing layout method.
Figure 3: Using a Manhattan distance metric to determine similarity for cluster generation.

The K-means algorithm is based on a distance metric where the attributes are scaled so that the Euclidean distance between cases is appropriate [5,9]. K-means classifies each object into exactly one partition or cluster.

2.3 Graph Drawing

The attributes of objects can be represented as vectors in ℜk. One method used to help understand this information is to construct a graphical representation, i.e. a picture. The problem in creating a picture of a graph is to assign a location to each node and a route to each edge; this is the classical graph drawing problem. The problem has been researched since graphics workstations were introduced in the 1980s; see, for example, [3,4,11,12,15,27]. Different types of objects and their relationships can be expressed in graph terms, which makes the graph a very good model for the representation of information. A good visual representation of these graphs can effectively convey information, and hence understanding, to the user, but a poor representation can confuse or, worse, mislead [2]. The challenge in graph drawing is to develop methods that produce quality information layout. The quality of a layout can be measured in some concrete terms such as node separation, edge length, uniformity of edge length, straight edge connections and edge crossings.
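Of the concrete measures just listed, edge crossings are among the easiest to count mechanically. The following Python sketch is illustrative only (the node positions and the orientation-based segment test are assumptions for the example, not a measure defined by the scheme); given straight-line node positions, it counts the pairs of edges that cross:

```python
def edge_crossings(pos, edges):
    """Count pairwise crossings between straight-line edges.
    `pos` maps node -> (x, y); `edges` is a list of node pairs."""
    def orient(a, b, c):
        # Sign of the cross product: which side of line a-b point c lies on.
        v = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
        return (v > 0) - (v < 0)
    crossings = 0
    for i in range(len(edges)):
        for j in range(i + 1, len(edges)):
            (u, v), (s, t) = edges[i], edges[j]
            if {u, v} & {s, t}:
                continue  # edges sharing an endpoint cannot properly cross
            a, b, c, d = pos[u], pos[v], pos[s], pos[t]
            if (orient(a, b, c) != orient(a, b, d) and
                    orient(c, d, a) != orient(c, d, b)):
                crossings += 1
    return crossings

# A square's two diagonals cross exactly once.
pos = {0: (0, 0), 1: (1, 0), 2: (1, 1), 3: (0, 1)}
print(edge_crossings(pos, [(0, 2), (1, 3)]))  # 1
```

A lower count is better under the "edges should not cross" aesthetic discussed in Section 3.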
Other, more abstract measures, such as how much information is conveyed and how fast the diagram presents a piece of information, are more difficult to quantify and are the subject of considerable HCI study [3,20,21,29].

3. Quality Measures

In abstract terms a quality measure, or metric, is a function that returns a value. This value can be used to compare the results of applying more than one method to a particular problem. This scheme incorporates the ability to define quality measures for the three models of information used, namely:

• Graph Theoretic Clustering Quality Measures (QMTC)
• Geometric Clustering Quality Measures (QMGC)
• Graph Layout Quality Measures (QMGL)

Defining various metrics allows comparisons to be made between different clustering approaches, algorithms and force layout models. Specifically, this scheme allows the hypothesis that a drawing with a good layout also exhibits a good clustering to be tested and validated; in abstract terms, a graph with a good QMGL also has a good QMGC. A rudimentary example of a QMTC would be to count the number of edges that are not included in any cluster. Different graph theoretic clustering approaches can then be compared in absolute terms. For regression tests or test suites these quality measures could also be extended to measure how well the graph theoretic clusters match the 'real' clusters. Given any clustering in geometric terms, various measures can be applied. One such measure is based on coupling and cohesion; it compares the inter-cluster particle distances with the intra-cluster distances. This quality metric is used to determine whether the overall clustering represents a real particle division or just an arbitrary space division.

Quality measures for a graph drawing (QMGL) are expressed in terms of how well the model and algorithm draw the graph according to a set of aesthetics. These aesthetics are not mutually exclusive and are sometimes in direct conflict. A graph drawing algorithm cannot be expected to satisfy all of the aesthetics; as a result, a representative set of aesthetics for the problem at hand is chosen, which can then be used to determine the quality of a graph drawing [29]. Some of the aesthetics used include:

• Edges should not cross when drawn.
• Edges should be drawn as straight lines.
• The area of the drawing should be minimized.
• The variance of edge lengths should be low.
• The angular resolution should be good.

A quality measure for a graph drawing can therefore use these aesthetics to determine a value. This value allows different drawing mechanisms to be compared and contrasted on the quality of the drawings they produce. This scheme introduces other, non-functional, requirements for a graph drawing method that, although not related to the quality of the output, must be considered. Some of these include:

• Speed of the method.
• Computational complexity.
• Ability to scale to larger layouts.
• Reduction of visual complexity.

4. Hierarchical Space Decomposition

Space decomposition techniques have been used in numerous fields for representing spatial data.
Figure 5: Hierarchical regular space decomposition (quad-tree).
For example, in image processing a commonly used data structure is the quad-tree, which uses a regular decomposition to describe a two dimensional image [10]; an example is shown in Figure 5 with an area indicated. A segment of the corresponding data structure, with the corresponding leaf, is shown in Figure 6. The data structure is labeled in relation to the geometric area each node represents: SouthEast (SE), SW, NE and NW. Three dimensional space decompositions are also used to represent geometric objects, from fluid dynamics to collision detection systems. One such decomposition is based on oct-trees, where the space is recursively decomposed into cubes, also referred to as voxels [17,10,27].
Figure 6: A segment of the data structure with a particular node indicated.
Figure 7: A regular triangular space decomposition, the indicated nodes are at the fifth level of the decomposition tree.
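A regular quad-tree decomposition like that of Figure 5 can be sketched as a short recursive routine. The following Python sketch is illustrative only (the dictionary node layout and the capacity and depth limits are assumptions, not the scheme's data structure); it subdivides a square region into NW, NE, SW and SE quadrants until each leaf holds at most a fixed number of points:

```python
def build_quadtree(points, bounds, capacity=1, depth=0, max_depth=8):
    """Recursively decompose a square region (x0, y0, size) into four
    quadrants until each leaf holds at most `capacity` points."""
    x0, y0, size = bounds
    if len(points) <= capacity or depth == max_depth:
        return {"bounds": bounds, "points": points}  # leaf node
    half = size / 2
    quads = {"NW": (x0, y0 + half, half), "NE": (x0 + half, y0 + half, half),
             "SW": (x0, y0, half),        "SE": (x0 + half, y0, half)}
    children = {}
    for name, (qx, qy, qs) in quads.items():
        inside = [p for p in points
                  if qx <= p[0] < qx + qs and qy <= p[1] < qy + qs]
        children[name] = build_quadtree(inside, (qx, qy, qs),
                                        capacity, depth + 1, max_depth)
    return {"bounds": bounds, "children": children}

# Three points in an 8x8 region: one subdivision separates them.
tree = build_quadtree([(1, 1), (6, 1), (6, 6)], bounds=(0, 0, 8))
print(sorted(tree["children"]))  # ['NE', 'NW', 'SE', 'SW']
```

Viewing the internal nodes of such a tree, rather than its leaves, is exactly the visual-complexity reduction described below.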
Not all hierarchical decomposition methods have to be based on regular geometric structures. As shown in Figure 7, the space can be decomposed into non-regular areas. A method based on recursive Voronoi diagrams is proposed in this scheme. Let S be a set of n points in d-dimensional Euclidean space Ed; the points are referred to as sites. A Voronoi diagram of S divides Ed into partitions, with one partition (region) for each site. Points in the partition of a site s are closer to s than to any other site in S. For any set of connected partitions, referred to as a parent partition P′, points in the parent partition are closer to its set of sites s′ than to any other sites in S. An example is shown in Figure 8, where a tree structure describes recursive Voronoi sites.
Figure 8: A recursive Voronoi diagram. The indicated nodes are at the fifth level of the decomposition tree.

Different recursive space partition mechanisms are supported in this scheme; part of the validation will include testing the applicability of different mechanisms for both the clustering and the visualization. The space decomposition mechanisms are also used to support large-scale layout in conjunction with the mechanism presented in this scheme. It should be noted that a recursive space decomposition over a set of points P, whether regular or irregular, defines a clustering Cp of those points. These space partitions aid the development of ProVEDA for two reasons: first, they reduce the visual complexity, since a higher level in the decomposition tree can be viewed rather than the leaves; secondly, they aid in testing and validating the clustering hypothesis presented earlier. Finally, the non-functional requirements of speed and scale can be aided by using recursive space decompositions, which can reduce the computational complexity of one method in graph drawing.
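The defining property of a Voronoi partition, that every point belongs to the region of its nearest site, can be used directly to assign points to partitions. A minimal Python sketch follows (the site and point coordinates are illustrative, and computing the region boundaries themselves is not attempted):

```python
def voronoi_partition(sites, points):
    """Assign each point to the region of its nearest site, the defining
    property of a Voronoi diagram (squared distances suffice here)."""
    regions = {s: [] for s in sites}
    for p in points:
        nearest = min(sites,
                      key=lambda s: sum((a - b) ** 2 for a, b in zip(s, p)))
        regions[nearest].append(p)
    return regions

# Two sites: each point falls into the region of its closer site.
r = voronoi_partition([(0, 0), (10, 0)], [(1, 1), (9, 1), (4, 0)])
print(r[(0, 0)])   # [(1, 1), (4, 0)]
print(r[(10, 0)])  # [(9, 1)]
```

Applying the same assignment recursively, with a fresh set of sites inside each region, would yield the kind of decomposition tree shown in Figure 8.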
5. Force Directed Methods

In graph drawing one popular family of layout methods is the family of force-directed methods. These methods use the physical analogy of the graph as a system of bodies with forces acting between them [3,4,12]. Generally these methods consist of two parts: the model, which is a system of forces, and the algorithm, which provides a way to reach an equilibrium state for the system of forces. The basic difference between any two force directed methods lies in how each describes the model (i.e. sees the physical world) and the algorithm used to reach an equilibrium state. All force directed algorithms move the system from state to state, with the goal of approaching equilibrium with each transition. Since this is a simulation mechanism, the number of iterations required to bring the system to, or close to, equilibrium cannot be determined a priori; in general it is approximated. Empirical evidence has shown that the resulting graph drawings can be good. The quality measures presented in Section 3 allow the quality of different force directed methods to be quantified and compared. Various algorithms based on simulated annealing, parallel processing, constraint satisfaction [15] and electrical/spring forces [11] have been developed. The simplest method is based on a spring/electrical force model. Edges in the graph are modeled as springs, which obey Hooke's law and have a natural length l_uv. All nodes repel each other; this is modeled as an electrical repulsion, which follows the inverse square law. The overall force on any node v can be described as:
F(v) = ∑_{(u,v)∈E} f_uv + ∑_{(u,v)∈V×V} g_uv

where f_uv is the spring force exerted on v by the edge (u,v) and g_uv is the electrical repulsion between the pair u, v.
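A direct evaluation of this overall force can be sketched as follows. This is an illustrative Python sketch only (the constants k_spring, nat_len and k_rep are assumed values, not taken from the paper); the all-pairs repulsion loop is the source of the quadratic cost discussed next:

```python
def spring_electrical_forces(pos, edges, k_spring=0.1, nat_len=1.0, k_rep=1.0):
    """One evaluation of F(v) for every node: Hooke's-law springs on the
    edges plus inverse-square electrical repulsion between every node pair.
    Constants are illustrative, not values from any particular method."""
    forces = {v: [0.0, 0.0] for v in pos}
    for u, v in edges:  # spring term f_uv along each edge
        dx, dy = pos[v][0] - pos[u][0], pos[v][1] - pos[u][1]
        d = (dx * dx + dy * dy) ** 0.5 or 1e-9
        f = k_spring * (d - nat_len)  # stretched springs pull, compressed push
        fx, fy = f * dx / d, f * dy / d
        forces[u][0] += fx; forces[u][1] += fy
        forces[v][0] -= fx; forces[v][1] -= fy
    nodes = list(pos)
    for i, u in enumerate(nodes):  # repulsion term g_uv: O(n^2) pair loop
        for v in nodes[i + 1:]:
            dx, dy = pos[v][0] - pos[u][0], pos[v][1] - pos[u][1]
            d2 = dx * dx + dy * dy or 1e-9
            f = k_rep / d2  # inverse square law
            d = d2 ** 0.5
            fx, fy = f * dx / d, f * dy / d
            forces[u][0] -= fx; forces[u][1] -= fy
            forces[v][0] += fx; forces[v][1] += fy
    return forces

# Two nodes joined by a stretched spring: repulsion outweighs the spring
# at this distance, so the net force pushes them apart.
F = spring_electrical_forces({0: (0.0, 0.0), 1: (2.0, 0.0)}, [(0, 1)])
print(round(F[0][0], 3))  # -0.15
```

An algorithm would apply such an evaluation repeatedly, moving each node a small step along its net force until the system settles near equilibrium.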
The Scale Problem

The complexity of this method is O(n²), due to the node to node force calculation for the electrical repulsion. The high computational cost of previous force directed methods has limited their use to the drawing of relatively small graphs.

5.1 N-Body Methods

The problem of node to node force comparisons is not unique to graph drawing. Determining the positions of particles under motion in a system with more than two particles has no analytical solution; an answer can only be approximated using simulation (with the inherent loss of accuracy due to computational rounding). In physics, galaxy interactions, planetary motions and stellar implosions are all problems of many interacting particles under motion. This problem is known as the n-body, or many-body, problem. The force on a body is:
V = V_short + V_long + V_ex

V_short is a force function which rapidly decays over distance. V_long is a long-range force such as gravitational attraction. V_ex is an external force, independent of position or the number of particles. If computed directly, V_long results in O(n²) calculations, as the force between each pair of bodies in the system must be taken into account, and the size of n in physical simulations can be very large (10⁵–10²⁰). Directly including V_long in a force calculation therefore limits the size of the simulations (factors such as processing speed and parallel simulation also impact the size of n). A method used in n-body simulation called PIC (particle in cell) uses a regular grid to divide the entire space [16,17]. All the particles in a single grid cell are combined to form an average source density. Instead of computing each particle to particle force, a particle to cell comparison can be used (with only a small loss of accuracy). This method of reducing the O(n²) computational complexity of the electrical repulsion in the force directed method at first seems appropriate. However, PIC codes have been shown in empirical studies to have difficulty dealing with non-uniform particle distributions [17], and it is reasonable to believe that a drawing of a graph structure should result in a non-uniform distribution of nodes, where connected nodes are drawn closer together. A PIC-like graph drawing algorithm for a force directed approach is discussed in [4]. PIC codes can only reduce the computational complexity of the graph drawing problem; they do not address the visual complexity problem when dealing with large data sets [27]. Hierarchical schemes take advantage of the fact that particles interact strongly with their nearest neighbours, while less detailed information suffices for long-range interactions.
Appel developed codes based on neighbour lists to exploit this fact, thereby reducing the number of particle to particle comparisons and hence the overall complexity [17]. Barnes and Hut introduced the notion of recursive space decomposition, where a tree is built for each new step in the simulation [18]. The tree structure is used in a systematic way to determine whether a particle to sub-tree comparison is sufficient for the force calculation, based on some distance measure.
Developing the measure of distance in abstract terms rather than absolute distances allowed Barnes and Hut to rigorously prove an O(n log n) complexity. Unlike PIC codes, tree methods automatically adjust to the particle distribution. The process of building such a regular space decomposition tree is described in [17]. When determining the forces, a mathematical measure of whether to use a particle to particle comparison or a particle to sub-tree (cell) comparison must be computed. One such measure is shown in Figure 9: if l/d < Θ, then a particle to sub-tree force calculation is used. Note that as Θ tends to 0 the number of comparisons tends to O(n²), and as Θ tends to ∞ the number of comparisons tends to O(n). The more particle to particle comparisons are made, the more the quality (in physical simulation terms) improves.
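The l/d < Θ test is applied during a recursive walk of the decomposition tree. The sketch below is illustrative only (the dictionary node layout with 'size', 'com', 'mass' and 'children' fields, and the default Θ, are assumptions); it collects one interaction term per accepted cell instead of one per particle:

```python
def accumulate_force_terms(node, particle, theta=0.5):
    """Barnes-Hut style traversal sketch: if l/d < theta, treat the whole
    cell as a single body at its center of mass; otherwise descend into
    the children. `l` is node["size"], `d` the distance to node["com"]."""
    dx = node["com"][0] - particle[0]
    dy = node["com"][1] - particle[1]
    d = (dx * dx + dy * dy) ** 0.5 or 1e-9
    if not node.get("children") or node["size"] / d < theta:
        return [(node["mass"], node["com"])]  # one cell-level interaction
    terms = []
    for child in node["children"]:
        terms += accumulate_force_terms(child, particle, theta)
    return terms

# A 4x4 cell with two leaf sub-cells: a distant particle interacts with
# the cell as a whole, a nearby particle with each leaf separately.
root = {"size": 4.0, "com": (2.0, 2.0), "mass": 3.0, "children": [
    {"size": 2.0, "com": (1.0, 1.0), "mass": 2.0, "children": []},
    {"size": 2.0, "com": (3.0, 3.0), "mass": 1.0, "children": []}]}
print(len(accumulate_force_terms(root, (100.0, 0.0))))  # 1
print(len(accumulate_force_terms(root, (2.5, 2.5))))    # 2
```

Each returned (mass, position) pair would then feed a single force evaluation, which is how the all-pairs O(n²) cost is avoided.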
Figure 9: Using a tree structure to determine a measure of closeness, where l is the width of the cell and d is the distance from the particle to the cell's center of mass; if l/d < Θ, a particle to sub-tree comparison is used.