Parallel Dual Visualization of Multidimensional Multivariate Data

Proceedings of the 2007 IEEE International Conference on Integration Technology March 20 - 24, 2007, Shenzhen, China

Yonghong Xu, Wenxue Hong, Xin Li, Jialin Song
Department of Biomedical Engineering, University of Yanshan, Qinhuangdao, Hebei Province, China

Abstract - A visualization technique based on the ideas of the scatter plot matrix, star diagrams, and the parallel coordinates plot is developed in this paper. The principles and algorithms of these visualizations are analyzed and compared from a geometrical perspective. The analysis shows that the three coordinate-based geometrical visualizations share a unified mathematical foundation and similar ways of presentation, and can therefore be combined into a single visualization that we call the parallel dual plot. The new visualization is created by first transforming a scatter plot into a star glyph and then presenting the star glyph in parallel coordinates. This approach provides a point-to-point mapping and effectively overcomes the over-plotting problem of parallel coordinates. Moreover, the technique has the merits of a simple algorithm and easy interpretation. An application example is demonstrated at the end of this paper.

Index Terms – multidimensional multivariate visualization, scatter plot matrix, star diagrams, parallel coordinates, duality

I. INTRODUCTION

Information visualization has a long history, but it has only recently become an independent discipline. There are various definitions of information visualization, one of which is the use of interactive visual representations of abstract, non-physically based data to amplify cognition [1]. There are a number of well-known techniques for visualizing multidimensional multivariate data sets, such as the scatter plot matrix, star diagrams, and parallel coordinates [2]. With the development of computer graphics theory and interaction technology, these static graphical presentations are gradually being turned into interactive, dynamic, multifunctional visualization systems. In recent years there has been a tendency toward cooperation and fusion between information visualization and other disciplines such as data mining, machine learning, and statistics [3]. The concept of visual analytics has recently been put forward and is expected to be applied to important domains such as national security and environmental monitoring [4].

Although visualization techniques have been strengthened greatly in recent years, both in theory and in practical applications, a few problems remain to be overcome. One obvious obstacle is that many visualization techniques lack the necessary theoretical foundations. The relationships among these visualizations, and their merits and disadvantages, have not been systematically and fully explored. It is widely recognized that no single visualization can deal with all data types and application needs. Therefore, choosing a proper visualization and switching between visualizations is an important task. This paper focuses on three kinds of visualization techniques (scatter plot matrix, star diagrams, and parallel coordinates), which are discussed and interpreted from a holistic and systematic perspective. The three visualizations can be unified and transformed into one another by geometrical transformations. We discuss the three techniques, and the relationships between them, in a three-dimensional design space. Guided by this analysis, we propose a new visualization that we call the parallel dual plot. The algorithm and an application example of the parallel dual plot are also given.

The remainder of this paper is organized as follows. In Section II, we provide an overview of the three coordinate-based presentation methods for multidimensional multivariate data (scatter plot matrix, star diagrams, and parallel coordinates). The design space of coordinate-based visualization techniques is described in Section III, together with the geometrical principles and methods of coordinate-based visualization and the layout and linking of these techniques. The algorithm of the proposed parallel dual plot is presented in Section IV. We present an application example in Section V. Finally, we draw the main conclusions in Section VI.

1-4244-1092-4/07/$25.00 © 2007IEEE.

II. THREE TRADITIONAL VISUALIZATION TECHNIQUES

After several decades of development, a number of methods have been proposed for visualizing multidimensional multivariate data sets. Among them, the scatterplot matrix, star diagrams, and parallel coordinates are well known and widely applied in various domains. They are supported in many data analysis systems such as MATLAB, S-PLUS, and R. These visualization tools seem very simple, and their underlying mathematical bases have long been overlooked. Although the duality properties of parallel coordinates were recognized long ago [5][6], many people do not consider them important. There are many papers about the applications of parallel coordinates or star diagrams; however, few of them pay enough attention to their principles and properties. In this section, we give a brief review of the three popular multidimensional multivariate visualizations (scatter plot matrix, star diagrams, and parallel coordinates) and point out that the three coordinate-based geometrical visualizations have a unified mathematical foundation and similar ways of presentation, and can therefore be combined into a single visualization.

A. Scatter plot and Scatterplot matrix

The 2D scatterplot is a classic technique for displaying the relationship between two variables, and is probably the most popular data mining visualization tool. It is generated by plotting points or markers whose positions are driven by two variables. It is easy to see and understand joint behaviour such as clusters, outliers, trends, and correlations exhibited by the points in the scatterplot. A typical scatter plot is drawn in Fig. 1, showing the vegetable oil data set [7] as depicted in GGobi [8], with colour used as an additional dimension to reflect the different classes in the data set. Due to the linear nature of the plotting algorithm, the number of points that can be plotted is almost unbounded. Brushing and linking are often used to gain additional insights into the data. Zooming, panning, and jittering can be used to improve the visualization when too many points overlap or when the resolution of the data causes many data points to lie at the same (x, y) coordinate.

The 2D scatterplot can be extended to three, four, five, or more dimensions by animation, different colors, different shapes, interaction, and so on. Depending on the user interface, the insight into the higher dimensions is rarely as good as with the standard 2D plot. One reason is that after two dimensions are used for the x and y axes, the other dimensions (z axis, color, shape, animation, and so on) do not have an equal effect on the visualization.

A grid of 2D scatter plots is the standard means of extending the scatter plot to higher dimensions. In the scatter plot matrix, n dimensions are projected onto n × (n − 1) scatterplots, where each pair of dimensions has two scatterplots showing their relation. This is useful for looking at all possible two-way interactions or correlations between dimensions. Typically, the visualisation consists of an array of scatterplots arranged in the form of an n × n matrix M. Each dimension of the original data defines one row and one column of the matrix. The entry where row i intersects column j (scatterplot $M_{ij}$) is a scatter plot of $x_i$ versus $x_j$, where each data record is a sample with n attributes (variables). The scatterplot matrix is symmetrical: the data values for a particular dimension are shown both in the respective row on the y-axis and in the respective column on the x-axis. The scatterplot matrix of the vegetable oil data set, drawn with the same software, is shown in Fig. 2.

The standard scatterplot matrix display quickly becomes inadequate for high dimensions, and user interactions such as zooming and panning are needed to interpret the scatter plots effectively. The scatterplot matrix has the merits of easy interpretation and preservation of Euclidean distance, but it has poor layout and linking capability. There are several variations on the scatterplot matrix theme, such as HyperSlice [9] and HyperBox [10]. Although each of these techniques has some useful characteristics, other, more suitable visual data mining tools are needed.
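The panel structure described above can be made concrete with a short sketch. It is only an illustration under stated assumptions: the function names `scatterplot_matrix_panels` and `panel_points` are hypothetical, indices are 0-based, and panel (i, j) is taken to show $x_j$ on the horizontal axis and $x_i$ on the vertical axis, following the row/column convention stated in the text.

```python
from itertools import permutations


def scatterplot_matrix_panels(n):
    """Return the (row, column) index pairs of the off-diagonal panels.

    For n dimensions there are n * (n - 1) ordered pairs, so every pair
    of dimensions appears twice, once as (i, j) and once as (j, i).
    """
    return list(permutations(range(n), 2))


def panel_points(records, i, j):
    """Points drawn in panel (i, j): x_j on the x-axis, x_i on the y-axis."""
    return [(record[j], record[i]) for record in records]


if __name__ == "__main__":
    panels = scatterplot_matrix_panels(5)
    print(len(panels))                       # number of off-diagonal panels
    print(panel_points([(1, 2, 3)], 0, 2))   # one record, panel (0, 2)
```

The quadratic growth of the panel count with n is what makes the grid layout uneconomical for high-dimensional data.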


Fig. 1: Scatterplot of the vegetable oil data set

Fig. 2: Scatterplot matrix of the vegetable oil data set

B. Star diagrams

The star diagram (glyph) is probably the most widely used glyph-based visualization. In the star diagram, the dimensions are represented as equally spaced angular spokes radiating from the center of a circle. The outer end of each spoke (axis) represents the maximum value of the dimension, and the center of the circle represents the minimum value. An outer line, or perimeter, is drawn to connect the data value points on the dimensional radii. In a typical display, there is one star diagram for every n-dimensional data point. For instance, Fig. 3 shows star diagrams of some vegetable oil data samples with five variables. This method provides an overall impression of how variable values change across subjects. However, when there are too many variables and observations, star diagrams are no longer appropriate.

Fig. 3: The star diagrams of the vegetable oil data set
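As a minimal sketch of the star-glyph construction described in this subsection, the vertices of one glyph can be computed as follows. The function name and the choice of placing the first spoke along the positive x-axis are assumptions made for illustration; the source does not fix an orientation.

```python
import math


def star_glyph_vertices(values, lows, highs):
    """Vertices (x, y) of the star glyph of a single sample.

    Spoke k points in direction 2*pi*k/m.  The centre of the circle is
    the minimum of dimension k and the outer end (radius 1) its maximum.
    Joining the vertices in order, and the last back to the first,
    gives the closed perimeter of the glyph.
    """
    m = len(values)
    vertices = []
    for k, (v, lo, hi) in enumerate(zip(values, lows, highs)):
        radius = (v - lo) / (hi - lo) if hi > lo else 0.0
        angle = 2.0 * math.pi * k / m
        vertices.append((radius * math.cos(angle), radius * math.sin(angle)))
    return vertices


if __name__ == "__main__":
    for vertex in star_glyph_vertices([1.0, 0.0, 0.5, 1.0], [0.0] * 4, [1.0] * 4):
        print(vertex)
```

Samples with small values on all variables produce vertices close to the centre, which is one way to see why glyphs crowd near the origin.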

C. Parallel coordinates

In the approximately 20 years since the formal introduction of parallel coordinates into the refereed literature by A. Inselberg [5] as a device for computational geometry and by Wegman [11] for hyper-dimensional data analysis, the technique has become well known and widely used in statistics, data analysis, and visualization. The basic premise is very simple: instead of plotting data using orthogonal axes, the axes are kept parallel and the data are plotted as lines across the axes. Hence, a point in N-dimensional space is equivalent to a broken line through N parallel coordinates in this visualisation method. The parallel coordinates plot of the vegetable oil data set is shown in Fig. 4. From the structure of the resulting display, one can draw conclusions about the relationships among the corresponding data values.

The duality of points and lines in parallel coordinates has the effect that the amount of data-ink increases relative to the background, which may obscure the display. In addition, since it is common for the axes to be automatically scaled to the maximum and minimum of each variable, it is usually difficult to assess directly the correlation between any pair of variables. A further problem becomes apparent when the number of variables increases (beyond, say, 15 to 20): it is difficult to separate the axes or to associate axes with their respective variables. In fact, as the distance between axes decreases, the display becomes saturated with "data-ink" and any visible pattern is lost.

As described above, the three visualizations each have their merits and drawbacks. Up to now, the relationships between them have not been fully explored. We will investigate the three visualization techniques from a geometrical perspective and point out the possibility of uniting them in a single visualization.

Fig. 4: The parallel coordinates plot of the vegetable oil data set (horizontal axis: Coordinate; vertical axis: Coordinate Value)
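The polyline construction of parallel coordinates described in this subsection can be sketched as follows. This is an illustration only: the function name is hypothetical, axis k is assumed to sit at horizontal position k (starting from 1), and each variable is scaled to [0, 1] by its own minimum and maximum over all records, as the text says is common.

```python
def parallel_coordinates_polylines(records):
    """Return one polyline per record for a parallel coordinates plot.

    Every variable is min-max scaled over all records, and vertex k of a
    polyline is (k + 1, scaled value of variable k).  A variable that is
    constant over the data set is placed at mid-height (0.5).
    """
    columns = list(zip(*records))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]
    polylines = []
    for record in records:
        line = []
        for k, value in enumerate(record):
            span = highs[k] - lows[k]
            scaled = (value - lows[k]) / span if span > 0 else 0.5
            line.append((k + 1, scaled))
        polylines.append(line)
    return polylines


if __name__ == "__main__":
    data = [(1.0, 10.0, 5.0), (3.0, 20.0, 5.0), (2.0, 15.0, 5.0)]
    for polyline in parallel_coordinates_polylines(data):
        print(polyline)
```

Because every record contributes a full polyline, the amount of ink grows with both the number of records and the number of axes, which is the over-plotting problem discussed above.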

III. DESIGN SPACE OF COORDINATE-BASED VISUALIZATIONS

Since Descartes created orthogonal coordinates, the union of algebra and geometry has been a fundamental way of thinking in scientific research. Data objects with one, two, or three dimensions can be conveniently mapped to geometrical objects such as points, lines, and planes. Thus data analysis problems can be solved graphically and, conversely, the features of figures can be investigated by algebraic computation. However, orthogonal coordinates cannot be gracefully generalized to higher dimensions owing to the orthogonality constraint. Thus, the basic method of presenting multivariate data is to map a 2-dimensional subspace of the multidimensional data to a part of the plane (a sheet of paper or a computer screen), which we call a subplot. These multiple subplots are then linked by colors, line segments, or points to present the whole data object. The three popular visualizations (scatterplot matrix, star diagrams, and parallel coordinates) all follow this method; they differ only in the layout of the subplots, the way the subplots are linked, and the primitive used in a subplot. Other visualization techniques, such as Chernoff faces [12], follow similar thinking, but they are not coordinate-based and are not well suited to use as data analysis tools.

In order to analyze and compare multidimensional multivariate visualization techniques such as parallel coordinates, the scatterplot matrix, and star diagrams from a unified viewpoint, we construct a design space of these visualizations. The three dimensions of the design space are the linking, the layout, and the primitive. The primitive is the basic geometric expression of a data object (a data sample) in a 2D projection subplot. For example, a typical scatterplot matrix has n × n subplots and its primitive is a point. The layout is the way the multiple subplots are aligned in the same 2D plane (for example, a sheet of paper or a computer screen). The linking is the way of associating the primitives that present the same sample in multiple 2D subplots. Fig. 5 shows several visualization techniques, each as a large dot in the space spanned by the three dimensions.

The scatterplot matrix is presented in the space by the point A, with coordinates (point, grid, color): its primitive is a point, its layout is a grid, and it typically uses the same color or the same glyph to link the same sample across subplots. The scatterplot matrix is easy to interpret and makes the bivariate relationships in the data easy to see. However, its layout is not economical owing to its grid alignment. Moreover, linking through color does not readily show the trend from one dimension to another. In other words, it does not present a data sample in $R^m$ as a single object. In parallel coordinates or a star diagram, by contrast, the subplots are gracefully linked together and a data sample in $R^m$ is presented by a polygonal line (in parallel coordinates) or a polygon (in a star diagram).

The parallel coordinates plot is presented by the point C in Fig. 5, with coordinates (line, parallel, point), listed in the same order (primitive, layout, linking) as for A. Because the linking is by a point and the primitive is a line, the data presentation cost (data ink) is large and the linking cost (linking ink) is small. The layout of parallel coordinates is parallel, so the samples can be distributed in the space more evenly and more dimensions can be displayed simultaneously. However, because a line is used as the primitive, the over-plotting problem becomes severe as the number of samples increases, which weakens its capability as a visual data mining tool. Moreover, domain experts who are not familiar with the duality principle of the point and the line tend to be puzzled, and find it hard to use the visualization to amplify cognition.
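The design space can also be written down as a small data structure, which makes the comparisons in this section mechanical. The sketch below is illustrative only: the class and function names are hypothetical, the entries for A and C follow the paragraphs above, and the entries for B (star diagrams) and D (the proposed plot) follow the characterization given in the text that comes next.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DesignPoint:
    """One visualization as a point in the (primitive, layout, linking) space."""
    primitive: str
    layout: str
    linking: str


DESIGN_SPACE = {
    "A": DesignPoint("point", "grid", "color"),               # scatterplot matrix
    "B": DesignPoint("line", "circular", "point"),            # star diagrams
    "C": DesignPoint("line", "parallel", "point"),            # parallel coordinates
    "D": DesignPoint("point", "parallel", "line or color"),   # parallel dual plot
}


def differing_axes(p, q):
    """Names of the design-space axes on which two visualizations differ."""
    axes = ("primitive", "layout", "linking")
    return [axis for axis in axes if getattr(p, axis) != getattr(q, axis)]


if __name__ == "__main__":
    print(differing_axes(DESIGN_SPACE["C"], DESIGN_SPACE["B"]))
```

Comparing C with B reports only the layout axis, which mirrors the statement that star diagrams resemble parallel coordinates except for their circular layout.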

The star diagram is presented by the point B. It clearly has great similarity with parallel coordinates, except that its layout is circular, so a data sample is presented by a closed polygonal line (a star glyph). The circular layout also has the problems that the area between two coordinate axes is not uniformly distributed and that samples located near the origin are more apt to be cluttered. The new visualization technique we propose is presented in the design space as the point D: its primitive is a point, and the linking can be done by line segments or by the same color. This technique differs from the scatterplot matrix in its parallel layout and differs from parallel coordinates in its use of a point-to-point mapping. Its most useful characteristic is that it can fuse the three visualization techniques in the same plot; with the help of human-computer interaction and dynamic graphical techniques (such as brushing, fading, and animation), the capability for visual data mining could be greatly strengthened.

The principles, plotting methods, and transformations of these four visualizations are summarized in Fig. 6. There are two points (A and B) in a scatterplot, as shown in Fig. 6(b). It can be transformed to Fig. 6(c) by an affine transformation. Fig. 6(c) is then transformed to Fig. 6(f) by a point-to-line transformation, and the points A and B are mapped to the lines a' and b'. The lines a' and b' are exactly the star diagrams of the points A and B in Fig. 6(b). Fig. 6(b) can also be mapped to Fig. 6(d) through a duality transformation: the points A and B in Fig. 6(b) are mapped to the lines $\bar{A}$ and $\bar{B}$.

Fig. 5: The design space of several visualization techniques (axes: primitive with values point and line; layout with values grid, circular, and parallel; linking with values color, line, and point)

From a geometrical perspective, we can conclude that these different multivariate visualizations have a common mathematical foundation. Through different geometrical transformations and different layout and linking strategies, simple multiple 2D projection scatter plots can be combined in one plane as a useful visualization. This idea motivates our development of the parallel dual plot. The parallel dual plot is created by first using a point-to-line transformation to map the points to lines and then using another dual transformation to map these lines to points. Thus, a point-to-point mapping is obtained, as shown in Fig. 6(e), and consequently the over-plotting problem of parallel coordinates is effectively overcome.

Fig. 6: The relationship and transformation of coordinates-based visualization (panels (a) to (f))

It is interesting that although the four multivariate visualizations (parallel coordinates, scatterplot matrix, star diagrams, and parallel dual plot) are similar in principle, they differ in appearance and flavor in actual application. A search on Google suggests that the parallel coordinates plot is the most popular technique in the visual data analysis domain (the scatter plot is certainly used everywhere, but not the scatterplot matrix). Perhaps one of the major reasons is that the other techniques lack a mathematical foundation comparable to that of the parallel coordinates plot. Our parallel dual plot is founded on the parallel coordinates plot; consequently it inherits the strengths of parallel coordinates and has a well-established mathematical foundation. On the other hand, this plot can be regarded as an adaptation and improvement of the traditional scatterplot matrix and star diagram.

IV. ALGORITHM OF PARALLEL DUAL VISUALIZATION

In parallel coordinates (abbr. ||-coords), a point in $R^2$ is represented by a line and a line is represented by a point, yielding a fundamental point ↔ line duality [5][11]. The p-flats (planes of dimension $2 \le p \le N - 1$) in $R^N$ are represented by indexed points [13]. For non-linear objects, the presentation is more complex and apt to clutter the display. Zur Izhakian and Inselberg [14] proposed mapping the tangents of the original curve into points to obtain a "point-curve" instead of a "line-curve". We propose a new approach that also provides a point-to-point mapping. It has an intimate relation with the star glyph, and it can be constructed either by an algorithm or by geometrical construction from the parallel coordinates plot.

Our algorithm for the parallel dual plot is as follows. For a point $x \in R^m$, $x = (x_1, x_2, x_3, \ldots, x_m)$:

Step 1: normalization

$$x_i = \frac{x_i - \min(x_1, x_2, \ldots, x_m)}{\max(x_1, x_2, \ldots, x_m) - \min(x_1, x_2, \ldots, x_m)} + \varepsilon, \quad \varepsilon > 0, \quad i = 1, 2, \ldots, m \tag{1}$$

The parameter $\varepsilon > 0$ (a typical value is 0.1) is used to provide an offset from zero. The normalization step restricts the final resulting points to lie between the two parallel lines.

Step 2: point-to-line transformation

Define $x_{i,j} = (x_i, x_j)$, $i = 1, 2, \ldots, m$, $j = 1, 2, \ldots, m$, $i \ne j$.

Map the point $x_{i,j}$ to the line

$$L:\ y = -\frac{x_j}{x_i}\,x + x_j \tag{2}$$

Step 3: line-to-point transformation

In natural homogeneous coordinates, $L$ is represented by the triple $\left(-\frac{x_j}{x_i},\ -1,\ x_j\right)$, and the line is mapped to the point

$$\bar{L}:\ \left(\frac{x_i}{x_i + x_j},\ \frac{x_i x_j}{x_i + x_j},\ 1\right) \tag{3}$$

Step 4: layout and linking

Every $\bar{L}$ is placed at

$$\bar{L} = \left(\frac{x_i}{x_i + x_j} + i - 1,\ \frac{x_i x_j}{x_i + x_j},\ 1\right),$$

so the multiple 2D projection views of the point $x \in R^m$ are aligned in parallel. These points can be linked by color or by lines to expose multivariate relationships.

The above algorithm has a simple geometrical construction. Moreover, the resulting visualization has the attractive property of combining the popular multidimensional multivariate visualization techniques in the same view. With the help of interaction and dynamic graphical techniques, it is expected to benefit the visual mining of massive and high-dimensional data.

As shown in Fig. 7, a point $C: (x_0, y_0)$ in the Cartesian coordinate plane is presented in the parallel coordinates plot by the line joining the points AC and BC. The point B is presented by the line b and the point A by the line a. Because of the duality characteristic of the parallel coordinates plot, the line joining A and B is presented by the point AB (the intersection of the line a and the line b). By the mapping characteristic of parallel coordinates, the coordinates of AB are $\left(\frac{x_i}{x_i + x_j},\ \frac{x_i x_j}{x_i + x_j}\right)$. Consequently, the point AB is the parallel dual plot presentation of the point C.

V. APPLICATION EXAMPLE

As a practical example, the parallel dual plot visualization for visual data mining of the vegetable oil data [7][15] is shown in Fig. 8. The parallel dual plot without linking is displayed in Fig. 8(a). Different colours are used for the different classes in the data. It serves a function similar to that of the scatterplot matrix, because joint behaviour such as clusters, outliers, trends, and correlations is easily exhibited by the points. However, the parallel dual plot is superior to the classical scatterplot matrix in its mathematical meaning, easy linking, economical layout, and relationship with parallel coordinates. As indicated in Fig. 8(b), the points can easily be linked with line segments, in the manner of parallel coordinates, to expose the multivariate relationships and trends between variables, so that each data sample is presented by a polygonal line. At the same time, the scatter plots of the data samples are retained, and the similarity between them is easier to interpret.
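A minimal sketch of how Steps 1 to 4 of Section IV might be applied to a data matrix to obtain point positions of the kind plotted in Fig. 8(a) is given below. It is not the authors' implementation, and it rests on several labeled assumptions: all function names are hypothetical; the minimum and maximum in Eq. (1) are taken per variable over all records; only adjacent pairs (x_i, x_{i+1}) are plotted, one pair per unit-wide strip; and the two parallel axes of each strip are one unit apart.

```python
def normalize(records, eps=0.1):
    """Step 1, Eq. (1): min-max scaling plus an offset eps > 0.

    ASSUMPTION: min and max are taken per variable over all records.
    """
    columns = list(zip(*records))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]
    scaled = []
    for record in records:
        row = []
        for value, lo, hi in zip(record, lows, highs):
            span = hi - lo
            row.append(((value - lo) / span if span > 0 else 0.0) + eps)
        scaled.append(row)
    return scaled


def dual_point(xi, xj):
    """Steps 2 and 3: (xi, xj) -> line y = -(xj/xi) x + xj -> point of Eq. (3)."""
    total = xi + xj
    return (xi / total, xi * xj / total)


def parallel_dual_plot(records, eps=0.1):
    """Step 4: dual points of adjacent variable pairs, one strip per pair.

    ASSUMPTION: j = i + 1.  The 0-based loop index i plays the role of
    the shift (i - 1) that Step 4 adds to the first coordinate.
    """
    plot = []
    for row in normalize(records, eps):
        points = []
        for i in range(len(row) - 1):
            u, v = dual_point(row[i], row[i + 1])
            points.append((u + i, v))
        plot.append(points)
    return plot


def pc_intersection(p, q):
    """Intersection of the parallel-coordinates lines of Cartesian points p and q.

    With axes at x' = 0 and x' = 1, a point (a, b) is drawn as the line
    through (0, a) and (1, b).
    """
    slope_p = p[1] - p[0]
    slope_q = q[1] - q[0]
    if slope_p == slope_q:
        raise ValueError("the two lines are parallel")
    t = (q[0] - p[0]) / (slope_p - slope_q)
    return (t, p[0] + slope_p * t)


if __name__ == "__main__":
    data = [(0.0, 10.0, 2.0), (1.0, 20.0, 4.0)]
    for points in parallel_dual_plot(data):
        print(points)
```

The helper `pc_intersection` offers a geometric check of Eq. (3): the points (x_i, 0) and (0, x_j) both lie on the line of Eq. (2), and the lines that represent them in parallel coordinates intersect at the same position that `dual_point` returns. With eps = 0.1, the second coordinate x_i x_j / (x_i + x_j) ranges from 0.05 (both values at their minimum) to 0.55 (both at their maximum).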

Fig. 7: The duality between parallel coordinates and orthogonal coordinates.

If Cartesian coordinate axes X'Y' are superimposed on the Px, Py parallel axes so that the Py parallel axis has the equation x' = 1, the Cartesian coordinates of AB can be computed by plane analytic geometry or by the point-line mapping of parallel coordinates.

Fig. 8: The parallel dual plot of vegetable oils data. (a) Parallel dual plot without linking. (b) Parallel dual plot with linking.

As described above, the parallel dual plot has the merit of an intimate relation with parallel coordinates. The parallel dual plot and the parallel coordinates plot of the vegetable oil samples are displayed together in Fig. 9.

Fig. 9: Parallel dual plots and parallel coordinates in the same view

The two visualization techniques are complementary in their merits and disadvantages, and with the help of human-computer interaction and dynamic graphical techniques (such as brushing, fading, and animation), the combined visualization can be used as a promising visual data mining tool for multidimensional multivariate data. It is interesting to note that the polygonal lines of the parallel coordinates presentation lie in the upper part of the view and the points of the parallel dual plot lie in the lower part (because of the geometrical relationship between the two visualizations), so the two visualizations do not tend to clutter each other in the same view.

VI. CONCLUSIONS

In order to overcome the over-plotting and difficult interpretation of parallel coordinates, a new visualization technique called the parallel dual plot is developed in this paper. The visualization is founded on the common theoretical basis of the scatterplot matrix, the star glyph, and parallel coordinates. These visualization techniques are analyzed and compared in a design space with the dimensions linking, layout, and primitive. The analysis shows that the three classical multivariate visualizations have a common mathematical foundation and can be transformed into one another. Consequently, a visualization based on the duality principle and geometrical transformation is created, which has the advantage of easy interpretation through a point-to-point mapping; its points are easily linked by color or lines to display multivariate relationships. Furthermore, parallel coordinates and the parallel dual plot can coexist in the same view. The switch between these techniques is smooth and graceful because they are related by a simple geometrical relationship.

Some work remains for the future. The effectiveness and characteristics of the parallel dual plot for data mining of massive and high-dimensional data will be investigated more deeply. Dynamic graphical and interaction techniques (such as the grand tour, projection pursuit, and animation) will be applied to the parallel dual plot to enhance its capability as a visual data mining tool. Moreover, because of its coordinate-based features, we will apply it to visual analytics and geometry-based pattern recognition in the near future.

ACKNOWLEDGMENT

This work is supported by the National Science Foundation of China (No. 60605006). Thanks also go to those whose previous work provided the basis of our method and inspired our ideas.

REFERENCES

[1] S. K. Card, J. D. Mackinlay and B. Shneiderman. Information Visualization: Using Vision to Think. Morgan-Kaufmann, San Francisco, California, 1999.
[2] Pak Chung Wong and R. Daniel Bergeron. "30 years of multidimensional multivariate visualization," Scientific Visualization Overviews, Methodologies, and Techniques, pp. 3-33, IEEE CS Press, 1997.
[3] Maria Cristina Ferreira de Oliveira and Haim Levkowitz, "From visual data exploration to visual data mining: a survey," IEEE Transactions on Visualization and Computer Graphics, vol. 9, no. 3, pp. 378-393, 2003.
[4] Daniel A. Keim, George G. Robertson, Jim J. Thomas and Jarke J. van Wijk, "Guest editorial: special section on visual analytics," IEEE Transactions on Visualization and Computer Graphics, vol. 12, no. 6, November/December 2006.
[5] Inselberg, A., "The plane with parallel coordinates," The Visual Computer 1:69-91, 1985.
[6] Wegman, E. and Solka, J. (2002), "On some mathematics for visualizing high dimensional data," Sankhya, vol. 64, no. 2, pp. 429-452.
[7] Darinka Brodnjak-Voncina, Zdenka Cencic Kodbba and Marjana Novic. "Multivariate data analysis in classification of vegetable oils characterized by the content of fatty acids," Chemometrics and Intelligent Laboratory Systems, vol. 75, pp. 31-43, 2005.
[8] Swayne, D. F., Temple-Lang, D., Buja, A. and Cook, D., GGobi: Evolving from XGobi into an Extensible Framework for Interactive Data Visualization, Journal of Computational Statistics and Data Analysis. 2002.
[9] Jarke J. van Wijk and Robert van Liere. HyperSlice. In Gregory M. Nielson and R. Daniel Bergeron, editors, Proceedings IEEE Visualization '93, pages 119-125, San Jose, California, October 1993.
[10] Bowen Alpern and Larry Carter. Hyperbox. In Gregory M. Nielson and Larry Rosenblum, editors, Proceedings of IEEE Visualization '91, pages 133-139, San Diego, California, October 1991.
[11] Wegman, E., "Hyper-dimensional data analysis using parallel coordinates," Journal of the American Statistical Association, 85, 664-675.
[12] Chernoff, H. (1973). The use of faces to represent points in k-dimensional space graphically. Journal of the American Statistical Association, 68:361-368.
[13] J. Eickemeyer. Visualizing p-flats in N-space using Parallel Coordinates. Ph.D. Thesis, Dept. Comp. Sc., UCLA, 1992.
[14] Izhakian, Z. An Algorithm For Computing A Polynomial's Dual Curve In Parallel Coordinates. M.Sc. thesis, University of Tel Aviv, 2001.
[15] Yonghong Xu, Wenxue Hong, Jialin Song, Xin Li and Chengwei Li. "Classification of vegetable oils based on graphical presentation and bivariate discriminant node model," Sixth World Congress on Intelligent Control and Automation (WCICA), pp. 10157-10161, 2006.
